Unmounted NFS datastore after ESXi boots

January 17, 2014

In my home lab I recently found a problem where an NFS share I had mounted on my ESXi 5.5 host was unmounted after the host had been powered on or restarted. My home lab is not that unusual: I have an HP N54L MicroServer for shared storage, where I am using ZFS to carve up my 4x1TB disks, and a couple of Intel NUCs for compute. All of this is joined together by a Cisco SG300-10 gigabit switch/router.

I've been putting together some articles for this blog about a home lab and was curious about the performance difference between running Windows Server 2012 Standard and CentOS 6.5 with ZFS for the shared storage, so I've been adding and deleting a lot of NFS and iSCSI datastores on my ESXi hosts. But since moving to CentOS to try out ZFS, I've had a problem where one of my two compute devices will refuse to mount any of the shared VM exports. It will mount my ISO repository (which has slightly different permissions to allow anonymous SMB access), but nothing else. My other ESXi host will mount everything.

So, knowing that one host has no problems, the permissions difference between the ISO repository and the shared VM exports can be eliminated. We can also discount the NFS permissions for root and read/write access, as I am able to manually mount and use the datastores; it's only on boot that they are left unmounted:

[Screenshot: vSphere Client showing the shared VM datastores unmounted after boot]
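
For reference, a manual mount from the ESXi shell looks something like this; the IP address, export path and datastore name below are placeholders, so substitute your own:

~ # esxcli storage nfs add --host=192.168.1.10 --share=/tank/vmstore --volume-name=vmstore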

The first thing I tried was looking at /var/log/messages on the storage server. This is what I would see:

[Screenshot: /var/log/messages on the storage server showing only the ISO repository being mounted]
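
If you want to watch for the mount requests yourself, something along these lines on the storage server (CentOS in my case) will show them as they arrive:

# tail -f /var/log/messages | grep mountd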

The ESXi host is not even attempting to mount the share, which confirms there is no problem with the storage server. So, turning to a recent purchase, "Troubleshooting vSphere Storage" by Mike Preston (it was on offer at the start of the year and is a very good read), the section about troubleshooting NFS datastores tells us to look in the hostd logs. Let's take a look:

[Screenshot: hostd.log showing ESXi unable to resolve the hostname "lb-fs-01"]
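
If you prefer the shell, a quick way to pull the relevant entries out of that log from the ESXi console is something like this (the grep terms are just a suggestion):

~ # grep -iE "nfs|resolve" /var/log/hostd.log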

“lb-fs-01” was the name of my Windows 2012 server (and is the hostname of the current storage server), but it’s not in DNS. In addition, all the datastores I’ve set up with the CentOS box have been defined by IP address:

[Screenshot: the four datastores, all defined by IP address, after a reboot]
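
You can also confirm from the ESXi shell how each NFS datastore is currently defined (host, share and whether it is mounted) with:

~ # esxcli storage nfs list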

So, it is strange that ESXi is trying to resolve "lb-fs-01". For the answer, we're going to need to look at the configuration file, /etc/vmware/esx.conf. This is the point in the article where I wish I had a screenshot of the offending entries, but I didn't realise that ESXi purges /etc/vmware of all additional files on boot. So, sadly, there is no picture of what was causing the problem.

To find out what NFS storage entries your ESXi host has, type the following from the console or an SSH session:

/etc/vmware # cat esx.conf | grep "/nas/"

This will display all your NFS datastore entries (and a few other lines). A working NFS entry looks like this:

[Screenshot: a working set of NFS datastore entries in esx.conf]
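
In case the screenshot isn't clear, a healthy entry is four lines per datastore, roughly like this (the datastore name, IP address and export path here are placeholders):

/nas/vmstore/enabled = "true"
/nas/vmstore/host = "192.168.1.10"
/nas/vmstore/readOnly = "false"
/nas/vmstore/share = "/tank/vmstore"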

There are four lines defining each datastore; however, the entries pertaining to lb-fs-01 didn't have all of them, and I'm only sure of "host" and "share" being present. I would imagine "enabled" was the line that was missing.

Unlike me, you should take a backup of the file via WinSCP or FastSCP, and not just take a copy on the local ESXi file system. Once you've got a backup, delete the lines for the non-existent export/datastore and restart ("# reboot"). When the host came back up, I added one shared VM datastore, restarted again, and whilst booting this is what I saw on the storage server:

[Screenshot: /var/log/messages on the storage server showing the mount requests arriving successfully]

Success :). After logging into the vSphere Client, I could confirm:

[Screenshot: vSphere Client showing the datastore mounted after the reboot]

I have no idea why those entries were still in my esx.conf file. I deleted unused and non-existent datastores through the vSphere Client in the normal way and don't recall any errors. It's very strange how or why this occurred.

That being said, there is one main lesson: when backing up files from ESXi, use WinSCP or FastSCP 🙂
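
If you don't have either of those to hand, pulling the file off the host with scp from a Linux or Mac workstation works just as well, and keeps the backup safely off the ESXi file system (SSH needs to be enabled on the host, and the hostname below is a placeholder):

$ scp root@esxi-01:/etc/vmware/esx.conf ./esx.conf.backup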