This morning was one of those mornings IT people dread. I arrived to work and saw a long list of alerts in my email. All the alerts were from the VMware cluster complaining about various host connectivity problems. After a bit of digging I found half the VMware hosts were no longer connected to the mounts on our ZFS backup and development server.
Attempts to restore the NFS functionality failed, so I decided to reboot the server (it had been about nine months since the last reboot). Wouldn't you know it, it didn't come back up. I grabbed the laptop and headed to the server room.
When I first hooked up a KVM, I happened to catch the Solaris kernel crashing and rebooting. On the next boot it appeared to hang, but I wasn't sure. Several boot attempts and some Googling later, I got it to boot in kernel debug mode and watched it crash.
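For reference, on x86 Solaris you can load the kernel debugger at boot time by editing the kernel line at the GRUB menu. This is a rough sketch from memory, not my literal console session, and the exact multiboot path varies by release:

```shell
# At the GRUB menu, press 'e' to edit the kernel line and append -kv:
#   -k loads the kmdb kernel debugger at boot
#   -v enables verbose boot messages
kernel$ /platform/i86pc/multiboot -kv

# With kmdb loaded, a kernel panic drops into the debugger instead of
# auto-rebooting, so the panic message and stack trace stay on screen.
```

That pause at the debugger prompt is what finally let me see the crash instead of just a reboot loop.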
Still not sure whether it was a hardware or software problem, I tried booting a live CD to see if I could import the zpool. That also hung during boot. At this point I was pretty sure it was a hardware problem.
VMware to the rescue. We don't have any spare servers around here that can run Solaris, so I deployed a VM and used VMware DirectPath I/O to pass the HBA through to the VM. I reconfigured the network on the VM to match the downed system and imported the pool. Voila! All my NFS mounts were back in service.
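The recovery on the VM side boils down to a few commands. A sketch of the general shape ("tank" is a placeholder for the actual pool name, not what ours is called):

```shell
# Force-import the pool; -f is needed because the pool was last
# imported by the dead host and still carries its hostid:
zpool import -f tank

# Confirm the pool and all the SAS disks behind the HBA came in healthy:
zpool status tank

# Bring the NFS shares back. If the filesystems have the sharenfs
# property set, enabling the NFS server service is usually enough:
svcadm enable -r svc:/network/nfs/server:default
zfs get sharenfs tank
```

Because the VM took over the downed system's IP address, the VMware hosts reconnected to the same mount paths without any changes on their end.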
I spent a couple more hours configuring the VM to provide all the functionality of the downed system.
Now to fix the hardware.
I think I'll be requesting another HBA to keep in one of the VMware hosts. That way I can dual-connect the SAS pool and keep a hot-standby VM. All this for a system that was originally built as a backup server but has evolved into a production development server.