SSD storage is all the rage in big data today. It solves a lot of high-IOPS problems while introducing new challenges. Its price point is now in range to replace 15K SAS pools when large numbers of IOPS are needed. The dollar cost per IOPS has never been better.
XNAT pipelines and VMware datastores continue to push our need for IOPS, so I have been tasked with solving our IO problems with an SSD-based ZFS server build.
With the addition of SLC SSDs for ZIL and MLC SSDs for L2ARC, our existing ZFS server has done very well, with only 40 NL-SAS spindles handling 160+ VMs. But when memory pressure on the VM cluster starts pushing things to swap, it can no longer keep up, and complaints about performance start rolling in. The BlueArc and ZFS servers have not been keeping up with VM storage and build storage for the large processing jobs of the Human Connectome Project (HCP), so more IOPS are in order and SSD is our solution.
Besides IOPS, we also wanted to increase the availability of our ZFS pools and be able to maintain our servers without taking the pools down.
I had first planned to use existing heartbeat tools and write my own failover scripts. This got ruled out because of the risk that unknown bugs in the scripts could cause unplanned outages instead of increasing availability. We decided it would be better to license a well-tested and trusted solution than to risk unplanned outages. Enter RSF-1 for ZFS from High-Availability.com.
The SSD Server
It’s almost a drumbeat on the ZFS-related mailing lists: when building a pool, use only SAS drives, and if you want reliability, never use consumer SATA. That is a real problem when trying to purchase multiple terabytes of SSD on a tight budget. When consumer MLC (not TLC) costs about $1 per GB and SAS SSD costs $4+ per GB, SAS becomes hard to justify. There are lower-cost examples of each from brand X, but everything cheaper seems to come with a bad reputation for failure and wasn’t even a consideration.
One of the known problems with consumer SATA is long to extremely long times to return an error. The other problem discussed on mailing lists is reset storms when connected to a SAS expander. The long time to return an error comes from the many retries to read corrupted data on a disk platter. Enterprise drives simply return an error and let the redundant file system find the data elsewhere. Consumer drives, since there is rarely redundancy, will try their mightiest to get the data and can take an extremely long time to return an error from a read operation, stalling the entire file system. This is obviously not a problem on SSDs: there is no platter to keep trying to read, so they return the error immediately if they have a read problem. As to the reset storm problem, I have not personally witnessed one, so this risk may still be out there with SATA SSDs.
The choice of SSDs is a bit complicated. We currently have three defined purposes for this server: host our production VMs, host our development VMs and provide build space for the HCP pipelines.
For the production VMs, data loss, complete failure or downtime must be avoided, so 800GB Intel DC S3700 SSDs were chosen to build a small pool for them. We ordered 7 of them, which are back-ordered until sometime in September. Unfortunately these “enterprise” SSDs are still SATA. They were chosen because of consistent performance, life expectancy and price per GB, in spite of the SATA interface.
For development VM storage and build space the 512GB Samsung 840 Pro SSDs were chosen. We purchased 63 of them. Consumer SSDs come with some very serious thorns that will bite you when using them on a production ZFS server.
- No super-capacitor. All modern SSDs use a write cache to help with speed, wear leveling and garbage collection. The problem with consumer SSDs is that if they lose power in the middle of a write, even a ‘sync’ write, they will lose data and possibly the entire contents. Enterprise SSDs protect this ‘written’ data that is still in memory with a capacitor that reserves enough power to flush the writes to flash.
- Small to no over-provisioning. ZFS will use the entire SSD as presented to the system and does not support trim on Solaris-based distributions. This causes a big write penalty once the entire SSD has been written to. The only remedy at that point is running a secure erase to re-zero all sectors.
The lack of over-provisioning is fairly easily overcome by artificially over-provisioning: slice (partition) the drive so that only 70-90% of the available storage is used. The 840 Pro has a good garbage collection routine, and 80% provisioning works very well to maintain write performance.
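A minimal sketch of that slicing arithmetic, in Python. The function name, the 1 MiB alignment and the assumption that a nominal 512GB drive holds 512 * 10**9 bytes are illustrative choices, not details from this build; check the real capacity reported by your partitioning tool before creating slices.

```python
def slice_size_bytes(capacity_bytes: int, percent: int,
                     align_bytes: int = 1024 * 1024) -> int:
    """Size of the slice to create, rounded down to an alignment boundary.

    Integer math keeps the result exact. The 1 MiB alignment is an assumption.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    usable = capacity_bytes * percent // 100
    return usable - (usable % align_bytes)


# Nominal 512GB drive, assuming decimal gigabytes (10**9 bytes).
CAPACITY = 512 * 10**9

for percent in (70, 80, 90):
    size = slice_size_bytes(CAPACITY, percent)
    reserve = CAPACITY - size
    print(f"{percent}% slice: {size} bytes, {reserve} bytes left unallocated")
```

The space left outside the slice is never written by ZFS, so the drive's controller can treat it as spare area for garbage collection.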
The lack of super-capacitor can only be dealt with by aggressive backup policies, a UPS and luck. So no production data will reside on this pool.
Building for High Availability
Attempt 1: Connect all the SATA SSDs to the SAS expander/backplane and use a SAS switch to hand over the pool for a failover. This plan failed rather quickly when I realized that, more often than not, an Illumos-based system will panic when a SAS expander is hot-disconnected from it. The cutover time was also on the order of 30+ seconds for the SSDs, as the receiving system scanned all the newly attached drives. So this solution was not going to work for high availability.
Attempt 2: Order some interposers and determine if they play well with the SSDs. Interposers are talked about as a cheap hack on most of the mailing lists I follow, so this approach seemed a bit risky, but I knew LSI had been working on them recently, so I considered it a worthy trial. So far this gamble seems to be paying off. With interposers installed on 15 SSDs, I have two servers that can talk to them at will without a single blip. I’ve been through many performance tests and failover tests without issue.
RSF-1 for ZFS
I’ve known about RSF-1 ever since I first experimented with Nexenta, and I knew High-Availability.com sold high-availability solutions for many things besides ZFS. So I contacted them through their website and requested a trial and pricing to make sure their offering wasn’t outside our budget. They were right on target. Out of respect, since they don’t publish their pricing, I will not either.
For testing their software I set up OmniOS on our SSD server and on a VM with an HBA via hardware passthrough. We arranged a time for them to connect to our servers and do the initial install. They have prebuilt Solaris packages, so the install was rather painless. They provided a nearly complete config. All I needed to work out was the network configuration I was to use.
There were a couple hiccups, because this was a pure SSD system and I configured the pool in a non-standard way.
RSF-1 uses several strategies to determine when to fail over and how to fail over a ZFS pool safely. It can use any combination of network, serial and disk-based heartbeats to determine whether node members are alive. It also uses SCSI reservations on the disks to prevent dual-headed ZFS pools.
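To illustrate why multiple heartbeat channels matter, here is a minimal Python sketch of one possible rule: the peer is presumed dead only when every channel has been silent longer than a timeout. This is an assumption for illustration only. It is not RSF-1's actual algorithm or configuration, and the class name, method names and the 5-second timeout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PeerMonitor:
    """Tracks the last heartbeat seen on each channel (hypothetical model)."""
    timeout_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, channel: str, now: float) -> None:
        """Note that a heartbeat arrived on `channel` at time `now`."""
        self.last_seen[channel] = now

    def silent_channels(self, now: float) -> List[str]:
        """Channels whose last heartbeat is older than the timeout."""
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)

    def peer_presumed_dead(self, now: float) -> bool:
        """True only when every known channel has gone silent."""
        return bool(self.last_seen) and all(
            now - seen > self.timeout_s for seen in self.last_seen.values())


monitor = PeerMonitor(timeout_s=5.0)
for channel in ("network", "serial", "disk"):
    monitor.record(channel, now=0.0)

# Only the serial link keeps reporting; the peer is still considered alive.
monitor.record("serial", now=9.0)
print(monitor.silent_channels(now=10.0))
print(monitor.peer_presumed_dead(now=10.0))
```

Requiring all channels to go quiet means a single failed network link or cable does not trigger a failover on its own. Heartbeats alone cannot rule out a split brain, which is the job of the SCSI reservations.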
The first hiccup was the reservations set in the config provided. They were not aware that one of my servers had two HBAs with multipath set up to the SSDs. This caused system panics with the initial version they installed. When I alerted them to the issue, they quickly got me an updated version that fixed the problem.
The second hiccup was caused by the disk heartbeats that were set up. The pool was configured with an ashift of 12 for the SSDs to reduce the read/modify/writes that would be happening with 512-byte sectors. When I told them about the problem, they said I would be better off installing a couple of spinning disks for heartbeats or using a serial cable for an out-of-band heartbeat. The reason is that the low-level writes could end up bypassing the wear leveling on the SSD and cause premature failure.
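For readers unfamiliar with ashift: ZFS treats it as the base-2 logarithm of the smallest block it will write to a device, so 12 means 4096 bytes and 9 means 512 bytes. The Python sketch below shows why small writes are costly on flash. The 4096-byte internal page size is an assumption for illustration (real SSD pages are often larger), and the function names are hypothetical.

```python
def block_size(ashift: int) -> int:
    """Smallest write ZFS will issue for a given ashift (2 ** ashift bytes)."""
    return 1 << ashift


def needs_read_modify_write(offset: int, length: int,
                            page_bytes: int = 4096) -> bool:
    """True if a write covers only part of a page at either end.

    A partial-page write forces the device to read the page, merge the new
    data, and write the whole page back. page_bytes is an assumed value.
    """
    return offset % page_bytes != 0 or length % page_bytes != 0


for ashift in (9, 12):
    size = block_size(ashift)
    # A single minimum-size write placed one block into the device.
    rmw = needs_read_modify_write(offset=size, length=size)
    print(f"ashift={ashift}: block={size} bytes, read/modify/write={rmw}")
```

With ashift 9 the 512-byte write lands inside a page and triggers a read/modify/write. With ashift 12 every write ZFS issues covers whole 4096-byte pages under the assumed page size.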
With those issues behind me, I’m on to testing. I have configured a zpool and ZFS folder and NFS mounted it to our vSphere cluster. I performed failovers while doing each of the following:
- Deploying a VM with Puppet.
- Storage vMotioning a VM to and from the pool.
- Suspending a running VM.
- Powering on a suspended VM.
- Accessing the web interface of the VM.
In most cases the failover could not even be noticed. When it was noticeable, it was only about a three-second delay in response.
For some of the tests I had the pool on the virtual OmniOS system and powered off the VM. Again RSF-1 was quick to respond and the failover was not even noticeable.
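One way to put a number on “noticeable” is to probe the service at a fixed interval during a failover and look at the longest stretch between successful responses. The Python sketch below does only that last step. It is a hypothetical helper, not how the tests above were measured, and the timestamps in the example are made up.

```python
from typing import List


def longest_gap(success_times: List[float]) -> float:
    """Longest interval, in seconds, between consecutive successful probes."""
    if len(success_times) < 2:
        return 0.0
    ordered = sorted(success_times)
    return max(later - earlier
               for earlier, later in zip(ordered, ordered[1:]))


# Made-up timestamps: probes succeed once a second, then stall during failover.
probe_successes = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
print(f"longest stall: {longest_gap(probe_successes)} seconds")
```

Subtracting the probe interval from the result gives a rough upper bound on how long the storage was actually unresponsive.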
The configuration is straightforward. Now that I have learned the details, I’m actually surprised that High-Availability.com insists on doing the install. With a little more documentation on their part, it could be a no-brainer to install on your own. However, I don’t know whether that would help or hurt sales of specialized software like RSF-1.
At this point I’m requesting an official quote so we can order the software. I will follow up with details of installing OmniOS on our existing ZFS server and marrying it to our new SSD server with RSF-1.
Breaking the law!:
Fast, Cheap, Reliable: Pick two.
With 70 SSDs this system is unbelievably fast. It is relatively cheap for approximately 20TB of usable SSD storage. Okay, let’s keep our fingers crossed on reliable. Several best practices for reliable enterprise ZFS storage were broken, but considering that the majority of this server’s use is as a high-speed scratch pool, we should be okay. We will add a second SAS switch and eliminate all the hardware single points of failure soon. The only foreseeable gotcha I may still have out there is a reset storm on a SAS expander/backplane that takes down a pool.
RSF-1 will play a critical role in our ability to update hardware and software, and even change ZFS operating systems if we deem it necessary, without downtime of our ZFS pools. I don’t expect it to be the magic bullet that makes consumer SSDs as reliable as enterprise SAS SSDs. I have had to go many months at a time without updating our ZFS box while waiting for a long enough window to service it. That will be a thing of the past once RSF-1 is implemented. I will simply move the pool to one server and perform the maintenance at any time. One less reason to come into work on a Saturday or Sunday!
The Nitty-Gritty Hardware Details
For those of you looking for the parts I used in this build, here’s the raw parts list.
- Supermicro 2U heat sink
- Supermicro system drive mount
- Xeon E5-2643 (4C 3.3 GHz)
- Intel dual 10GbE NIC
- 2-port External to Internal iPass
- 10GbE SFP+ twinax cables
- Intel 320 Series SSD – 80 GB
- SSDs Intel S3700 (800GB)
- SSDs Samsung 840 Pro (512GB)
- LSI Interposer Card
- LSI 6160 SAS Switch
- Spare power supply
- SAS Switch Shelf
- 2.2ft External SAS CBL-0166L
Update Feb 2014: Pick two holds true
Looks like the law has caught up to us. We got two out of three: fast and cheap. The interposers are proving to be a breaking point. Out of over 80 interposers used, there have been two failures. Each time, the interposer failure caused the entire pool to freeze. Both of the failed interposers were used in the L2ARC on other systems. It appears that the mpt_sas driver tries repeatedly to reset the interposer and eventually resets the entire SAS path, including the driver. This leads to the pool becoming inaccessible. Rebooting the server caused the failed interposer and SATA SSD to be kicked offline. If Illumos properly kicked the device from the pool, the reliability might still be there.
To date the Samsung SSDs are the only SSDs that have worked for me behind interposers. Intel and Micron drives both failed to initialize and talk to the system. The Samsungs are also proving to be reliable and fast. Granted, they have been in use for less than a year, but out of over 80 of them, I haven’t seen a single error and their speed has not degraded.
The SAS switch also is not getting along with the Supermicro JBODs. Making changes on one system often causes a cascading problem where devices in the JBOD go offline and don’t return until a power cycle. I suspect this is a problem in the Supermicro JBOD, as I’ve seen lots of odd problems like this with their JBODs. I’ve come to the conclusion that for a highly available production system, Supermicro JBODs should be avoided. My current best choice is DataON. They do cost significantly more, but in the total cost picture they are not a bad investment.