It’s Easier for Infrastructure to Be Invisible If It Uses Less Rack Space

 

Datrium Uses Less Overhead to Protect Against 6 Drive Failures Than Nutanix Needs to Protect Against 1

Datrium DVX, our on-prem DHCI system, is much better at efficient high availability than Nutanix or other DHCI systems. This functionality complements DVX’s existing superiority in performance, encryption, ransomware protection, and integration with our industry-leading DRaaS with VMware Cloud.

These product innovations and our excellent customer support are all part of why Datrium has an NPS of over 92.

By contrast, with the emergence of vendors like Nutanix and other Hadoopy storage, Gartner estimates that IT storage utilization has decreased in general, from 67% in 2011 to 56% in 2017. They recommend that IT replace inefficient storage mirroring with efficient, proven RAID6 and erasure coding. “This is especially needed,” they say, “for SDS and HCI solutions.” In the Nutanix solution, erasure coding is available, but it’s much slower, so Nutanix recommends it only for cold data; hot data still uses mirroring.

DVX uses only RAID6-style erasure coding (8D+2P), with always-on global dedupe and compression. What about when you want even more drive resilience? Read on.

 

Background 

DHCI already has a better foundation for data high availability at scale than Nutanix. DHCI systems have an unfair head start: most of the nodes in a DHCI system, the app Hosts, are stateless and have no bearing on data availability. DHCI is already better for enterprise scaling, period. For a lot more on how Datrium exceeds Nutanix on a long list of availability techniques, read our whitepaper, System Availability Datrium DHCI vs. Traditional HCI. But even some DHCI systems, such as NetApp HCI, use mirroring and can only tolerate 1 drive failure at a time.

 

Optional Data Node Fault Tolerance
Figure 1 – DVX offers the option of whole Data Node Fault Tolerance (DNFT)

 

DVX also offers the option of whole Data Node Fault Tolerance (DNFT). With DNFT, if a whole Data Node (2 discrete motherboards with dual power and networking, and dual-attached drives) stops working, the pool continues to offer data access. DNFT also increases drive fault tolerance (see Figure 1).

Whole Data Node failure is quite unlikely. DVX’s off-host Data Nodes have dual-attached drives and dual active/passive hot-swappable motherboards. We haven’t had both motherboards fail simultaneously in a Data Node in normal use. But other vendors have seen this happen in systems with similar connectivity, and some of their customers have been bitten by it. And, on rare occasions, the chassis itself could need to be swapped out, perhaps because of a bent pin. In a scale-out pool without DNFT, such a failure or maintenance operation would stop the pool.

To guard against black swan failures, we offer DNFT; if customers choose this option, a whole Data Node can fail, and the pool stays alive, serving data.

 

DNFT: How It Works and Why It’s Efficient

Here’s a quick background on the DVX write process: 

  • First, data blocks (checksummed, compressed, and encrypted) are written quickly by a Host to NVRAM on one motherboard of a Data Node, which mirrors them to a different Data Node motherboard’s NVRAM and then returns control to the Guest. 
  • Later, once the Host has composed a full erasure-coded stripe, stripe chunks with fingerprinted data are written directly from the Host to drives across the pool of Data Nodes, bypassing NVRAM. Read more about Automatrix Always-On Erasure Coding. At that point, data is durably protected on low-cost drives with efficient overhead. A sketch of this two-phase flow follows this list.
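To make the ordering concrete, here is a minimal sketch of that two-phase flow. All names (HostWritePath, DataNodeNVRAM, Drive, and so on) are hypothetical, invented for illustration; they are not Datrium APIs, and the crypto and parity steps are stand-in placeholders.

```python
# Minimal, hypothetical sketch of the two-phase write path described above.
# HostWritePath, DataNodeNVRAM, and Drive are invented for illustration only.

STRIPE_DATA_CHUNKS = 8    # 8D
STRIPE_PARITY_CHUNKS = 2  # +2P (baseline, before DNFT)

class DataNodeNVRAM:
    """Stands in for NVRAM on one Data Node motherboard."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def persist(self, block):
        self.log.append(block)  # a block is durable once it lands here

class Drive:
    """Stands in for one drive in the pool of Data Nodes."""
    def __init__(self):
        self.chunks = []

    def write(self, chunk):
        self.chunks.append(chunk)

def checksum_compress_encrypt(block):
    return block  # placeholder for the per-block work done on the Host

def compute_parity(data_chunks, n_parity):
    return [f"parity-{i}" for i in range(n_parity)]  # placeholder for real erasure coding

class HostWritePath:
    def __init__(self, primary_nvram, mirror_nvram, pool_drives):
        self.primary = primary_nvram
        self.mirror = mirror_nvram      # NVRAM on a different motherboard
        self.pool_drives = pool_drives  # drives spread across Data Nodes
        self.pending = []               # blocks not yet destaged in a stripe

    def guest_write(self, block):
        """Phase 1: persist to two NVRAMs, then acknowledge the Guest."""
        block = checksum_compress_encrypt(block)
        self.primary.persist(block)
        self.mirror.persist(block)
        self.pending.append(block)
        return "ack-to-guest"           # the Guest sees completion here

    def maybe_destage(self):
        """Phase 2: once a full stripe is composed, write its chunks directly
        from the Host to drives across the pool, bypassing NVRAM."""
        if len(self.pending) < STRIPE_DATA_CHUNKS:
            return
        data_chunks = self.pending[:STRIPE_DATA_CHUNKS]
        parity_chunks = compute_parity(data_chunks, STRIPE_PARITY_CHUNKS)
        for drive, chunk in zip(self.pool_drives, data_chunks + parity_chunks):
            drive.write(chunk)
        self.pending = self.pending[STRIPE_DATA_CHUNKS:]
```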
Figure 2 – Increased drive FT with DNFT option

Stripe writes. DNFT requires the physical capacity of 1 Data Node to be applied to resilience, hidden from application use (though it does add write throughput). After all, a whole Data Node may become unavailable at any time. DNFT is an option for pools with at least 3, and up to 10, Data Nodes (see Figure 2). 

We also use this extra hardware to increase the minimum drive resilience to RF=4 (or FT=3, depending on your HCI training), but always with Erasure Coding.

This reserved resilience hardware means that in smaller pools, a higher percentage of drives is applied to DNFT. To put it to good use, in those configurations we increase the amount of parity and drive fault tolerance for situations where a whole Data Node has not failed.
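As a back-of-the-envelope illustration (our arithmetic here, not published sizing guidance): reserving the capacity of one Data Node out of N means the reserved share shrinks as the pool grows, which is why smaller pools have the headroom for extra parity.

```python
# Rough illustration only: with DNFT, the capacity of 1 Data Node out of N
# is held back for resilience, so smaller pools reserve a larger share.
for n in range(3, 11):  # DNFT supports pools of 3 to 10 Data Nodes
    reserved = 1 / n
    print(f"{n:2d} Data Nodes: {reserved:.0%} of raw pool capacity reserved for DNFT")
```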

Figure 3 – DVX host software with always-on Erasure Coding

 

With DNFT turned on, drive fault tolerance increases to a minimum of 3 and a maximum of 6. In the HCI world, that maximum would be described as FT=6 or RF=7. But even in that case, we use less overhead space for this drive protection than NetApp HCI with its single mirror, or any traditional HCI, which has unfortunate odds of data loss every year. 

Why low overhead for high availability: a DVX erasure-coded stripe has 8 data chunks. So RF=7 drive protection would require 8D+6P, which is still less overhead than, e.g., Nutanix RF=2 (1D + 1 replica). HCI hot data is mirrored, and they can add replicas to become more resilient; it just means an ever smaller fraction of the drives actually stores application data.
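To make that comparison concrete, here is the raw-capacity arithmetic (ignoring dedupe, compression, and metadata, which shift the numbers further in practice): efficiency is simply data chunks divided by total chunks written.

```python
# Raw-capacity efficiency = data chunks / total chunks (or copies) written.
# Dedupe, compression, and metadata are ignored in this back-of-the-envelope view.
layouts = {
    "DVX 8D+2P (baseline)":     (8, 2),  # 8 data chunks + 2 parity chunks
    "DVX 8D+6P (DNFT maximum)": (8, 6),  # 8 data chunks + 6 parity chunks
    "HCI RF=2 mirroring":       (1, 1),  # 1 data copy + 1 replica
    "HCI RF=3 mirroring":       (1, 2),  # 1 data copy + 2 replicas
}
for name, (data, redundancy) in layouts.items():
    efficiency = data / (data + redundancy)
    print(f"{name:26s} -> {efficiency:.0%} of raw capacity holds data")
```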

Sector read fault tolerance during rebuild. No matter how many whole-drive failures a system can tolerate, the most common need for additional parity arises when the maximum number of drives is already down and a sector read error occurs during rebuild. So we also insert additional parity within each erasure-coded stripe’s chunk to recover from that situation (see Figure 3). We planned this architecture from the beginning, so when we rolled it out earlier this year, it took no extra space for existing customers.
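As a toy model of the idea (a simplification for illustration, not the actual on-disk format): give each chunk one extra parity sector that is the XOR of its data sectors, and a single unreadable sector can then be rebuilt locally, without spending stripe-level parity that may already be consumed by failed drives.

```python
# Toy illustration of intra-chunk sector parity: one extra XOR parity sector
# per chunk lets a single bad sector be rebuilt from the chunk's survivors.
# This is a simplified stand-in, not Datrium's actual on-disk layout.

SECTOR_SIZE = 16  # bytes; toy value for the example

def xor_sectors(sectors):
    """XOR equal-length byte sectors together."""
    out = bytearray(SECTOR_SIZE)
    for s in sectors:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)

def add_chunk_parity(data_sectors):
    """Append one parity sector (XOR of the data sectors) to a chunk."""
    return data_sectors + [xor_sectors(data_sectors)]

def recover_sector(chunk_with_parity, bad_index):
    """Rebuild one lost sector from the surviving sectors plus parity."""
    survivors = [s for i, s in enumerate(chunk_with_parity) if i != bad_index]
    return xor_sectors(survivors)

chunk = [bytes([i] * SECTOR_SIZE) for i in range(4)]  # 4 data sectors
protected = add_chunk_parity(chunk)
rebuilt = recover_sector(protected, bad_index=2)      # simulate a sector read error
assert rebuilt == chunk[2]
print("bad sector recovered via intra-chunk parity")
```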

It’s also important to remember that all drives and all Hosts play a part in rebuilds. The bigger the system is, the faster a rebuild goes. That’s unlike other DHCI systems, which don’t use Hosts as rebuild data movers.
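A rough way to see why (illustrative arithmetic only, with hypothetical numbers, not measured results): if the amount of data to rebuild is fixed and every surviving drive and Host contributes bandwidth, rebuild time falls roughly in proportion to the number of participants.

```python
# Illustrative only: rebuild_time ~ data_to_rebuild / aggregate_rebuild_bandwidth,
# and aggregate bandwidth grows with the number of participating drives and Hosts.
# Both constants below are hypothetical values chosen just for the example.
DATA_TO_REBUILD_TB = 10      # hypothetical data on the failed drive(s)
PER_PARTICIPANT_GBPS = 0.5   # hypothetical rebuild bandwidth per participant

for participants in (10, 20, 40, 80):
    aggregate_gbps = participants * PER_PARTICIPANT_GBPS
    hours = (DATA_TO_REBUILD_TB * 8 * 1000) / aggregate_gbps / 3600
    print(f"{participants:3d} participants -> ~{hours:.1f} h to rebuild {DATA_TO_REBUILD_TB} TB")
```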

NVRAM mirroring in DVX also breaks with the tradition of midrange arrays. For example, NVRAM doesn’t de-stage internally, so it doesn’t represent a bottleneck. Separately, NVRAM mirroring (transient mirrors during persistence) replicates over the LAN to a different motherboard (typically on a different Data Node), so data can be more resilient to chassis failure (see Figure 4). 

 

Figure 4 – NVRAM mirroring in DVX also breaks with the tradition of midrange arrays

 

In a scale-out DVX pool, this is part of the systems-level thinking that enables a whole chassis to fail without stopping the pool or, in an extreme case, losing data. 

Does your DHCI vendor offer this functionality? If you’d like to see how we do it, just contact us for a demo.