There are two ways hosts can increase IO performance compared with a traditional external array model:
- Offloading, by serving IO locally and saving a network hop;
- Acceleration, to reduce pressure on external resources. Examples are compression and bundling writes into fewer, larger transactions instead of many small block transactions.
In the DVX, we offload reads physically via massive caches per host. The DVX also accelerates reads, but offloading has the biggest effect.
Writes, by contrast, are accelerated rather than offloaded. This allows hosts to be pure performance resources for their storage, separated from the concerns of persisting their data.
The DVX approach scales naturally with the number of hosts as a multiplier on the native bandwidth of a NetShelf. For example, if 90% of IOs are serviced by the host, only 10% reach the NetShelf. Compared with an equivalently configured array:
- A NetShelf would need one-tenth of the internal bandwidth to support the same number of workloads;
- A NetShelf could have 10x more workloads in front of it.
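The arithmetic behind the 10x figure can be sketched in a few lines. This is an illustrative calculation only: the function names are hypothetical, and it assumes every IO costs the same bandwidth and that the host-serviced fraction is the only variable.

```python
def netshelf_load_fraction(host_serviced_ratio: float) -> float:
    """Fraction of all IO that still has to reach the NetShelf."""
    if not 0.0 <= host_serviced_ratio < 1.0:
        raise ValueError("host_serviced_ratio must be in [0.0, 1.0)")
    return 1.0 - host_serviced_ratio


def workload_multiplier(host_serviced_ratio: float) -> float:
    """How many times more workloads the same back-end bandwidth can
    sit behind, relative to an array that services every IO itself."""
    return 1.0 / netshelf_load_fraction(host_serviced_ratio)


# Assumed host-serviced ratios; 0.90 is the example used in the text.
for ratio in (0.50, 0.90, 0.95):
    print(f"{ratio:.0%} host-serviced -> "
          f"{workload_multiplier(ratio):.0f}x workloads per NetShelf")
```

The multiplier grows quickly as the host-serviced ratio approaches 100%, which is why offloading reads has such a large effect.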
There is an ultimate limit to the write bandwidth of a single NetShelf in the first release of the DVX. The DVX enables linear scaling of compute and reads while keeping hosts stateless, and hosts accelerate their own writes.
Most DVX Host IO is Serviced by the Host Itself
In enterprise server VMs, most IO is read IO (with notable exceptions). As noted above, host read caching removes most of the bandwidth demand that an array would otherwise have to serve.
How about writes?
- Inline host compression. Compression is a CPU-intensive operation and scales naturally with the number of hosts.
- Checksums and referential integrity checks. All written data is checksummed on the host inline. The host also prepares the data for future referential integrity checks. These are also CPU-intensive operations that scale with the number of hosts.
- Inline deduplication. Host caches are fully deduplicated inline. Some stages of NetShelf dedupe will be done on the host inline, and these stages will also scale directly with the number of hosts. The remaining stages are done on the host too, but after the initial persistent write.
- Coalesced writes. Disk writes are unblended and coalesced by the host up front, so there are fewer, larger IOs, and the NetShelf’s media are not burdened by write IOPS demands. Random writes go to NVRAM on the NetShelf; random reads go to host flash, which is fast and cheap for this purpose.
- Host-side RAID computation. RAID error correction codes are computed on the host, so the NetShelf only has to transfer the new writes to NetShelf disk.
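To make the host-side steps in the list above more concrete, here is a minimal sketch of a write pipeline that fingerprints, compresses, checksums, deduplicates, and coalesces blocks, then computes parity before anything is sent. It is not the DVX implementation: the function names, the 4 KiB block size, SHA-256 fingerprints, zlib compression, CRC32 checksums, and single XOR parity (standing in for the RAID error correction codes) are all assumptions chosen for illustration. It also deduplicates entirely inline, whereas the text notes that some dedupe stages run after the initial persistent write.

```python
import hashlib
import zlib
from functools import reduce

BLOCK_SIZE = 4096  # hypothetical block size, for illustration only


def prepare_block(data: bytes) -> tuple[str, bytes, int]:
    """Fingerprint, compress, and checksum one block on the host."""
    fingerprint = hashlib.sha256(data).hexdigest()  # dedupe key (assumed)
    compressed = zlib.compress(data)                # inline compression
    checksum = zlib.crc32(compressed)               # integrity check value
    return fingerprint, compressed, checksum


def xor_parity(chunks: list[bytes]) -> bytes:
    """Single XOR parity over chunks, zero-padded to equal length.
    A simplified stand-in for real RAID error correction codes."""
    width = max(len(c) for c in chunks)
    padded = [c.ljust(width, b"\0") for c in chunks]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*padded))


def build_stripe(blocks: list[bytes], data_chunks: int = 4) -> dict:
    """Coalesce many small writes into one payload split into a few
    large chunks plus parity, all computed on the host."""
    index: dict[str, tuple[int, int, int]] = {}  # fp -> (offset, len, crc)
    refs: list[str] = []                         # one fingerprint per write
    payload = bytearray()
    for block in blocks:
        fp, compressed, crc = prepare_block(block)
        refs.append(fp)
        if fp in index:
            continue  # duplicate: keep a reference, store nothing new
        index[fp] = (len(payload), len(compressed), crc)
        payload += compressed
    chunk_len = -(-len(payload) // data_chunks)  # ceiling division
    chunks = [bytes(payload[i * chunk_len:(i + 1) * chunk_len])
              for i in range(data_chunks)]
    return {"refs": refs, "index": index, "chunks": chunks,
            "parity": xor_parity(chunks), "unique_blocks": len(index)}


blocks = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE,
          b"A" * BLOCK_SIZE, bytes(range(256)) * 16]
stripe = build_stripe(blocks)
print(len(blocks), "writes ->", stripe["unique_blocks"], "unique blocks,",
      len(stripe["chunks"]), "data chunks + 1 parity chunk")
```

In this sketch the receiving side only has to store the chunks and parity it is handed; all of the CPU-heavy work has already been done by the sender, which is the scaling property the list describes.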
Why not just offload writes to a Host Writeback Cache / Tier?
An alternative approach is writeback caching, where some data is held back from the durable store within a host offload pool.
Because the array resilience functions can’t protect the withheld data, these approaches typically use replicating pools of hosts, mirroring to each other.
This has problems:
- Support finger-pointing. Holding back writes creates two tiers to administer. It’s especially bad with a third-party cache that is not fully integrated with the array: there are then two vendors, two admin tools, and two groups pointing fingers at each other when there’s a problem.
- Higher latency writes. To get resilience in the host pool, the hosts need media for the writes that can withstand power loss. Most vendors use SSDs, and they are slower for writes than the NVRAM in a NetShelf.
- Neighbor noise. Pooled writes affect the performance of hosts that are not running the originating VM, and intensive load on the replica hosts can add latency to other hosts’ writes. This neighbor noise makes troubleshooting and tuning harder.
- More expensive flash. Replicating across host caches carves up each host’s flash allocation, using some of it for other hosts’ replicas and making all of it more expensive per usable byte. As it turns out, most third-party host caches also need to start with more expensive SSDs that have higher TBW ratings for longer endurance.
- Cache flush delay and scripting. To do a consistent off-host clone or snapshot, you have to flush the host pool cache first. Even at a nominal delta size, this can cause noticeable delay and administrator annoyance.
- Complex and expensive rebalancing of loads on host power-off or crash. Compute resources are transient by nature: hosts often move between clusters, enter maintenance mode for software upgrades, or simply get powered off (for example, automatically by VMware Distributed Power Management). In all these cases, to avoid compromising durability guarantees, all writes withheld by the unavailable host must be synchronously re-replicated before any new IO can be handled. To avoid such hiccups, 3 or more copies of cached writes must be maintained on different hosts to approximate RAID6 levels of durability.
The DVX avoids these problems by keeping durable data off-host.
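The flash cost of replicated writeback pools can be estimated with simple arithmetic. The sketch below is illustrative only: the function names and the host and capacity figures are hypothetical, and it assumes full mirroring, so that tolerating f simultaneous host failures takes f + 1 copies. RAID6 tolerates two failures, which is where the figure of 3 copies comes from.

```python
def copies_for_tolerance(host_failures_tolerated: int) -> int:
    """Mirrored copies needed to survive the given number of
    simultaneous host failures (full mirroring assumed)."""
    if host_failures_tolerated < 0:
        raise ValueError("host_failures_tolerated must be >= 0")
    return host_failures_tolerated + 1


def usable_flash_gb(flash_per_host_gb: float, hosts: int, copies: int) -> float:
    """Pool capacity left for unique data once every cached write
    is stored `copies` times across the hosts."""
    return flash_per_host_gb * hosts / copies


# Hypothetical cluster: 8 hosts with 1000 GB of cache flash each.
copies = copies_for_tolerance(2)  # RAID6-like: survive two failures
usable = usable_flash_gb(1000, hosts=8, copies=copies)
print(f"{copies} copies -> {usable:.0f} GB usable of 8000 GB raw")
```

Under these assumptions, two thirds of the purchased flash holds other hosts' replicas rather than unique data.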