Delighting customers with a system that is super simple to use, works great at scale, and is very cost efficient is a heady prospect to be sure. However, if you listen to our customers, Datrium has made “delight” a reality with:
- Massively scalable performance for primary storage
- Data efficiencies and copy data economics for secondary storage
- Being easy as hell to use, with high reliability
But how does DVX (the name of our solution) actually achieve its unprecedented performance, data efficiencies and reliability? With a filesystem foundation designed with the following built-in features:
- Mind-blowing performance
- Distributed and scalable object store
- Always-on inline compression and deduplication for HDDs and SSDs
- Always-on distributed erasure coding
- Use of HDD and/or SSDs as persistent stores
- Massive host side flash caches
- Millions of snapshots with fine granularity
- Predictable behavior
Solving for three variables solves everything else
Interestingly, solving three important problems solves all others.
Problem 1: Handling Random Writes with low latency using HDDs & MLC flash
VMs do random I/Os. Random reads can be easily addressed with flash SSDs. However, random writes are a completely different story. It is obvious that spinning HDDs cannot handle random writes very well. However, even MLC SSDs have difficulty dealing with random writes. Write-optimized enterprise SSDs are better at random writes, but then the cost goes up significantly. So how do you get good random write performance and low latency out of HDDs and low-cost MLC SSDs?
The other issue for random writes is the data protection strategy. Traditionally, RAID-5/6 or Erasure Coding layouts are problematic because random writes require read-modify-writes of entire stripes, resulting in huge I/O amplification and horrible performance. Hence, most HCI vendors choose RAID-1 type mirroring as the default. But, mirroring increases the cost of acquisition, and the cost of processing the same I/O in multiple nodes. Inline Erasure Coding is the best in terms of efficiency, but how do you make this work in a distributed system?
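To put rough numbers on that amplification, here is a minimal sketch using the textbook small-write model for parity-protected stripes. The function names are made up for illustration, and the model is an assumption: it counts only the I/Os needed to update one data chunk and its parity chunks in place, so a layout that rewrites the entire stripe would fare even worse.

```python
def in_place_update_ios(parity_chunks: int) -> int:
    """Disk I/Os to overwrite one data chunk in an existing parity stripe.

    Textbook model (an assumption, not a measurement of any product):
    read the old data and old parity chunks, then write the new data
    and the recomputed parity chunks.
    """
    reads = 1 + parity_chunks
    writes = 1 + parity_chunks
    return reads + writes


def mirrored_write_ios(copies: int) -> int:
    """Disk I/Os to write one block to every mirror copy."""
    return copies


if __name__ == "__main__":
    print("single parity (RAID-5 style):", in_place_update_ios(1), "I/Os")
    print("double parity (RAID-6 style):", in_place_update_ios(2), "I/Os")
    print("2-way mirror:", mirrored_write_ios(2), "I/Os")
```

Under this model one small random write costs 4 I/Os with single parity and 6 with double parity, against 2 for a two-way mirror, which is why mirroring is the common default despite its capacity cost.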
Problem 2: Handling Random Reads with low latency
Random reads can be solved by using SSDs. But, how does it work with a combination of HDDs and SSDs? And how do you get low latencies without having costly SAN controllers, and avoid the Little’s Law effect? Read latencies are generally much more perceptible to end-users than write latencies.
Problem 3: Handling variable block sizes with Inline Compression
The other very interesting problem is inline compression. Compression is valuable to have because most workloads get 2x to 3x space savings. “Inline” makes it better because, when done on the host, it saves both network and disk resources, etc. Why do systems have a hard time implementing inline compression? Why is there a checkbox to enable this, and why is it OFF by default? Because there is a catch to enabling compression.
VMs generally do I/Os in 4KB blocks (or multiples of it). Without compression, the 4KB blocks can be written to disk as they are. If a block is deleted, then it is easy to mark that 4KB block as free, and the free space can be used later. If a block is overwritten, it is easy to overwrite it on disk. This is how the ext3 file-system works.
However, if you compress a 4KB block, it might end up becoming 2.52KB in size. Disks these days want 4KB aligned I/Os. So, the 2.52KB compressed block will have to be saved in a 4KB block on disk, thus voiding the space savings. Let us say you somehow managed to save this 2.52KB somewhere on disk efficiently. The VM can overwrite that location with new data, and the new data can now compress to 2.73KB. This is slightly larger than the previous 2.52KB, and so where do you store the extra bytes now? This is hell to manage. Basically, the problem with compression boils down to how to manage space allocation on disk efficiently in the face of variable sized blocks.
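The arithmetic behind that space-allocation problem can be sketched in a few lines. The first two sizes below are the 2.52KB and 2.73KB figures from the text converted to bytes; the third size and the function names are made up for illustration.

```python
SECTOR = 4096  # disks want 4KB-aligned I/Os


def round_up(n: int, unit: int = SECTOR) -> int:
    """Round n up to the next multiple of unit."""
    return -(-n // unit) * unit


def padded_bytes(compressed_sizes) -> int:
    """Space used when every compressed block gets its own 4KB slot."""
    return sum(round_up(size) for size in compressed_sizes)


def packed_bytes(compressed_sizes) -> int:
    """Space used when compressed blocks are laid end to end and only
    the tail of the whole run is padded to a 4KB boundary."""
    return round_up(sum(compressed_sizes))


if __name__ == "__main__":
    sizes = [2580, 2796, 1900]  # ~2.52KB, ~2.73KB, plus one made-up block
    print("one slot per block:", padded_bytes(sizes), "bytes")
    print("tightly packed:    ", packed_bytes(sizes), "bytes")
```

With one slot per block the three compressed blocks still occupy 12288 bytes, exactly what they would use uncompressed. Packed end to end they fit in 8192 bytes, but only if something keeps track of where each variable-sized block lives.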
Log-Structured Filesystem (LFS) to the rescue
Let me introduce the Log-Structured Filesystem (LFS). It was originally designed by Mendel Rosenblum and John Ousterhout at UC Berkeley (Mendel is a co-founder of VMware, and also an investor in Datrium). LFS started becoming popular when Google (with the Google File System), DataDomain, etc., started using it to build distributed systems and deal with variable sized blocks. More recently, LFS became indispensable to SSDs: this is how they manage flash chips internally. Below is the gist of how Datrium implemented a proprietary scale-out version of LFS. It is actually quite elegant.
When the DVX gets a write request from a VM, the write data is deduped and compressed (and checksummed) right away, and stored in multiple NVRAM locations on data nodes. Compression and checksumming are performed by the hosts issuing writes before any network transfer. The NVRAM writes are super fast. As soon as the NVRAM writes are done, the write request is ack’ed back to the VM. So, the write latency from the VM’s perspective is low.
In the background, an 8MB “container object” is built on the host in memory. Each of the compressed writes is added to the container as a log. All the compressed writes are tightly packed into the container. Once the container is full, it is sealed, ECC codes computed, and the container with ECC is written out as erasure coded stripes to data nodes. The drives on the data nodes collectively form a Distributed Object Store. The container object is split into N data chunks and 2 parity chunks to tolerate 2 concurrent drive failures. It is easy to add in more parity chunks in the future if there is a need to tolerate more concurrent drive failures. Adding one more parity will add about 1/N capacity overhead.
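The packing step can be pictured with a toy container. This is only a sketch of the general idea, not Datrium's code: the `Container` class, its method names, the use of zlib and CRC32, and the in-memory index are all assumptions made for illustration.

```python
import zlib

CONTAINER_SIZE = 8 * 1024 * 1024  # 8MB, as in the text


class Container:
    """Toy append-only container: compressed writes packed back to back."""

    def __init__(self, capacity: int = CONTAINER_SIZE):
        self.capacity = capacity
        self.buf = bytearray()
        self.index = {}  # block_id -> (offset, length, crc32 of payload)
        self.sealed = False

    def try_append(self, block_id, data: bytes) -> bool:
        """Compress and append one write; return False if it does not fit."""
        if self.sealed:
            raise RuntimeError("sealed containers are immutable")
        payload = zlib.compress(data)
        if len(self.buf) + len(payload) > self.capacity:
            return False  # caller seals this container and opens a new one
        self.index[block_id] = (len(self.buf), len(payload), zlib.crc32(payload))
        self.buf += payload  # no padding between entries
        return True

    def seal(self) -> bytes:
        """Freeze the container; the result is what would be erasure coded."""
        self.sealed = True
        return bytes(self.buf)

    def read(self, block_id) -> bytes:
        """Verify the checksum, then decompress one block."""
        offset, length, crc = self.index[block_id]
        payload = bytes(self.buf[offset:offset + length])
        if zlib.crc32(payload) != crc:
            raise ValueError("checksum mismatch")
        return zlib.decompress(payload)


if __name__ == "__main__":
    c = Container()
    c.try_append("vm1:0", b"A" * 4096)
    c.try_append("vm1:7", b"B" * 4096)
    print("bytes used by two 4KB writes:", len(c.seal()))
```

Because every entry starts exactly where the previous one ended, variable-sized compressed blocks leave no holes, and the only alignment that matters is that of the whole container.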
Each chunk is written to a different drive across multiple data nodes in the DVX system. Once sealed and written out, a container object is immutable and never changed—just deleted. New incoming VM data is written out to new containers. When VM data is overwritten by the guest, the old data is no longer needed. Unused blocks are then reclaimed using well known space reclamation techniques, such as reference counting, mark-and-sweep and/or generational collection, further enhanced with Datrium’s secret sauce. So there you have it—a quick high level description of Datrium’s LFS (figure 1).
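As a rough illustration of reclamation over immutable containers, here is a minimal mark-and-sweep sketch at whole-container granularity. The `LogStore` class and its methods are hypothetical, and it deliberately leaves out the harder part (copying the remaining live blocks out of mostly-dead containers), so it should not be read as how the DVX reclaims space.

```python
class LogStore:
    """Toy log-structured store: containers are written once, then only deleted."""

    def __init__(self):
        self.containers = {}  # container_id -> {lba: data}, never modified
        self.block_map = {}   # lba -> container_id holding the current version
        self.next_id = 0

    def write_container(self, blocks: dict) -> int:
        """Write a batch of blocks as one new container and redirect the map."""
        cid = self.next_id
        self.next_id += 1
        self.containers[cid] = dict(blocks)
        for lba in blocks:
            self.block_map[lba] = cid  # the old version becomes garbage
        return cid

    def read(self, lba):
        return self.containers[self.block_map[lba]][lba]

    def collect(self) -> list:
        """Mark containers still referenced by the block map, delete the rest."""
        live = set(self.block_map.values())                            # mark
        dead = [cid for cid in self.containers if cid not in live]
        for cid in dead:                                               # sweep
            del self.containers[cid]
        return dead


if __name__ == "__main__":
    store = LogStore()
    store.write_container({0: b"old0", 1: b"old1"})
    store.write_container({0: b"new0", 1: b"new1"})  # guest overwrites both blocks
    print("reclaimed containers:", store.collect())
```

The point of the sketch is that overwrites never touch an existing container: they only change which container the block map points at, and cleanup happens later in the background.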
What about VM reads? As writes are packed into containers and written out to persistent storage in the data nodes, a copy of the data is also placed in the local host flash with a log-structured layout after deduplication. Because of typical compression/dedupe ratios of 2x to 6x, the raw flash on the host becomes effectively bigger. At a 5x ratio, for example, 1TB of local flash will look like 5TB, making it cheap and abundant. At that scale, the host flash is no longer a cache for just hot data. All VM data can be cached in the local flash of the host where the VM runs. The VMs can see sub-millisecond latencies for their reads, and avoid interference from other hosts as well as latencies of the network. All reads are locally served, and the data nodes are used as a sequential backing store.
Problem 1 solved: By packing incoming VM write data into large containers before writing to persistent media, random I/Os are converted to sequential I/Os when they go to disk or MLC flash.
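To see why dedupe and compression make the flash "effectively bigger", consider a toy content-addressed cache. Everything here is an assumption for illustration (the `DedupeCache` name, SHA-256 fingerprints, zlib, no eviction, in-memory dictionaries standing in for flash); it only shows the bookkeeping idea.

```python
import hashlib
import zlib


class DedupeCache:
    """Toy read cache that stores each unique block once, compressed."""

    def __init__(self):
        self.by_fingerprint = {}  # fingerprint -> compressed payload
        self.lba_map = {}         # lba -> (fingerprint, uncompressed length)

    def put(self, lba: int, data: bytes) -> None:
        fp = hashlib.sha256(data).digest()
        if fp not in self.by_fingerprint:  # duplicates cost no extra space
            self.by_fingerprint[fp] = zlib.compress(data)
        self.lba_map[lba] = (fp, len(data))

    def get(self, lba: int):
        entry = self.lba_map.get(lba)
        if entry is None:
            return None  # miss: would be fetched from the data nodes
        return zlib.decompress(self.by_fingerprint[entry[0]])

    def logical_bytes(self) -> int:
        return sum(length for _, length in self.lba_map.values())

    def physical_bytes(self) -> int:
        return sum(len(p) for p in self.by_fingerprint.values())


if __name__ == "__main__":
    cache = DedupeCache()
    for lba in range(100):
        cache.put(lba, b"same guest OS block " * 200)  # 100 identical blocks
    print("logical bytes cached: ", cache.logical_bytes())
    print("physical bytes stored:", cache.physical_bytes())
```

The ratio between the two numbers is the "effective capacity" multiplier; real workloads land far below this artificial example, in the 2x to 6x range quoted above.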
Problem 2 solved: Local caches on the host are large because of dedupe/compression. All reads come from these local caches, which makes reads faster and scalable with very low latency.
Problem 3 solved: The packing of variable sized writes into containers before writing to persistent media eliminates holes and enables space savings.
LFS is like Santa Claus
LFS doesn’t just stop with solving inline compression and random writes; it keeps on giving. There are a lot of other significant advantages including:
Good for low-cost SSDs, and HDDs
The DVX coalesces VM generated random writes into large sequential I/Os to the DVX drives. This is very important for both MLC flash SSDs and HDDs. MLC flash works very well for random reads, but works poorly for random writes. Random writes to MLC flash result in performance issues and quicker wear-out, especially if you want to use cost-effective commercial grade SSDs. Enterprise drives are more effective at random writes but they are not cost effective. LFS’s trick of turning random writes into large sequential writes lets a low-cost SSD behave as well as or better than the costlier Enterprise SSDs. And, of course, the other well known thing about traditional spinning-media HDDs is that they are very bad at random I/O, but very good at large sequential I/Os. So, in one shot, LFS has made it possible to use low-cost SSDs and HDDs.
Distributed “Inline” Erasure Coding
LFS enables the creation of sealed containers, and these containers can then be stored using any one of the popular techniques. They can be stored with 3-way replication or erasure coded across nodes/disks. However, it is better to just do erasure coding because it is much more space efficient and resource efficient. And it is easy to do that with sealed containers. Using sealed containers completely removes the read-modify-write amplification penalty for dealing with VM random writes. And erasure coding can now be done fully “inline”.
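The space-efficiency argument is simple arithmetic. In the sketch below the function names are made up and N = 8 data chunks is an assumed stripe width (the text only says "N data chunks and 2 parity chunks").

```python
def replication_raw_per_usable(copies: int) -> float:
    """Raw capacity consumed per byte of user data with full copies."""
    return float(copies)


def erasure_raw_per_usable(data_chunks: int, parity_chunks: int) -> float:
    """Raw capacity consumed per byte of user data with N data + P parity chunks."""
    return (data_chunks + parity_chunks) / data_chunks


if __name__ == "__main__":
    n = 8  # assumed stripe width, for illustration only
    print("3-way replication:", replication_raw_per_usable(3))
    print(f"{n}+2 erasure coding:", erasure_raw_per_usable(n, 2))
    print(f"{n}+3 erasure coding:", erasure_raw_per_usable(n, 3))
```

With N = 8, two parity chunks cost 25% extra raw capacity, against 200% extra for three full copies, and each additional parity chunk adds another 1/N (12.5% here).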
Every VM write (random or sequential) is converted to large sequential I/Os to disk. Performance does not get any better than writing large sequential I/Os to SSDs or HDDs. This is the most ideal I/O pattern, and this is exactly what LFS produces. This kind of sequential I/O pattern to disks also behaves very predictably, and has an easy to understand behavior model.
Because all containers in LFS are written in their entirety, there are never any read-modify-writes in the DVX. The highly important invariant in the system is that no container is ever modified after it is written to disk. All new data goes into new containers (and into new stripes). This is a very nice invariant to have for data safety. However, this is not true if erasure coding is employed without an LFS implementation. The danger of doing read-modify-writes is that the system will run into performance issues or, worse, into a data corruption situation called a write hole after a power failure.
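A full-stripe write is easy to sketch because parity is computed once, from data that is already complete in memory. The example below is a deliberate simplification: it uses a single XOR parity chunk, which tolerates one lost chunk, whereas the two parity chunks described earlier would need a Reed-Solomon style code. All function names are illustrative.

```python
from typing import List, Optional


def split_chunks(container: bytes, n: int) -> List[bytes]:
    """Split a sealed container into n equal-sized data chunks (zero padded)."""
    size = -(-len(container) // n)  # ceiling division
    padded = container.ljust(size * n, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(n)]


def xor_parity(chunks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)


def reconstruct(chunks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild the single missing chunk (marked None) from survivors and parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can rebuild exactly one chunk")
    survivors = [c for c in chunks if c is not None]
    rebuilt = xor_parity(survivors + [parity])
    result = list(chunks)
    result[missing[0]] = rebuilt
    return result


if __name__ == "__main__":
    sealed = bytes(range(256)) * 4           # stand-in for a sealed container
    data = split_chunks(sealed, 4)
    parity = xor_parity(data)                # computed once, never updated
    damaged = [data[0], None, data[2], data[3]]
    print("recovered:", reconstruct(damaged, parity)[1] == data[1])
```

Since the stripe is written once and never patched, there is no window in which data and parity on disk disagree, which is the condition that produces a write hole.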
Because the DVX always writes new containers, and never overwrites an existing container, it lends itself naturally to be used for ROW (redirect-on-write) snapshots with great space efficiency. This is why the DVX can support millions of snapshots very easily.
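Redirect-on-write falls out of the never-overwrite rule almost for free, as this toy volume shows. The `Volume` class is hypothetical, and copying the whole block map per snapshot is a shortcut for readability; a system aiming at millions of snapshots would need to share map structure between snapshots rather than copy it.

```python
class Volume:
    """Toy redirect-on-write volume on top of an append-only block log."""

    def __init__(self):
        self.log = []         # append-only; entries are never overwritten
        self.block_map = {}   # lba -> index into the log (current view)
        self.snapshots = {}   # snapshot name -> frozen copy of the block map

    def write(self, lba: int, data: bytes) -> None:
        """New data always goes to a new log position; the map is redirected."""
        self.log.append(data)
        self.block_map[lba] = len(self.log) - 1

    def snapshot(self, name: str) -> None:
        """A snapshot only captures pointers; no data is copied."""
        self.snapshots[name] = dict(self.block_map)

    def read(self, lba: int, snapshot: str = None):
        view = self.block_map if snapshot is None else self.snapshots[snapshot]
        idx = view.get(lba)
        return None if idx is None else self.log[idx]


if __name__ == "__main__":
    vol = Volume()
    vol.write(0, b"v1")
    vol.snapshot("monday")
    vol.write(0, b"v2")  # redirected to a new location; "monday" is untouched
    print(vol.read(0), vol.read(0, snapshot="monday"))
```

Taking the snapshot costs only metadata, and the old data stays valid simply because nothing ever overwrites it.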
LSM-Trees for Dedupe Index
Once there is a stable container based object store, there is now a need to store the dedupe index safely on disk as well. The DVX uses LSM-Trees (log-structured merge trees) to store this index, with some additional secret sauce. Think of it as a high performance key-value database. Even the LSM-Trees are stored as containers in the LFS filesystem. Additionally, LSM-Trees bring along their own nice goodies to the table, but that will have to be described in a separate article by itself. By combining the dedupe index with the container based object store, we get an inline compressed Dedupe Store.
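For readers who have not met LSM-Trees before, here is a deliberately tiny one. It is a generic sketch of the data structure, not the DVX index: the `TinyLSM` class, the memtable limit, and the idea of mapping a fingerprint to a (container, offset) pair are assumptions for illustration.

```python
import bisect


class TinyLSM:
    """Toy LSM-tree: an in-memory table plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.runs = []  # oldest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as one sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))  # binary search on the key
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, keeping the newest value for each key."""
        merged = {}
        for run in self.runs:  # oldest first, so newer entries overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


if __name__ == "__main__":
    index = TinyLSM(memtable_limit=2)
    index.put("fp-aaa", (0, 0))       # fingerprint -> (container id, offset)
    index.put("fp-bbb", (0, 2580))    # memtable is full, so it is flushed
    index.put("fp-aaa", (1, 0))       # a newer location shadows the old one
    print(index.get("fp-aaa"), index.get("fp-bbb"))
```

Every flush and every compaction produces a new sorted run written in one sequential pass, so the index follows the same write-once discipline as the data containers, which is presumably why it fits so naturally on top of an LFS.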
An efficient deduped object store, built on top of an LFS filesystem with distributed erasure coding, provides a very high performance and robust foundation for the DVX. Any new feature can be easily built on top of this without having to change the foundation. The sky’s the limit.