The Correct Way to Use LSM-Trees

I recently read a couple of Chris Mellor’s articles discussing the virtues and vices of LSM-Trees (here and here). There is also a blog post by the folks at Confluent on this topic. Chris correctly identified the issue of write amplification with some of the open-source key/value stores that use LSM-Trees. Having studied and implemented LSM-Trees extensively over the past 10+ years, I wanted to share some learnings to add to this discussion.


The Good and Bad

LSM-Trees were first proposed in 1996 as a structure for storing key/value data. An LSM-Tree converts random key/value insertions into sequential writes to the underlying media, which is really good for both HDDs and SSDs. Google's use of LSM-Trees for its SSTables made the structure more widely known, and it is now used by many open-source NoSQL databases, such as Cassandra and LevelDB.


The Good 

LSM-Trees work really well for high-speed random insertions of key/value data into a database; they sit at the top of the family of write-optimized data structures. The other important property is that their behavior is very easy to predict under various input conditions. They turn a very complex problem into a simple solution.
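The write path described above can be sketched in a few lines. This is a minimal toy (all names are illustrative, not any real store's API): random inserts land in an in-memory buffer (the memtable), and when it fills, the whole buffer is written out in one shot as an immutable sorted run — which is exactly how random insertions become sequential writes.

```python
class TinyLSM:
    """A toy LSM-Tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory buffer; sorted only on flush
        self.runs = []                # "on-disk" sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # random insert lands in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write: the entire buffer becomes a sorted run.
        run = sorted(self.memtable.items())
        self.runs.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # freshest data first
            return self.memtable[key]
        for run in self.runs:         # then newest run to oldest
            for k, v in run:
                if k == key:
                    return v
        return None
```

Reads check the memtable first and then the runs from newest to oldest, which is why the newest version of a key always wins; real implementations add bloom filters and binary search per run, omitted here for brevity.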


The Bad

The perceived downside is the periodic compaction that needs to run. Compaction by itself is not a big problem, but some things exacerbate it. One is that compaction increases write amplification on the underlying device. Most data structures have some level of amplification, but with traditional key/value stores the issue is made worse for the reason explained below.
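A tiny sketch makes the amplification concrete (entry counts stand in for bytes; the setup is hypothetical): compaction merges sorted runs by rewriting every entry, even the ones that never changed, so a one-entry update can cost many entries of device writes.

```python
def compact(runs):
    """Merge sorted runs of (key, value) pairs; newer runs win on duplicate keys.

    `runs` is ordered newest-first, so we apply the oldest run first and let
    newer entries overwrite older ones.
    """
    merged = {}
    for run in reversed(runs):
        for k, v in run:
            merged[k] = v
    return sorted(merged.items())

old_run = [("a", "v1"), ("b", "v1"), ("c", "v1"), ("d", "v1")]
new_run = [("b", "v2")]               # only one key actually changed

result = compact([new_run, old_run])

entries_ingested = len(new_run)                  # what the user wrote: 1 entry
entries_written = len(new_run) + len(result)     # the new run, then the full rewrite
amplification = entries_written / entries_ingested
print(amplification)                             # 5.0 for this one-entry update
```

The unchanged keys "a", "c", and "d" get rewritten anyway; as runs grow and get compacted repeatedly across levels, this factor multiplies, which is the amplification the article is describing.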


The Solution – Separate the Data from Metadata

The crux of the issue with the various open-source implementations is that they store ALL data in an LSM-Tree. This approach causes an explosion of write amplification during compaction, which then consumes an enormous amount of resources.

Imagine you are building a database of photos, with each photo having a small unique identifier. The identifier will be the key, and its photo will be the value, in classic key/value database terminology. Inserting new photos will work really well. But because the key and the value are stored together in the database, compaction will cause a huge amount of write amplification even though most photos are not changing at all.

All of this can be solved with one simple idea. Just store photos (the data) and the keys (the metadata) separately. The metadata tends to be tiny compared to the actual data. That’s what we have done at Datrium to build a PB-scale dedupe storage system. We use LSM-Trees to store the dedupe index, which is tiny in size when compared to the actual data. Because the metadata is so small, the write amplification does not matter. We reap the benefits of LSM-Trees without paying a huge price on the downside.
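The separation can be sketched as follows. This is a minimal toy, not Datrium's actual design: photo bytes are appended once to a sequential data log, and the index (standing in here for the LSM-Tree of metadata) stores only a small (key → offset, length) entry. Compaction then only ever rewrites these tiny index entries, never the photos.

```python
class DataLog:
    """An append-only log: each value is written sequentially, exactly once."""

    def __init__(self):
        self.buf = bytearray()

    def append(self, blob):
        offset = len(self.buf)
        self.buf.extend(blob)            # one sequential write
        return offset, len(blob)


class PhotoStore:
    """Keys (metadata) and photo bytes (data) live in separate structures."""

    def __init__(self):
        self.log = DataLog()
        self.index = {}                  # stands in for the metadata LSM-Tree

    def put(self, photo_id, photo_bytes):
        # The index entry is a few bytes regardless of how big the photo is.
        self.index[photo_id] = self.log.append(photo_bytes)

    def get(self, photo_id):
        offset, length = self.index[photo_id]
        return bytes(self.log.buf[offset : offset + length])
```

The key point is the size asymmetry: a megabyte photo costs one sequential log write plus an index entry of a few dozen bytes, so even heavy compaction of the index touches a negligible fraction of the total data.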

We store the actual data itself in a log-structured filesystem (LFS). This is another data structure, invented in 1992 (by Mendel Rosenblum, who also happens to be a co-founder of VMware). Similar to LSM-Trees, it converts all random input writes into sequential writes to the underlying device. We feel both LSM-Trees and LFS have incredible properties that simplify a lot of things in a storage system at scale while operating at very high speed. Here is a small write-up that explains the beauty of LFS.


LSM-Trees Have Other Useful Properties

Besides the dedupe index, we have found other use cases where LSM-Trees add value. With storage system snapshots, one also needs metadata to manage the snapshot data. We use LSM-Trees for the snapshot metadata to create versioning trees with fine-grained snapshots, and we are able to store and handle millions of snapshots.

Diff’ing across snapshots is a common need in storage systems (e.g., for replication). LSM-Trees offer a high-speed diff’ing mechanism because the diffs translate into sequential reads. We are also able to diff in reverse time order of snapshots, which provides a more robust business continuity solution.
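Because each snapshot's keys live in sorted runs, diff'ing two snapshots is a single merge-style pass over both sequences — all sequential reads, no random lookups. A minimal sketch (function and names are illustrative, not the article's actual system):

```python
def snapshot_diff(old_run, new_run):
    """Diff two snapshots given as sorted lists of (key, value) pairs.

    Walks both runs in one sequential pass, like the merge step of
    merge sort. Returns (added, removed, changed) key lists.
    """
    added, removed, changed = [], [], []
    i = j = 0
    while i < len(old_run) and j < len(new_run):
        old_key, old_val = old_run[i]
        new_key, new_val = new_run[j]
        if old_key == new_key:
            if old_val != new_val:
                changed.append(old_key)   # same key, different value
            i += 1
            j += 1
        elif old_key < new_key:
            removed.append(old_key)       # present only in the old snapshot
            i += 1
        else:
            added.append(new_key)         # present only in the new snapshot
            j += 1
    removed += [k for k, _ in old_run[i:]]  # old-snapshot tail
    added += [k for k, _ in new_run[j:]]    # new-snapshot tail
    return added, removed, changed
```

Since both inputs are consumed strictly in order, the cost is linear in the size of the runs and every read is sequential — which is why this pattern is fast on disks and, as noted below, on object stores too.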

Finally, LSM-Trees and LFS are well suited to the cloud world as well, because they thrive on object stores like AWS S3. S3 is very good at sequential I/O but much weaker at random I/O patterns. Log-based data structures really shine on S3.


Different Structures Have Different Tradeoffs

Overall, there are many different types of data structures to choose from when designing a storage system or a database, and each has its own tradeoffs. We have found that using LFS and LSM-Trees together (with our own flavor of algorithm innovations on top) is extremely compelling across many different kinds of devices (HDD or SSD) as well as cloud object stores (AWS S3). These two data structures future-proof the core storage system design.