Achieving Data Integrity: A Journey into Paranoia

I used to work at a networking company that built switches. Before that, I worked for a database startup. Databases have strict consistency requirements: transactions must be correctly persisted, and the ACID properties preserved at all costs.

It was interesting to observe the stark contrast in rigor around software reliability between the two companies. Networking companies take for granted that lost packets will be re-sent and routes rediscovered by probing the network. Database companies are far more careful to persist transactions safely to disk.

Then, I joined DataDomain, and one of my early adventures there was to build all the RAID layers for their product.

My journey to paranoia

After some initial coding at DataDomain, I started to become paranoid about data loss. Unlike in networking, a lost write in storage means permanent data loss. After many sleepless nights, I began devising strategies to deal with it, and I asked myself what it would take to achieve zero data loss.

I even went down the path of trying to figure out whether I could mathematically prove that the code I wrote was 100% correct. I soon realized we would run out of money before I could formally prove anything. So I did the next best thing I knew: I methodically designed checks, double checks, triple checks, and “logical” checks to ensure that the code would work correctly under any condition. Looking back after all these years and 100,000 shipped systems, that approach worked quite well.

Paranoia as a strategy

There’s no doubt we are paranoid about many things at Datrium: customer satisfaction, the competition, product strategy, and most of all, data integrity. It is good to be paranoid during the day so that you can sleep well at night. It also helps that our engineers are similarly paranoid. For example, some of them previously built VMware’s core hypervisor, where executing a single instruction the wrong way can crash an application or an entire server.

We have taken data integrity to another level at Datrium. It all starts with a rigorous design process in which we challenge each other to prove that data integrity will be maintained under all conditions. We also wanted a simple, elegant filesystem architecture that makes the system easy to reason about, along with rich data management features and very high performance. This is part of what we believe makes us Tier-1.

Below are some select examples of where we have gone beyond traditional methods for data integrity. No single method ensures complete integrity, so we employ a multitude of techniques to keep data safe.

Continual filesystem verification of the “entire” data in the system – This goes above and beyond disk checksums and disk scrubbing. Filesystem verification checks the referential integrity of all data in the system several times a day with negligible impact on overall performance. This is somewhat unique in the industry. The use of crypto hashes makes continual verification quick, easy, and complete, somewhat like a blockchain. This verification is always on, including during development and testing, and gives us significant confidence in the system architecture.
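The post doesn’t describe the verification mechanism itself, but the blockchain comparison suggests a Merkle-tree-style scheme. Here is a minimal sketch of that idea, under that assumption (the function names are hypothetical, not Datrium’s): each block is hashed, hashes are folded pairwise up to a single root, and recomputing the root over all data detects any corrupt block.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Crypto hash of a data block (SHA-256 here)."""
    return hashlib.sha256(data).digest()

def merkle_root(block_hashes):
    """Fold leaf hashes pairwise up to a single root hash."""
    level = list(block_hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd counts
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def verify(blocks, expected_root) -> bool:
    """Recompute every leaf hash and the root; any corrupt block changes the root."""
    return merkle_root(h(b) for b in blocks) == expected_root
```

A single flipped bit anywhere changes the root, so one root comparison vouches for the referential integrity of everything beneath it — which is what makes a full-system check cheap enough to run several times a day.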

No-overwrite log-structured filesystem design – Once existing data has been checked and verified, interestingly, the next biggest threat to data integrity is new writes, which could corrupt old data and metadata on their way to disk. We therefore use a log-structured filesystem design that never overwrites and only writes to new places, like a log. This is made possible by our use of erasure coding with full-stripe writes.
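The core property here — new writes can never clobber old bytes — can be sketched in a few lines. This is an illustrative toy, not Datrium’s implementation: every write appends at the tail of the log and updates an index entry, so rewriting a logical block leaves the previous version’s bytes untouched.

```python
import hashlib

class AppendOnlyLog:
    """Toy no-overwrite log: every write lands at a fresh offset,
    so existing data and metadata are never modified in place."""

    def __init__(self):
        self.log = bytearray()   # stand-in for the on-disk log
        self.index = {}          # logical block id -> (offset, length, checksum)

    def write(self, block_id: str, data: bytes) -> None:
        offset = len(self.log)   # always append at the tail, never overwrite
        self.log.extend(data)
        self.index[block_id] = (offset, len(data), hashlib.sha256(data).digest())

    def read(self, block_id: str) -> bytes:
        offset, length, checksum = self.index[block_id]
        data = bytes(self.log[offset:offset + length])
        assert hashlib.sha256(data).digest() == checksum, "corruption detected"
        return data
```

In a real log-structured filesystem, stale versions are later reclaimed by garbage collection, and full-stripe writes let erasure coding protect each appended stripe without read-modify-write of old stripes.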

No knobs – We have an internal design goal of having no knobs, which leads to a less complex system. Imagine having knobs for dedupe, compression, checksums, RF2 vs. RF3, erasure coding, etc. Not only would they confuse the user, they would also complicate the filesystem architecture and place a big burden on the QA team, which would need to test every possible combination of knobs. Any combination not tested properly ends up being tested for the first time by a customer in the field.
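The testing burden grows multiplicatively with each knob. A quick illustration (the knob names below just echo the examples in the text; they are not an actual configuration schema):

```python
from itertools import product

# Hypothetical knobs, each with its possible settings.
knobs = {
    "dedupe": (True, False),
    "compression": (True, False),
    "checksums": (True, False),
    "replication": ("RF2", "RF3"),
    "erasure_coding": (True, False),
}

# Every combination QA would need to cover: the Cartesian product.
configs = list(product(*knobs.values()))
print(len(configs))  # 2**5 = 32 distinct configurations
```

Five binary choices already mean 32 configurations; each additional knob doubles the matrix. With no knobs there is exactly one configuration, and it is the one every customer runs.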

There are many more examples of unique and interesting things we have done to provide integrity for customers’ data. This comprehensive white paper demonstrates how Datrium engineers went about building data integrity into the Datrium DVX product.

It is vital for customers running enterprise applications on Datrium to know that their data is safeguarded and monitored with the most advanced technology possible. That’s why we’re paranoid, and why we invest significant time and effort in maintaining data integrity in our product.