Three Problems Solved, or the RTO Zero Solution

I first became aware of the challenges of protecting data when NetApp recruited me to solve their backup problem back in 1999. I learned that protecting data had three key components: the backup itself, in which all the data is repeatedly copied out; tape, the storage media for the backups; and the restore process of copying needed data back to repopulate the primary storage system. It turns out all three are big problems. And infrastructure managed this way needs high-performance primary storage, separate cost-efficient backup storage, and backup applications to manage the data movement between the two. My focus over the last nearly 20 years has been to solve these problems once and for all and simplify data management.

The Three Problems

As disk capacities grew and data sets grew to fill them, the load of repeatedly copying out data to back it up also grew. Disk bandwidth was not keeping up with the growth in capacity, and the inability to complete backups within the available window was limiting the practical size of a volume. Repeatedly copying data for backup is a problem.
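
To put rough numbers on this, here is a small back-of-the-envelope sketch in Python; the volume sizes, the 200 MB/s sustained read rate, and the eight-hour backup window are illustrative assumptions, not measurements from any particular system.

    # Back-of-the-envelope view of the backup-window squeeze. The sizes,
    # the sustained read rate, and the window length are assumptions chosen
    # only to illustrate the trend.

    def full_backup_hours(volume_tb: float, read_mb_per_s: float) -> float:
        """Hours needed to read an entire volume once at a sustained rate."""
        volume_mb = volume_tb * 1_000_000           # TB -> MB (decimal units)
        return volume_mb / read_mb_per_s / 3600     # seconds -> hours

    BACKUP_WINDOW_HOURS = 8                         # e.g. an overnight window

    for tb in (1, 4, 16, 64):
        hours = full_backup_hours(tb, read_mb_per_s=200)
        verdict = "fits in" if hours <= BACKUP_WINDOW_HOURS else "overruns"
        print(f"{tb:3d} TB volume: {hours:6.1f} h to copy -> "
              f"{verdict} an {BACKUP_WINDOW_HOURS} h window")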

Backup data can be twenty times the size of primary data because of the many full and incremental copies that need to be retained. Tape kept down the cost of storing all this backup data. And it was mobile, which meant backup data could be loaded on a truck and secured against a site-level failure. But tape was finicky: data needed to be streamed at the right rate, yet some data sources couldn't keep up and others were bottlenecked. Getting data offsite was manual and therefore error prone. And tape itself was unreliable, so restores could fail. Relying on tape for storing backups is a problem.
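
To see where a figure like twenty times can come from, here is a small Python sketch; the retention schedule, the 2% daily change rate, and the 10 TB primary size are common rules of thumb assumed purely for illustration.

    # Why retained backup data can reach roughly twenty times the primary.
    # The schedule, retention, and 2% daily change rate are rules of thumb
    # assumed for illustration only.

    PRIMARY_TB         = 10
    DAILY_CHANGE_RATE  = 0.02   # ~2% of the data changes per day
    WEEKLY_FULLS_KEPT  = 8      # two months of weekly fulls
    MONTHLY_FULLS_KEPT = 12     # a year of monthly fulls
    DAILY_INCRS_KEPT   = 14     # two weeks of daily incrementals

    full_tb = (WEEKLY_FULLS_KEPT + MONTHLY_FULLS_KEPT) * PRIMARY_TB
    incr_tb = DAILY_INCRS_KEPT * PRIMARY_TB * DAILY_CHANGE_RATE
    total_tb = full_tb + incr_tb

    print(f"Retained backup data: {total_tb:.0f} TB "
          f"(~{total_tb / PRIMARY_TB:.0f}x the {PRIMARY_TB} TB primary)")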

When data is corrupted or lost, there is often a user, or many users, waiting to get it back. Restore involves copying data back into the primary storage system, and that simply takes time. Tapes may need to be brought back on site. But even if the data is local, restoring a lot of data is limited by the read bandwidth of the backup media and the write bandwidth of the primary storage. Everyone's desired Recovery Time Objective (RTO) is zero, but restore copying makes that objective impossible.
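
The floor on a copy-based restore is easy to sketch: the time is gated by the slower of the backup read path and the primary write path. The bandwidth and data-set figures below are illustrative assumptions.

    # Lower bound on a copy-based restore: the slower of the backup read
    # path and the primary write path gates the copy. Figures are assumptions.

    def restore_hours(data_tb: float, read_mb_s: float, write_mb_s: float) -> float:
        bottleneck_mb_s = min(read_mb_s, write_mb_s)
        return data_tb * 1_000_000 / bottleneck_mb_s / 3600

    for tb in (1, 5, 20):
        print(f"{tb:3d} TB restore takes at least "
              f"{restore_hours(tb, read_mb_s=300, write_mb_s=500):.1f} h")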

Snap & Replicate

NetApp already had an efficient snapshot capability built into its WAFL file system, and this enabled a solution to the backup problem: let the snapshots be the backup. Instead of creating a snapshot so that the backup process can copy out a static, consistent image of the data, declare the snapshot to be the backup. Don't back it up; replicate it to get it off site and onto more cost-effective media. With WAFL, replicating a snapshot only required storing and replicating the blocks that changed; a full copy was no longer required. Changed blocks are usually only a couple of percent of the total, so replicating a snapshot reduces the data that needs to be moved by almost two orders of magnitude. With SnapVault, introduced in 2002, NetApp could stop the madness of repeatedly copying huge amounts of data out of the primary storage system.
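
The changed-block idea can be sketched in a few lines of Python. The dictionaries standing in for block maps and the replicate helper below are hypothetical simplifications; WAFL and SnapVault track changed blocks in their own metadata, so treat this only as an illustration of why the transfer shrinks.

    # Changed-block replication in miniature. Plain dicts stand in for block
    # maps; real systems (WAFL/SnapVault) track changed blocks in their own
    # metadata, so this is an illustration, not their implementation.

    def changed_blocks(prev_snap: dict[int, bytes],
                       new_snap: dict[int, bytes]) -> dict[int, bytes]:
        """Blocks that are new or different since the previous snapshot."""
        return {blk: data for blk, data in new_snap.items()
                if prev_snap.get(blk) != data}

    def replicate(prev_snap, new_snap, remote: dict[int, bytes]) -> int:
        """Send only the delta to the remote copy; return blocks transferred."""
        delta = changed_blocks(prev_snap, new_snap)
        remote.update(delta)
        return len(delta)

    # A 1,000-block volume in which 2% of the blocks changed between snapshots.
    snap1 = {blk: b"v1" for blk in range(1000)}
    snap2 = {**snap1, **{blk: b"v2" for blk in range(20)}}

    remote = dict(snap1)                    # the remote already holds snap1
    sent = replicate(snap1, snap2, remote)
    print(f"Replicated {sent} of {len(snap2)} blocks ({sent / len(snap2):.0%})")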

SnapVault was not a complete solution, however, because the media it replicated to was not all that much cheaper than the primary: it was just fat and slow SATA drives instead of high-performance SCSI drives. Cheaper, yes, but not comparable to tape. Long-term retention of lots of snapshots could still be expensive. Furthermore, because the drives were low-performance, they couldn’t serve as primary storage and restores still needed to copy data from the SnapVault repository back to the primary storage system.

Compress, Dedupe, and Erasure Code

Finally getting rid of tape requires more cost reduction than just fat disks. Data Domain set out to optimize the storage of backups on disk sufficiently to eliminate the need for tape. It developed data deduplication technology that was high-performance enough to receive backup data without becoming the bottleneck. It added compression to the mix and also used a log-structured file system so that it could stream backups to wide RAID6 stripes and avoid the cost overhead of mirroring. This combination of technologies was very effective in displacing tape for storing backups and moving them offsite.
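
A toy content-addressed store gives a feel for how deduplication and compression combine (erasure coding is omitted here). The fixed-size 4 KB chunks and SHA-256 fingerprints below are simplifying assumptions; Data Domain's actual pipeline, with variable-size segmentation and a log-structured layout, is far more sophisticated.

    # A toy content-addressed store combining deduplication and compression.
    # Fixed-size 4 KB chunks and SHA-256 fingerprints are simplifying
    # assumptions; this is not Data Domain's pipeline.

    import hashlib
    import zlib

    CHUNK_SIZE = 4096

    class DedupStore:
        def __init__(self):
            self.chunks: dict[str, bytes] = {}    # fingerprint -> compressed chunk
            self.logical_bytes = 0

        def write(self, data: bytes) -> list[str]:
            """Store a backup stream; return its list of chunk fingerprints."""
            recipe = []
            for off in range(0, len(data), CHUNK_SIZE):
                chunk = data[off:off + CHUNK_SIZE]
                fp = hashlib.sha256(chunk).hexdigest()
                if fp not in self.chunks:         # only new content is stored
                    self.chunks[fp] = zlib.compress(chunk)
                recipe.append(fp)
            self.logical_bytes += len(data)
            return recipe

        def physical_bytes(self) -> int:
            return sum(len(c) for c in self.chunks.values())

    # Two nearly identical "backups" collapse to one copy plus a small delta.
    store = DedupStore()
    backup1 = b"base data block " * 100_000
    backup2 = backup1 + b"a little new data"
    store.write(backup1)
    store.write(backup2)
    print(f"logical {store.logical_bytes} B -> physical {store.physical_bytes()} B")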

Data Domain was no more a complete solution than SnapVault, however, because it could not address the backup or restore problems. It could not support the random read/write workloads of primary storage. Data still needed to be backed up from primary storage and restored to primary storage when needed.

Application-Centric Snapshots

In the time since SnapVault was created, server virtualization, especially with VMware, has dissociated applications from storage management. It used to be that each application had its own volume in the storage system. When that is the case, snapshotting a storage volume effectively snapshots an application. With VMware, a single storage volume may contain data for dozens or hundreds of VMs and applications. Storage-centric data management like SnapVault does not address the need to protect and manage data for individual applications. Snapshotting and replicating a storage volume is no longer really a substitute for backup: it gets data off the primary system efficiently, but backup applications are still needed to deliver application-level data protection.
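
The distinction can be sketched concretely: when one datastore volume holds files for many VMs, an application-centric snapshot selects and captures just the VMs that make up one application. The VM names, file layout, and snapshot_vms helper below are hypothetical.

    # Volume-centric vs. application-centric snapshots. The VM names, file
    # layout, and snapshot_vms helper are hypothetical; the point is that an
    # application-centric snapshot captures just the VMs that make up one app.

    from datetime import datetime, timezone

    # One shared datastore volume holding files for many VMs.
    volume = {
        "web-01": ["web-01.vmx", "web-01-flat.vmdk"],
        "web-02": ["web-02.vmx", "web-02-flat.vmdk"],
        "db-01":  ["db-01.vmx", "db-01-flat.vmdk", "db-01-logs.vmdk"],
        # ...dozens or hundreds more VMs sharing the same volume...
    }

    def snapshot_vms(volume: dict, vm_names: list[str]) -> dict:
        """Capture one consistent, point-in-time snapshot of a group of VMs."""
        return {
            "taken_at": datetime.now(timezone.utc).isoformat(),
            "members": {vm: list(volume[vm]) for vm in vm_names},
        }

    # A volume snapshot would sweep up every VM above; an application-centric
    # snapshot covers just the multi-VM application we care about.
    app_snap = snapshot_vms(volume, ["web-01", "web-02", "db-01"])
    print(sorted(app_snap["members"]))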

Tintri pioneered storage systems with application-centric snapshots and replication. But its systems do not include the storage efficiencies needed for extended snapshot retention, so its snapshots do not eliminate the need for backup and restore.

Many Hyperconverged Infrastructure products like Nutanix also include application-centric snapshots and replication. But their bundling of compute resources with storage, and their inability to deliver performance while the capacity optimizations of deduplication, compression, and erasure coding are enabled, make them prohibitively expensive for snapshot retention.

The Trifecta Solution

The Datrium DVX is the first system that can both serve applications with world-class performance and protect them with world-class efficiency and application-centric data management. It is thus the first system that can eliminate the need for separate backup storage and backup applications. And in so doing, it can drive the RTO to zero.

To accomplish this, the DVX builds on nearly 20 years of technology development. It includes the following key technology and architectural features:

  1. Server-powered performance. In the DVX, hosts include flash, and applications running on a host run at the speed of local flash. Every server brings its own performance, so performance scales naturally as the system grows.
  2. Always-on compression, deduplication, and wide erasure coding. For snapshots to be retained as the backup, they must be stored with all three of these capacity optimizations. Simplicity demands that these optimizations need no management; they are simply on all the time. These technologies let snapshots stay resident right there in the same system serving production applications.
  3. Application-Centric Snapshots. The DVX can snapshot individual VMs, directories, Docker persistent volumes, even individual files. It can group any of these and create app-level multi-VM consistent snapshots. Some of these capabilities are beyond what even backup applications can provide.
  4. Split Provisioning. The DVX stores snapshots with active data in a scale-out set of dedicated data nodes. These nodes allow expansion for both active and snapshot data without having to add expensive compute resources at the same time. This also contains the cost of snapshot retention and simplifies management.
  5. Instant Restart. Because snapshots are stored within the primary storage system, no restore is needed before an application can be restarted from a point in time before the corruption or data loss. And in the event of a site failure, the replica DVX can bring applications back online without waiting for restores to complete, as sketched below. Data protection snapshots and disaster recovery can be handled by a pair of DVX systems.
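
A minimal copy-on-write sketch shows why restarting from a retained snapshot needs no bulk restore: the snapshot's block map is presented to the application as a writable clone, and only new writes consume space or time. This is an illustration of the general technique, not Datrium's implementation.

    # Copy-on-write "instant restart" in miniature: restart means cloning the
    # snapshot's block map, not copying data back. Illustrative only; this is
    # not Datrium's implementation.

    class Snapshot:
        def __init__(self, block_map: dict[int, bytes]):
            self.block_map = dict(block_map)      # immutable point-in-time image

    class Clone:
        """A writable view of a snapshot; nothing is copied up front."""
        def __init__(self, snap: Snapshot):
            self.base = snap
            self.overlay: dict[int, bytes] = {}   # only blocks written after restart

        def read(self, blk: int) -> bytes:
            return self.overlay.get(blk, self.base.block_map[blk])

        def write(self, blk: int, data: bytes) -> None:
            self.overlay[blk] = data              # copy-on-write, no bulk restore

    # Restarting from last night's snapshot is just creating a clone; the time
    # is independent of how much data the snapshot holds.
    nightly = Snapshot({blk: b"clean" for blk in range(1_000_000)})
    vm_disk = Clone(nightly)
    vm_disk.write(42, b"new write after restart")
    print(vm_disk.read(42), vm_disk.read(43))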

In a DVX world, snapshots are the backups, not frozen images to be backed up. Snapshots are retained right there in the primary system, so data does not need to be copied out by a separate backup application to a separate backup repository. And the snapshots are readily available for instant restart without the need to do a restore first. The snapshots are replicated to a remote DVX to protect against system and site failure. And the snapshots stored there are available for instant remote restart, for disaster recovery, or to move a workload from one site to another for other reasons. There's no separate backup application and no separate backup storage infrastructure. Simple.