OCI is to HCI as HCI is to SANs


Sometimes people ask me whether Datrium is HCI.  That’s like asking whether HCI is primary storage.  It is, but it’s more and less than that, and different as a result.

Datrium has an HCI-like subsystem for hosts.  Like HCI, we solve the SAN controller bottleneck/upgrade problem and scale speed automatically with hosts.  We also do some things much better for the high end, such as stateless hosts and scalable low latency for tier 1 apps thanks to split provisioning, and we let you use any leading server and host SSDs.

But more important is what we do that’s truly different.  In Datrium, VM backup and cloud data management are fundamental parts of the system design, with the always-on data cost efficiency of a scale-out backup vendor, in the same system that provides the ridiculously high performance you wish you had with your AFA.  This way, you don’t need two systems.  Infrastructure becomes much simpler and lower cost.

In the industry landscape, HCI and backup are actually pulling farther apart.

Backup vendors worry about low $/TB, and latency is less critical.  Backup is insurance; it can’t cost as much as the thing you’re protecting.  Purpose-built backup systems focus on streaming IO and low $/TB, so they are typically built on large 7.2k RPM disk drives.  These drives are roughly an order of magnitude lower in $/TB than inexpensive SSDs.

HCI vendors worry about low latency, and system costs keep rising.  Because heritage HCI can have bad surprises in latency and data migration/rebuilds, these systems are increasingly turning to small all-flash clusters.  For example, HPE promotes SimpliVity only as all-flash, and some vendors offer decent data reduction only in all-flash configurations.  Once they go all-flash, HCI’s homogeneous nodes are uncompetitive in $/TB for backup.

Some HCI vendors are even spending a lot of time building an alternative hypervisor.  Is that really the thing IT people hate in their current environment?  We haven’t found that it is.  But they do hate backup & DR.

Why this is in Datrium’s mission: the primary/backup split keeps infrastructure too complicated.  Backup has always sucked.  IT people have better things to worry about.  Well-known problems with separated backup include these, all solved by Datrium today:

  • Administrators have yet another system to buy, learn and operate across two different management planes. The VM data lifecycle ends up being a kit to assemble.  Only a few combinations are tested together by vendors, and it’s fragile to maintain.
  • Slow – higher RTOs than snap restart. Data has to be copied back to primary before a VM can be restarted, adding time. Copies going back to primary need to be rehydrated, so they travel at full, unreduced size, which slows completion further.
  • Longer RPOs than snaps: backups are a pain and take too long compared to snaps, so they run less often.
  • Encryption is not end to end. If backup systems are to have a prayer of data reduction for lower $/TB, they have to receive clear data.
  • Data reduction is not always-on or end-to-end. In hybrid and disk arrays, it’s almost never a default. Most HCI does not dedupe its cache, dedupe is node-local, and dedupe (like erasure coding) hurts speed.
  • Data verification is weak and inconsistent across the data lifecycle.  Most IT teams rarely test recovery on more than a small share of their data.  If there is corruption in primary from, for example, RF=2 levels of data loss, there is generally no automatic way to test data copies locally for data correctness, let alone across two separated storage systems.
  • Separate backup catalogs need to be added to SANs. SANs can’t see individual VMs for recovery; they’re hidden in LUNs. Restoring a volume snap doesn’t restore a VM.
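
To make the RTO point in the list above concrete, here is a rough back-of-the-envelope sketch in Python.  It is not a model of any particular product.  The function names, the 2 TB VM size, the 1 GB/s restore rate and the 60-second boot time are all hypothetical figures chosen only for illustration.  The one idea it takes from the text is that a copy-back restore has to move the full rehydrated size before the VM can start, while a snap restart does not copy anything first.

```python
# Rough RTO comparison: copy-back restore vs. snapshot restart.
# All figures below are hypothetical and chosen for illustration only.

TB = 10**12  # bytes in a decimal terabyte
GB = 10**9   # bytes in a decimal gigabyte


def copy_back_rto_seconds(logical_bytes: int,
                          restore_bytes_per_s: float,
                          boot_seconds: float) -> float:
    """Time until the VM is running when the backup is copied back first.

    The copy moves the full logical (rehydrated) size, not the reduced
    size stored on the backup system, and only then can the VM boot.
    """
    if restore_bytes_per_s <= 0:
        raise ValueError("restore rate must be positive")
    return logical_bytes / restore_bytes_per_s + boot_seconds


def snap_restart_rto_seconds(boot_seconds: float) -> float:
    """Time until the VM is running when it restarts from a local snap.

    No data is copied before restart, so only the boot time remains.
    """
    return boot_seconds


# Hypothetical inputs: a 2 TB VM, 1 GB/s effective restore rate, 60 s boot.
vm_logical_bytes = 2 * TB
restore_rate = 1 * GB
boot = 60.0

copy_back = copy_back_rto_seconds(vm_logical_bytes, restore_rate, boot)
snap = snap_restart_rto_seconds(boot)

print(f"copy-back restore: {copy_back / 60:.1f} min")
print(f"snap restart:      {snap / 60:.1f} min")
```

With these assumed inputs the copy alone takes 2,000 seconds, so the gap is dominated by data movement, and it grows linearly with VM size.  Real restore rates and boot times vary widely, so treat the output as an illustration of the shape of the problem rather than a measurement.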

If you want to simplify to be more cloud-like on-prem and use a simpler and more cost-effective Cloud DR service, separated backup has to go.