Cloud Backup != Cloud DR – Find Out Why

According to The State of Enterprise Data Resiliency and Disaster Recovery 2019, disasters ranging from natural events to power outages to ransomware affected more than 50% of enterprises in the last 24 months. Recovering from disasters quickly, simply, and economically is more important than ever. A key part of disaster recovery is getting the data to a second site that’s unaffected by the disaster and has compute resources available for post-recovery operation.

But first, a little analyst terminology. Everyone uses backup and knows what it is, but what is DR? We use Gartner's definition of what it means to an enterprise: “Disaster Recovery means the methods and procedures for returning a data center to its fully operational state after a catastrophic interruption.” In practice, that breaks down into three steps:

 

  • Step 1: Access to the right data in a different infrastructure. If recovering from ransomware, the right data might be from months ago.
  • Step 2: Bring up the workloads, in the right order, on the right systems, dealing with differences in networking, etc. That is vastly more practical and automatable for virtualized workloads than for physical ones.
  • Step 3: Fail everything back to the originating site, with the same concerns for workload sequencing, mapping, etc. These last two steps require runbook orchestration.
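To make the "right order" part of Steps 2 and 3 concrete, here is a minimal sketch of how a runbook might derive boot waves from workload dependencies. The VM names and the dependency map are hypothetical, and this illustrates the general idea of dependency ordering rather than any particular vendor's orchestration engine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workloads: each VM maps to the VMs that must be running first.
DEPENDENCIES = {
    "dns": set(),
    "ntp": set(),
    "database": {"dns"},
    "app-server": {"database"},
    "web-frontend": {"app-server", "dns"},
}


def boot_waves(dependencies):
    """Group workloads into waves; every VM in a wave can start in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # VMs whose prerequisites are all up
        waves.append(ready)
        sorter.done(*ready)  # mark them started so their dependents unlock
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(boot_waves(DEPENDENCIES), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The same ordering question comes up again on failback, which is why both steps lean on orchestration rather than manual procedures.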

Backup and storage vendors sometimes forget that it’s not just about Step 1.

In theory, the public cloud should be a perfect fit for disaster recovery: get your data to a geographical region of your choice on low-cost media (e.g. AWS S3), then spin up on-demand compute when disaster strikes so you can work with that data. But there are some clear reasons why Cloud DR is still rare in the wild.

 

Only Some Clouds are Perfect for DR

vSphere is far and away the most common on-premises hypervisor, with a 90% attach rate. Methods for disaster recovery based on VM conversion from an on-premises vSphere format to a native public cloud format have been around for 5-10 years now. These methods are very problematic for three reasons.

  • A cloud-native management stack is drastically different from a vSphere management stack, which is not the kind of situation to deal with in the middle of a disaster. Would you like to read the manual when the plane’s engine is on fire?
  • VM conversion is fragile and buggy. It never works for all the VMs you need.
  • VM conversion is extremely slow. A reasonable range is 250-500MBps to convert a VM from a snapshot on S3 into cloud-native format. For a 100TB dataset, you’re looking at roughly 2-5 days to convert your workloads to cloud-native format. The recovery only begins after that long process! That’s impractical in the extreme.
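The conversion time follows directly from dataset size and sustained throughput. The sketch below works that out, assuming decimal units (1 TB = 10^6 MB) and a rate that holds steady for the whole run; the function name is ours, for illustration only.

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # assumption: decimal units, 1 TB = 10**6 MB


def days_to_process(dataset_tb, rate_mb_per_s):
    """Days needed to push dataset_tb terabytes through at rate_mb_per_s."""
    seconds = dataset_tb * MB_PER_TB / rate_mb_per_s
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # The 250-500MBps conversion range, applied to a 100TB dataset.
    for rate in (250, 500):
        print(f"{rate} MBps: {days_to_process(100, rate):.1f} days")
```

Real conversions rarely hold their peak rate end to end, so these figures are best read as a lower bound.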

 

Enter VMware Cloud on AWS! But…

VMware Cloud on AWS solves the first two problems above: the exact same on-premises vSphere stack is now available in the public cloud, offered as a service, on demand. No need to learn a new management stack or rely on fragile VM conversion processes. But what about the data on premises?

The standard answer is to use backup technology – there’s a whole slew of vendors who can ingest data from on-premises data centers and land it on S3. In the event of a disaster, some of these vendors can provision an on-demand SDDC on VMware Cloud on AWS and copy data from AWS S3 (without any conversion) into that SDDC. Then you can execute the DR runbook.

That avoids the conversion issue, but it’s still as impractical as conversion-based methods because of the copying step. S3 read throughput and latency can vary wildly, and there’s a single 25Gbit xENI link between AWS and VMware Cloud on AWS. Long story short, 500MBps to 1GBps is a reasonable rate of copying data over the xENI link between AWS and VMware Cloud on AWS. For a 100TB dataset, it still takes approximately 1.5-2 days before the data is all there on the SDDC. After that, the actual recovery can start. Again, this is impractical in the extreme.
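To put the link itself in perspective, here is a small sketch comparing the theoretical line rate of a 25Gbit link with the 500MBps-1GBps range quoted above. It assumes decimal units and ignores protocol overhead, so the line-rate figure is a best case rather than something to plan around.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def copy_hours(dataset_tb, rate_gb_per_s):
    """Hours to copy dataset_tb terabytes at a sustained rate in GB/s."""
    seconds = dataset_tb * 1_000 / rate_gb_per_s  # assumption: 1 TB = 1000 GB
    return seconds / 3_600


if __name__ == "__main__":
    line_rate = gbit_to_gbyte_per_s(25)  # best case, no protocol overhead
    print(f"line rate {line_rate:.3f} GBps: {copy_hours(100, line_rate):.1f} h")
    for rate in (0.5, 1.0):  # the realistic range quoted in the text
        print(f"{rate} GBps: {copy_hours(100, rate):.1f} h")
```

Even at full line rate, 100TB takes most of a working day to land, and that is before a single VM boots.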

 

Converging Cloud Backup and DR

Backup solutions, especially in the cloud, can’t run workloads on VMware Cloud directly from their data on S3. That includes backup vendors who are known for “live mount” recoveries on premises. The random I/O required to run a workload is simply not supported by existing backup solutions that are built on object stores, especially when bringing up a whole data center with hundreds of TBs of data.

The ideal solution should have the following properties:

  • Ingest data from vSphere environments on premises
  • Store the data with backup-style data reduction and retention on low-cost media like AWS S3
  • Avoid VM conversion and management stack differences by leveraging SDDCs on VMware Cloud on AWS
  • Bring up workloads from the S3 backups onto the SDDC as live datastores, so VMs boot directly with no copy step

 

That’s precisely Datrium’s approach to Cloud DR. Datrium DRaaS enables the customer to ingest data (in VMDKs) from on-premises data centers into S3. This data is stored in a compressed, encrypted, deduplicated format on AWS S3 as native vSphere VMDKs. (Find out more about this topic in our blog post, Global Dedupe Meets Multi-Cloud: Building blocks for Tier-1 Cloud Backup.) When disaster strikes, the customer can provision an on-demand SDDC on VMware Cloud on AWS using Datrium DR orchestration software. 

The key technology is the next piece: Datrium built a cloud-native NFS datastore using EC2 (storage I/O processing), EC2 Instance Flash (caching on NVMe flash), and backups on low-cost media (AWS S3). Caching on NVMe flash absorbs the random I/O, and EC2 supplies scalable storage processing. That is, the backups in S3 are now instantly executable from a live NFS datastore with support for random I/O. ESX hosts in the cloud mount this live NFS datastore. The DR runbooks then execute immediately, and thousands of VMs can start running with Instant RTO: there’s no copying or VM conversion.
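The general idea of putting a fast cache in front of an object store can be sketched in a few lines. What follows is a toy read-through LRU block cache, not Datrium's implementation: the class names, the 4KB block size, and the in-memory stand-in for S3 are all assumptions made for illustration.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # hypothetical block size in bytes


class ObjectStoreStub:
    """Stands in for S3: whole-block reads keyed by block number."""

    def __init__(self, blocks):
        self._blocks = blocks
        self.reads = 0  # counts trips to the slow tier

    def get_block(self, block_no):
        self.reads += 1
        return self._blocks[block_no]


class ReadThroughCache:
    """LRU block cache standing in for the NVMe flash tier."""

    def __init__(self, backend, capacity_blocks):
        self._backend = backend
        self._capacity = capacity_blocks
        self._cache = OrderedDict()

    def read(self, offset, length):
        """Serve a random read, fetching only blocks that are not cached."""
        if length <= 0:
            return b""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        data = b"".join(self._block(n) for n in range(first, last + 1))
        start = offset - first * BLOCK_SIZE
        return data[start:start + length]

    def _block(self, block_no):
        if block_no in self._cache:
            self._cache.move_to_end(block_no)  # mark as most recently used
            return self._cache[block_no]
        block = self._backend.get_block(block_no)  # cache miss: go to the store
        self._cache[block_no] = block
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return block


if __name__ == "__main__":
    store = ObjectStoreStub({n: bytes([n]) * BLOCK_SIZE for n in range(8)})
    cache = ReadThroughCache(store, capacity_blocks=4)
    cache.read(4090, 10)   # spans blocks 0 and 1, so two backend reads
    cache.read(4096, 100)  # block 1 is already cached, so no backend read
    print("backend reads:", store.reads)
```

The point of the sketch is the shape of the read path: small random reads hit flash once a block is warm, and the object store only sees misses.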

What makes this approach possible is the convergence of backup (low-cost media, granular recovery) and DR (orchestration software, random I/O performance). Public cloud has long been the holy grail for DR. At long last, it’s a reality with Datrium DRaaS and VMware Cloud on AWS.