CloudShift: Failproof Hybrid Cloud Disaster Recovery

The Race to Simplicity

There is continuous innovation in the IT world.  Yet, somehow, Disaster Recovery (DR) planning and execution continue to be complex and error-prone.  As shown in Figure 1, multiple products are needed to create a DR plan.  Each product is from a different vendor, and they are all loosely stitched together.  The vendors do not jointly test DR or offer guarantees, which leaves the customer with the “joy” of stitching everything together.  Despite everyone’s best efforts, these products follow Murphy’s law when they have to work in tandem during a disaster.

Figure 1: Legacy DR Setup – Complex and Scary

We decided to take a different approach to solving DR.  We wanted to give our customers a totally different kind of experience.  Here is what we wrote down as requirements for our disaster recovery offering:

  1. Must be flawless
  2. Must be bulletproof
  3. Provide transformational economics
  4. Must be simple to use, especially at scale

End-to-End Data Lifecycle Management

The only way to provide a bulletproof DR solution is if all the pieces work together flawlessly.  The only “practical” way to do this is if the end-to-end data lifecycle is managed from one platform, which is well built and well tested for all DR scenarios.

Datrium was born in a multi-cloud world.  We created a software platform for managing the entire data lifecycle, from primary to backup to cloud.  It is one unified platform, with one console to manage the entire lifecycle of the data.  The Datrium on-prem solution is a software-defined platform that runs on commodity hardware.  It runs applications at very high performance and includes built-in backup.

Our cloud services follow the SaaS model for simplicity.  Our first cloud service keeps deep copies for offsite backup and archiving.  We are now introducing CloudShift, our second cloud service.  The general perception is that cloud is expensive.  It doesn’t have to be that way.  CloudShift demonstrates that massive cost reduction is possible.  Datrium’s cloud services simplify IT and reduce TCO across the whole infrastructure spectrum, regardless of where the workloads run.  This is the future of IT, and Datrium wants to enable that future.


Figure 2. Datrium Hybrid Cloud: End-to-end data lifecycle management

CloudShift – DR Orchestration Service

The CloudShift DR orchestrator supports both DR failover scenarios: prem-to-prem and prem-to-cloud.  The orchestrator helps the admin create DR plans, test them, and execute them with low RPO/RTO.  We extended our culture of paranoia around data integrity to the DR plans as well.  As an example, we have built-in continual compliance checks (an industry first) of the DR plans.  Automatic checking is the only way to ensure that DR plans are ready to execute at any time.  Having a single end-to-end stack makes everything easy and assured.  The following attributes are a sample of what makes CloudShift simple and bulletproof.


Figure 3. Failover VMs to another Site or to VMware Cloud on AWS

One Console – Primary + Backup + DR Plans

DVX integrates primary and secondary storage, making it possible to use a single management console to establish backup and replication policies and to configure, test, and execute DR plans.  All abstractions are at the level of a VM or a group of VMs.

No VM conversions

Using the cloud as a DR target is practical only if it maintains operational consistency with the on-prem tools and apps.  Datrium works with VMware Cloud on AWS when failing over to the cloud.  This results in a consistent operational experience between on-prem and the cloud.  There is no conversion of VMs and no change of consoles.

Continual Compliance Checks

CloudShift automatically performs continual DR plan compliance checks to ensure that changes in the execution environment do not invalidate DR plans.  This is part of what makes DR plans bulletproof when you actually need to use them.  The system’s built-in compliance checks can pinpoint problems anywhere in the backup and DR stack.  For example, a replication failure caused by a loss of network connectivity automatically flags all affected DR plans.
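The idea of flagging every plan touched by a failure can be sketched in a few lines of Python. This is only an illustration under assumed names: `DRPlan`, `check_compliance`, and the two status dictionaries are hypothetical and do not represent the CloudShift API or its actual checks.

```python
from dataclasses import dataclass

@dataclass
class DRPlan:
    name: str
    vms: list          # names of the VMs this plan must be able to recover
    target_site: str   # where the plan fails over to

def check_compliance(plans, replication_ok, site_reachable):
    """Return {plan name: [problems]} for every plan that is not ready to run.

    replication_ok maps VM name -> bool, site_reachable maps site -> bool.
    A missing entry is treated as a failure, so unknown state is never
    silently reported as healthy.
    """
    findings = {}
    for plan in plans:
        problems = []
        if not site_reachable.get(plan.target_site, False):
            problems.append(f"target site {plan.target_site} unreachable")
        for vm in plan.vms:
            if not replication_ok.get(vm, False):
                problems.append(f"replication failing for {vm}")
        if problems:
            findings[plan.name] = problems
    return findings

plans = [
    DRPlan("finance", ["db1", "app1"], "cloud"),
    DRPlan("web", ["web1"], "cloud"),
]
status = {"db1": False, "app1": True, "web1": True}
print(check_compliance(plans, status, {"cloud": True}))
```

Running a check like this on a timer, rather than only when a plan is edited, is what would make it "continual": the plan itself has not changed, but the environment it depends on has.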

End-to-End Data Integrity Checks

A platform that manages the end-to-end data lifecycle also has the opportunity to verify everything with end-to-end integrity checks, both as data moves through the different phases locally and when it moves to a different system.  Datrium employs an efficient blockchain-like scheme to calculate cryptographic hashes of backups and primary storage to continuously validate data integrity across the entire distributed environment, both on-premises and in the cloud.
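To make "blockchain-like" concrete, here is a minimal hash-chain sketch in Python using the standard `hashlib` module. It is a generic illustration, not Datrium's actual scheme: the function names, the choice of SHA-256, and the block layout are all assumptions made for this example.

```python
import hashlib

def chain_hashes(blocks, seed=b""):
    """Return one chained SHA-256 digest per block.

    Each digest covers the block and the previous digest, so changing
    any block changes every digest from that point onward.
    """
    digests = []
    prev = hashlib.sha256(seed).digest()
    for block in blocks:
        prev = hashlib.sha256(prev + block).digest()
        digests.append(prev)
    return digests

def first_mismatch(blocks, expected_digests):
    """Return the index of the first block that fails verification, or None."""
    actual = chain_hashes(blocks)
    for i, (a, e) in enumerate(zip(actual, expected_digests)):
        if a != e:
            return i
    if len(actual) != len(expected_digests):
        return min(len(actual), len(expected_digests))
    return None

original = [b"block-a", b"block-b", b"block-c"]
recorded = chain_hashes(original)            # stored when the data is written
replica = [b"block-a", b"block-X", b"block-c"]
print(first_mismatch(original, recorded))    # intact copy
print(first_mismatch(replica, recorded))     # corrupted copy
```

Because the digests are recorded once and recomputed wherever the data lands, the same check works on the primary system, on a backup, and on a copy in the cloud.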

Automated DR Testing

The CloudShift orchestrator supports automated DR testing on a schedule.  For example, it can create a “test bubble” at the DR site, execute the plan every weekend, and follow up with a detailed report on the execution.
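As a small illustration of the "every weekend" schedule, the sketch below computes when the next weekly test run would fall. The function name and the default of Saturday at 02:00 are assumptions for this example only; they are not CloudShift settings.

```python
from datetime import datetime, timedelta

def next_weekly_run(now, weekday=5, hour=2):
    """Return the next run time strictly after `now`.

    weekday follows datetime.weekday(): Monday=0 ... Saturday=5, Sunday=6.
    """
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate

print(next_weekly_run(datetime(2024, 1, 3, 12, 0)))  # a Wednesday at noon
```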

DR Orchestration As-a-Service

Datrium’s CloudShift is delivered as-a-service: there is nothing to install and nothing to manage. The CloudShift orchestration engine runs as an AWS-based service and leverages the public cloud infrastructure to achieve high availability for its internal operation.  Monitoring and upgrades are automated and performed by Datrium as a part of the service offering.

Cloud as DR site

DR to the cloud is an attractive option and gives customers an opportunity to eliminate an on-prem site dedicated to DR.  However, a few things are needed to make this option viable: a well-integrated cloud offering that orchestrates data in and out of the cloud, continuous checks that make it dependable for business-critical applications, and one platform that takes you all the way without hiccups.  And, of course, it needs to be economical.

Upon failover to the cloud, Datrium works with VMware Cloud on AWS to provide a smooth, consistent operational experience (described above).  Datrium’s DR-to-cloud solution offers some significant advantages.

Only Pay On Disaster

The VMware Cloud servers are deployed only when they are needed for a failover during a disaster.  So, there is a charge for the servers only when disaster strikes (say, for a week).  This can translate into savings of as much as 100x, because the servers are provisioned just in time rather than kept running year-round.  Just-in-time provisioning in the cloud is what provides the transformational economics.
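The arithmetic behind this claim can be sketched as follows. The hourly rate is a placeholder, not VMware Cloud on AWS pricing, and the function names are made up for this example. The sketch covers server charges only and ignores storage and other costs.

```python
HOURS_PER_YEAR = 24 * 365

def server_cost(hourly_rate, hours):
    """Cost of running the DR servers for the given number of hours."""
    return hourly_rate * hours

def savings_ratio(failover_hours, horizon_hours=HOURS_PER_YEAR, hourly_rate=1.0):
    """How many times cheaper on-demand servers are than always-on servers."""
    always_on = server_cost(hourly_rate, horizon_hours)
    on_demand = server_cost(hourly_rate, failover_hours)
    return always_on / on_demand

one_week = 7 * 24
print(round(savings_ratio(one_week), 1))                      # one failover per year
print(round(savings_ratio(one_week, 2 * HOURS_PER_YEAR), 1))  # one failover in two years
```

The hourly rate cancels out, so the ratio depends only on how rarely the servers are needed: roughly 52x for one week of failover per year, and roughly 104x if a week-long failover happens once in two years.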

Global Dedupe

Datrium’s filesystem has global dedupe (i.e., both locally and over the WAN).  Imagine a 100TB system failing over to the cloud for a week.  During that week, 2TB of data changes.  How much data has to be transferred back to the primary site on failback?  Without global dedupe, the entire 100TB is transferred back, incurring an expensive egress charge from the cloud and taking a long time.  Datrium’s filesystem transfers back only the changed 2TB, reducing both the cost and the time to fail back.
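The failback example can be illustrated with a toy content-addressed transfer in Python. This is a generic deduplication sketch, not Datrium's filesystem: the function names, the use of SHA-256 fingerprints, and the 100-chunk data set (standing in for 100TB with 2TB changed) are assumptions for illustration.

```python
import hashlib

def fingerprint(chunk):
    """Content fingerprint used to recognize identical chunks across sites."""
    return hashlib.sha256(chunk).hexdigest()

def chunks_to_send(source_chunks, destination_fingerprints):
    """Return the unique source chunks the destination does not already hold."""
    missing = {}
    for chunk in source_chunks:
        fp = fingerprint(chunk)
        if fp not in destination_fingerprints:
            missing[fp] = chunk   # keyed by fingerprint, so duplicates collapse
    return list(missing.values())

# Primary site still holds the 100 chunks it had before the disaster.
primary = [f"chunk-{i}".encode() for i in range(100)]
primary_fps = {fingerprint(c) for c in primary}

# In the cloud, 2 of the 100 chunks were overwritten during the failover week.
cloud = primary[:98] + [b"changed-98", b"changed-99"]

to_send = chunks_to_send(cloud, primary_fps)
print(len(cloud), len(to_send))
```

Only the chunks whose fingerprints are unknown at the primary site cross the WAN, so egress volume tracks the amount of changed data rather than the total size of the system.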


Datrium is taking an end-to-end approach to solving data lifecycle problems.  The alternative is to end up with many disparate products that don’t work well together.  The future of IT wants to be simple, and we want to provide all the tools necessary to help with that.  Our cloud services are built on the SaaS model to be simple, and they let you build on your existing on-prem IT investment while you make a smooth transition to the cloud.  We are introducing our new CloudShift service to provide an easy and bulletproof path on that journey.