Disaster Recovery Use Case #3 – Power Failure

The Datrium Automatrix platform offers a comprehensive set of features that enable fast and cost-efficient disaster recovery (DR) mitigation for enterprises. In this blog post series, we’re looking at the four most recognizable use cases that cause organizations to trigger a DR plan:

 

Power failure

Natural disasters

Human error

Ransomware

 

You’ll learn how Datrium helps organizations in each of these cases. With every new blog post in this series, we’ll introduce more details about the platform’s features and capabilities.

Previously, we discussed how Automatrix provides resilience for partial or entire application recovery. We also covered full data center recovery, including using on-demand VMware Cloud on AWS as the DR target, which offers 10x more cost-effective DR.

In this article, we’ll cover how a DR plan can help companies recover from a power failure.

 

Power Failure

In the previous post, we talked about the 2016 report by the Ponemon Institute that found human error to be the second most common reason for data center downtime. UPS system and generator failures combined account for 31%, making them the most prominent reason for data center failures. Variants of power outages include:

  • Generator fails to start
  • Generator fails after X number of hours running
  • Street power fails partially (usually one of three phases)
  • UPS fails to switch to battery
  • UPS fails to switch back from battery to input power
  • UPS keeps running on battery because both street power and the generator have failed, and a total data center outage is imminent

Some precautions to avoid power failures include N+1 power redundancy for maximum system availability, better circuit load management, and dual power protection.

 

PG&E Phenomenon

Even if the data center is adequately protected against power failures, some events will entirely fall outside the norm. If a UPS runs out, and generators cannot withstand multi-day outages, you could be faced with a total data center outage. This situation is precisely what’s happening now in California.

In an attempt to avoid sparking a wildfire, California’s largest utility intentionally cut power to hundreds of thousands of customers, and energy isn’t likely to be restored for days. Some authorities are asking the population to prepare to be without power for as long as seven days.

While data centers are likely protected with 24-hour generators, branch offices with server rooms and edge locations, like cellular towers, are not protected, and in some regions, mobile phones aren’t working.
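
As a rough illustration of why a multi-day shutoff is different from an ordinary outage, the sketch below compares a site's backup runtime against an expected outage length. The `BackupPower` fields, the example runtimes, and the simple "UPS plus generator" sum are all assumptions made for this example (the sum is an optimistic upper bound); real sizing depends on load, fuel resupply, and transfer behavior.

```python
from dataclasses import dataclass


@dataclass
class BackupPower:
    """Hypothetical description of a site's backup power (illustrative only)."""
    ups_minutes: float      # battery runtime at the current load
    generator_hours: float  # runtime on stored fuel; 0 if the site has no generator


def hours_covered(power: BackupPower) -> float:
    """Optimistic upper bound on hours the site can run without street power."""
    return power.ups_minutes / 60.0 + power.generator_hours


def needs_failover(power: BackupPower, expected_outage_hours: float) -> bool:
    """True when backup power is expected to run out before the outage ends."""
    return hours_covered(power) < expected_outage_hours


# Assumed example sites: a data center with a 24-hour generator,
# and a branch office server room with only a small UPS.
data_center = BackupPower(ups_minutes=30, generator_hours=24)
branch_office = BackupPower(ups_minutes=15, generator_hours=0)
seven_days = 7 * 24  # outage length some authorities asked people to prepare for

for name, site in [("data center", data_center), ("branch office", branch_office)]:
    print(name, hours_covered(site), needs_failover(site, seven_days))
```

Under these assumed numbers, both sites fall short of a seven-day outage, which is the case where failing over to a DR target outside the affected zone becomes necessary.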

In this scenario, organizations can preemptively move DR applications to different regions, but in many cases, specifically with natural disasters, the migration occurs post-event.

How can this problem be mitigated?

 

Data Center Recovery

When the entire data center is experiencing a power failure, applications must fail over to a DR target outside of the affected zone.

ControlShift is Datrium’s DR orchestration solution, delivered as a SaaS application. It’s driven by the same policy and snapshot system that enables backup in Automatrix. With ControlShift, it’s simple to fail over applications running on a Datrium production site to a DR site, and just as simple to fail back. Read the previous post to understand ControlShift features.

 

Disaster Recovery with VMware Cloud on AWS

In the previous article, we explained how all Datrium services are deployed as Amazon Machine Images (AMIs) into a Datrium-created Virtual Private Cloud (VPC) and Subnet. VPC endpoints are used to access all external services required by ControlShift and Cloud DVX, and they’re created automatically. All components are monitored and restarted for high availability and resilience.

VMware Cloud provides a vSphere-based execution environment as a DR target. A VMware Cloud SDDC can be provisioned on demand via ControlShift, and a provisioned SDDC incurs hourly charges. Upon DR test completion, the SDDC is decommissioned via ControlShift. 

ControlShift performs automated network configurations for both AWS and VMware Cloud to make S3 backups from Cloud DVX available for spin-up in SDDC. The SDDC is managed using the familiar vCenter interface.

 

 

When it’s time to restore operations on your primary data center, ControlShift efficiently fails back applications with minimal AWS egress charges by transferring only changed and globally deduplicated data; similar to failover, failback is fully automated. Data changes that occur while executing in the VMware Cloud are captured and stored as a Cloud DVX snapshot in S3.
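
To make the idea of "transferring only changed and globally deduplicated data" concrete, here is a minimal sketch of content-hash deduplication. It is not Datrium's implementation: the fixed 4-byte chunk size, the function names, and the in-memory set of known digests are assumptions chosen to keep the example small.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int = 4) -> list[str]:
    """Split data into fixed-size chunks and return a SHA-256 digest per chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_transfer(cloud_state: bytes, known_digests: set[str],
                       chunk_size: int = 4) -> list[bytes]:
    """Return only the chunks whose content the destination does not already hold."""
    to_send = []
    seen = set(known_digests)  # copy so the caller's set is not modified
    for i in range(0, len(cloud_state), chunk_size):
        chunk = cloud_state[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in seen:
            to_send.append(chunk)
            seen.add(digest)  # a chunk repeated within this transfer is sent once
    return to_send


# Hypothetical data: the state the primary site already holds,
# and the state after the workload ran in the cloud for a while.
before_failover = b"AAAABBBBCCCCDDDD"
after_running_in_cloud = b"AAAAEEEECCCCEEEE"

known = set(chunk_hashes(before_failover))
delta = chunks_to_transfer(after_running_in_cloud, known)
print(delta)  # [b'EEEE']
```

In this toy case, only one 4-byte chunk out of 16 bytes has to cross the network on failback, which is why this kind of transfer keeps egress charges low.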

Our on-demand model for DRaaS radically changes the economics for DR. Customers have reported saving almost 90% over traditional DR approaches, such as having a secondary physical site or an “always-on” DR environment in the cloud.

DRaaS enables you to provision an on-demand SDDC in VMware Cloud on AWS and pay as you go – for testing or in the event of a disaster. The only steady-state cost is storing data-reduced backups on S3. You get protection from power failures, ransomware, and natural disasters in a single solution. Unlike other DR solutions, we keep virtual machines in their native vSphere format, which eliminates brittle, time-consuming VM disk format conversions.

While your production site is up, Cloud DVX continuously backs up your data to AWS S3 with low RPO and global dedupe to minimize costs. When disaster strikes, ControlShift immediately executes your fully compliant DR plan to fail over your workloads to an on-demand SDDC created in VMware Cloud on AWS. You get a consistent operational experience with vSphere both on premises and in the cloud, so you and your team don’t have to learn a new set of tools.

Datrium provides fully integrated purchasing, support, and billing for all components and services, including both VMware Cloud on AWS and AWS. It’s delivered as a SaaS solution that eliminates all the complexity of packaged software.

 

 

When using DRaaS, there are three primary modes to choose from:

Just-in-Time – This mode eliminates upfront infrastructure CAPEX and drastically cuts OPEX. You only pay for VMware Cloud when a disaster occurs. However, when your DR plan is triggered, you may need to wait for the SDDC to be created. After the DR event is over, the changes are synchronized back, and the SDDC is torn down.

Ahead-of-Time Deployment – In cases where a DR site has a secondary function of executing non-DR workloads during regular operation, an SDDC can be provisioned before failover. If, however, the sole purpose of the cloud DR site is to take over workload execution in the event of a disaster and it otherwise sits idle, just-in-time deployment offers significantly greater cost savings.

Pilot-Light with Cloud Burst – This mode is a compromise between the two options above. ControlShift creates an SDDC on VMware Cloud with a minimal number of hosts to fail over the most critical VMs with very low RTO. Then, on demand, new hosts are added to the SDDC to complete the failover of less essential VMs. In this mode, you pay for just a minimal number of hosts until the DR plan is triggered, and then full capacity only when the DR plan is in full effect.
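
The trade-off between the three modes can be sketched as a simple compute-cost comparison. The hourly host rate, host counts, and DR duration below are made-up numbers for illustration, not VMware Cloud on AWS pricing, and the model ignores the steady-state S3 storage cost because it applies to every mode.

```python
def monthly_compute_cost(mode: str, full_hosts: int, pilot_hosts: int,
                         dr_hours: float, host_rate: float,
                         hours_in_month: float = 730.0) -> float:
    """Estimate SDDC host cost for one month under a given DRaaS mode.

    All parameters are hypothetical inputs for this sketch:
    full_hosts  - hosts needed to run every protected VM
    pilot_hosts - always-on hosts kept for the most critical VMs
    dr_hours    - hours the DR plan is active during the month
    host_rate   - assumed price per host per hour
    """
    if mode == "just_in_time":
        # Hosts exist only while the DR plan is active.
        return full_hosts * dr_hours * host_rate
    if mode == "ahead_of_time":
        # Full capacity is provisioned all month.
        return full_hosts * hours_in_month * host_rate
    if mode == "pilot_light":
        # Minimal hosts all month, plus burst hosts only during the DR event.
        burst_hosts = full_hosts - pilot_hosts
        return (pilot_hosts * hours_in_month + burst_hosts * dr_hours) * host_rate
    raise ValueError(f"unknown mode: {mode}")


for mode in ("just_in_time", "ahead_of_time", "pilot_light"):
    cost = monthly_compute_cost(mode, full_hosts=8, pilot_hosts=2,
                                dr_hours=48, host_rate=10.0)
    print(f"{mode}: {cost:.0f}")
```

With these assumed inputs, just-in-time is the cheapest and ahead-of-time the most expensive, with pilot-light in between; the price of the cheaper modes is a longer RTO for some or all VMs.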

On Premises to Cloud Connectivity

There are a few options for connecting on-premises data centers to VMware Cloud on AWS. The four most common are listed below. Note that these options address user connectivity to applications, not Datrium replication between the on-premises site and Cloud DVX.

  • AWS Direct Connect (DX) is a service aimed at allowing enterprise customers easy access to their AWS environment. Enterprises can leverage DX to establish secure, private connectivity to the AWS global network from their data centers, office locations, or co-location environments.
  • A Layer 2 Virtual Private Network (L2VPN) can be used to extend an on-premises network. It provides a secure communications tunnel between the on-premises network and a network segment in the VMware Cloud on AWS SDDC.
  • IPsec VPN is a feature of VMware Cloud on AWS, which provides secure access to on-premises management and workload connectivity via a secure IPsec VPN tunnel.  
  • A third-party Virtual Private Network (VPN) solution can be used to extend an on-premises network to a public cloud SDDC. Many VPN providers offer a virtual appliance deployment option. These vSphere compatible appliances can be deployed in VMware Cloud to offer another method of extending an on-premises network or enabling individual users to access workloads running within VMware Cloud.

In this post, we covered power failure scenarios, DRaaS modes, and connectivity for applications running on VMware Cloud on AWS in DR mode. In our next post, we’ll cover natural disaster scenarios, and we’ll go a little deeper into the unique Datrium technology that enables data and business recovery.