Disaster Recovery Use Case #2 – Human Error

Datrium’s Automatrix platform offers a comprehensive set of features that enables fast and cost-efficient disaster recovery (DR) mitigation for enterprises. In this blog post series, we’re looking at the four most recognizable use cases that cause organizations to trigger a DR plan:

 

  • Human error
  • Power failure
  • Natural disasters
  • Ransomware

 

You’ll learn how Datrium helps organizations in each of these cases. With every new blog post in this series, we’ll introduce more details about the platform’s features and capabilities.

In the previous post, we provided an introduction to the Datrium solution and architecture. Automatrix is a single powerful platform that includes DVX with all its capabilities (primary storage, deduplication, compression, encryption, backup, and replication), Cloud DVX for long-term retention in the cloud, and ControlShift for orchestrating disaster recovery. We also introduced Datrium Disaster Recovery with VMware Cloud on AWS, which offers 10x more cost-effective DR. With its on-demand model, you only pay for DR resources in the event of a disaster or when testing.

In this article, we’ll cover how a DR plan can help with inevitable human errors.

 

Human Error

A 2016 report by the Ponemon Institute found human error to be the second most common reason for datacenter downtime, accounting for 22% of all incidents.

Another study by the Uptime Institute suggests the problem may be even worse. Uptime’s analysis of datacenter outages found that more than 70% were directly attributable to human error, and staff training was one of the biggest datacenter oversights.

 

2016 report by the Ponemon Institute

 

How can this problem be mitigated? The most common measures include shielding emergency buttons, documented methods and procedures, consistent operating practices, ongoing personnel training, and secure access policies.

Unfortunately, even if you take every precaution, human error can’t be fully eradicated. We’re all human beings, and humans make mistakes.

The single best way to safeguard against human errors and lessen the potential impact on the business is to maintain a regular, secure backup system alongside a clear recovery plan that allows you to restore operations immediately if needed.

There are two common scenarios when it comes to recovering from human error: the first is recovering applications on the same site after a partial failure, and the second is a full datacenter recovery on a DR site.

 

1. Partial or Application Recovery

When you discover a partial failure, your goal is to restore to a safe state as quickly as possible. I’ll cover the most common cases within the context of Automatrix and its capabilities.

Due to its disaggregated architecture, Automatrix can execute non-disruptive snapshots and backups of VMs and applications with an RPO as low as one minute. Restarting a snapshot into the live production environment is simple, and because there’s no need to copy data from a backup silo or from a different toolset, restores are virtually instantaneous, with an RTO close to zero.

The datastore contains the current, live version of all VMs and files, and that’s what the hypervisor sees. The hypervisor management tool allows you to browse the contents of the live datastore at any time. These files always contain the most recently written data.

The “snapstore,” on the other hand, contains point-in-time snapshots of the live datastore as it existed at earlier moments. Every time a protection group triggers a snapshot, entries are made in the snapstore that capture the contents of every file in the live datastore at that instant. All snapshots are pointer-based and instantaneous, and they do not require copying data.

Automatrix uses a “redirect on write” (ROW) technique to store incoming data. New data is always written to new locations; by contrast, copy-on-write techniques can introduce delays as changes are copied. Because only changes are stored in a snapshot and DVX only stores compressed and deduplicated data, snapshots consume relatively little capacity.
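To make the pointer-based idea concrete, here is a minimal Python sketch of a redirect-on-write store. It is a toy model for illustration only, not Datrium’s implementation: the `RowStore` class, its method names, and the use of SHA-256 as a block fingerprint are all assumptions made for this example.

```python
import hashlib


class RowStore:
    """Toy redirect-on-write store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> data; blocks are never overwritten
        self.live = {}       # file name -> list of fingerprints (live datastore)
        self.snapshots = {}  # label -> frozen copy of the pointer map (snapstore)

    def write(self, name, index, data):
        # New data lands in a new location keyed by its fingerprint,
        # so identical blocks are stored only once (deduplication).
        fp = hashlib.sha256(data).hexdigest()
        self.blocks.setdefault(fp, data)
        pointers = list(self.live.get(name, []))
        while len(pointers) <= index:
            pointers.append(None)
        pointers[index] = fp  # redirect the pointer; old block stays intact
        self.live[name] = pointers

    def snapshot(self, label):
        # Only pointers are copied, never block data.
        self.snapshots[label] = {n: tuple(p) for n, p in self.live.items()}

    def read(self, name, snapshot=None):
        view = self.live if snapshot is None else self.snapshots[snapshot]
        return b"".join(self.blocks[fp] for fp in view[name] if fp is not None)

    def restore(self, label):
        # Restoring swaps the pointer map back; no data is moved.
        self.live = {n: list(p) for n, p in self.snapshots[label].items()}


store = RowStore()
store.write("app.vmdk", 0, b"good config")
store.snapshot("before-change")
store.write("app.vmdk", 0, b"bad config")  # the human error
store.restore("before-change")             # roll back by swapping pointers
```

In this sketch, taking and restoring a snapshot only touches the pointer map, which is why both operations are independent of how much data the VM holds.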

Automatrix also supports retrieval of a single file from the context of a virtual machine operating system environment on the local site or on a remote site, either on premises or from Cloud DVX. Automatrix Guest File Restore (GFR) works in tandem with Cloud DVX, enabling guest objects, such as Microsoft Word documents, to be seamlessly retrieved from any Automatrix storage tier.

Automatrix helps organizations to instantly recover from application misconfiguration, virtual machine deletion, data corruption, file deletion, and more.

When it comes to power failures, the Datrium disaggregated architecture provides high data resiliency by using dual-powered, enterprise-grade data appliances, which combine the durability of a SAN with the simplicity of HCI. On the compute side, each server is stateless, and as long as a single server is still powered and connected to the data nodes, applications still run, albeit with limited CPU power.

 

2. Datacenter Recovery

When the entire datacenter is experiencing downtime, applications must fail over to a DR target. This type of datacenter downtime can be triggered by a power failure caused by human error, but it can also result from human actions on applications or virtualization clusters.

ControlShift is Datrium’s DR orchestration service, delivered as a SaaS application and driven by the same policy and snapshot system that enables backup in Automatrix. Using ControlShift, it’s simple to fail over applications running on a Datrium production site to a DR site, and just as simple to fail back. Here’s an overview of ControlShift features:

 

  • Runbook orchestration for VMs to restart correctly in a different datacenter.
  • Restart from current data or older backups. Unlike many DR systems, Automatrix is built to incorporate both current and old VM snapshots, so it’s ideal for ransomware or point-in-time recoveries. Ransomware can take weeks to detect, which means you may not be able to recover from the latest point. You may need to recover from a much older snapshot or backup, depending upon when your systems were impacted.
  • RCO (Recovery Compliance Objective) of 30 minutes. Because Automatrix is a consolidated data plane with a focus on VMware and Kubernetes, it’s built to perform compliance tests of all required failover/failback resources every 30 minutes. It also offers a full test bubble system.
  • DRaaS (DR-as-a-Service) provides a subscription model and is fully integrated with VMware Cloud on AWS for on-demand disaster recovery. Datrium provides fully integrated purchasing, support, and billing for all components and services, including VMware Cloud on AWS and AWS itself. It’s delivered as a SaaS solution that eliminates all the complexity of packaged software.
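The “restart from current data or older backups” point above can be sketched in a few lines of Python. This is only an illustration of the selection logic, assuming you already know (or have estimated) when the systems were first impacted; the function name `pick_recovery_point` and the daily snapshot schedule are hypothetical and are not part of any ControlShift API.

```python
from datetime import datetime, timedelta


def pick_recovery_point(snapshot_times, compromised_at):
    """Return the newest snapshot taken strictly before the compromise.

    Returns None when every available snapshot is already affected.
    """
    clean = [t for t in snapshot_times if t < compromised_at]
    return max(clean) if clean else None


# Hypothetical schedule: one snapshot per day at midnight for 30 days.
first = datetime(2020, 3, 1)
snapshots = [first + timedelta(days=d) for d in range(30)]

# Suppose the problem was introduced on March 12 at noon but only noticed
# weeks later. The latest snapshot is unusable; March 12 at midnight is not.
recovery_point = pick_recovery_point(snapshots, datetime(2020, 3, 12, 12, 0))
```

The longer the detection delay, the further back the usable recovery point sits, which is why keeping older snapshots available to the DR system matters.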

 

Disaster Recovery with VMware Cloud on AWS

At a high level, DRaaS is a complete, fully orchestrated DR solution for the VMware ecosystem that is offered as a subscription and leverages VMware Cloud and Datrium cloud-based backups.

All Datrium services are deployed as Amazon Machine Images (AMIs) into a Datrium-created Virtual Private Cloud (VPC) and Subnet. VPC endpoints are used to access all external services required by ControlShift and Cloud DVX, and they are created automatically. All components are monitored and restarted for high availability and resilience.

VMware Cloud (VMC) provides a vSphere-based execution environment as a DR target. A VMC SDDC can be provisioned on demand via ControlShift, and a provisioned SDDC incurs hourly charges. Upon DR test completion, the SDDC is decommissioned via ControlShift. 

ControlShift performs automated network configurations for both AWS and VMware Cloud to make S3 backups from Cloud DVX available for spin-up in SDDC. The SDDC is managed using the familiar vCenter interface.

When it’s time to restore operations in your primary datacenter, ControlShift efficiently fails back applications with minimal AWS egress charges by transferring only changed and globally deduplicated data. Like failover, failback is fully automated. Data changes that occur while executing in the VMware Cloud are captured and stored as a Cloud DVX snapshot in S3.
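The idea of transferring only changed, deduplicated data during failback can be illustrated with a small Python sketch. It assumes blocks are identified by a content fingerprint (SHA-256 here) and that the primary site can report which fingerprints it already holds. Both are assumptions for this example, and the function names are hypothetical rather than Datrium’s actual protocol.

```python
import hashlib


def fingerprint(block):
    """Content fingerprint used to recognize identical blocks."""
    return hashlib.sha256(block).hexdigest()


def blocks_to_transfer(cloud_blocks, onprem_fingerprints):
    """Return the unique blocks the primary site does not already hold.

    Keying the result by fingerprint means a block that appears many
    times in the cloud copy is sent at most once.
    """
    needed = {}
    for block in cloud_blocks:
        fp = fingerprint(block)
        if fp not in onprem_fingerprints:
            needed[fp] = block
    return needed


# The primary site still holds the data it had before failover.
onprem = {fingerprint(b"os image"), fingerprint(b"app binaries")}

# State in the cloud after running there for a while.
cloud = [b"os image", b"app binaries", b"new orders", b"new orders", b"new logs"]

delta = blocks_to_transfer(cloud, onprem)
egress_bytes = sum(len(b) for b in delta.values())
```

In this example only two of the five blocks cross the network, which is the effect that keeps egress charges low.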

When using DRaaS, there are two primary modes to choose from:

 

Just-in-Time – This mode eliminates upfront infrastructure CAPEX and drastically cuts OPEX. You only pay for VMware Cloud when a disaster occurs. However, when your DR plan is triggered, you may need to wait for an SDDC to be created, which may take approximately 90 minutes. After the DR event is over, the changes are synchronized back, and the SDDC is torn down.

 

 

 

Pilot-Light with Cloud Burst – This mode trades a small ongoing cost for a much lower RTO than Just-in-Time. ControlShift creates an SDDC on VMware Cloud with a minimal number of hosts to fail over the most critical VMs with very low RTO. Then, on demand, new hosts are added to the SDDC to complete the failover of less essential VMs. In this mode, you pay for just a minimal number of hosts until the DR plan is triggered, and then for full capacity only when DR is in full effect.
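The trade-off between the two modes can be expressed as simple arithmetic. The Python sketch below compares SDDC host charges only and ignores storage and other fees. Every number in it (the hourly host rate, host counts, and durations) is a made-up placeholder, not actual VMware Cloud on AWS or Datrium pricing; the only figure taken from this post is the approximate 90-minute SDDC creation time.

```python
# Approximate SDDC creation wait for Just-in-Time, as stated in this post.
JIT_PROVISION_MINUTES = 90


def just_in_time_cost(host_hourly, full_hosts, dr_hours):
    """Hosts are billed only while the DR plan is active."""
    return host_hourly * full_hosts * dr_hours


def pilot_light_cost(host_hourly, pilot_hosts, full_hosts, total_hours, dr_hours):
    """A few hosts run all the time; full capacity is billed only during DR."""
    standby = host_hourly * pilot_hosts * (total_hours - dr_hours)
    burst = host_hourly * full_hosts * dr_hours
    return standby + burst


# Placeholder inputs: a 720-hour month with one 48-hour DR event.
rate = 10          # hypothetical cost per host-hour
pilot_hosts = 2    # hypothetical minimal SDDC size
full_hosts = 8     # hypothetical size needed for full failover

jit = just_in_time_cost(rate, full_hosts, dr_hours=48)
pilot = pilot_light_cost(rate, pilot_hosts, full_hosts, total_hours=720, dr_hours=48)
extra_for_low_rto = pilot - jit  # what the always-on pilot hosts cost
```

With these placeholder values, the difference between the two totals is exactly the standby cost of the pilot hosts. That is the price paid to avoid the SDDC creation wait for the most critical VMs.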

In this post, we covered human error scenarios and DRaaS modes. In our next post, we’ll cover power failure scenarios. We’ll go a little deeper into unique Datrium technology that enables data and business recovery.