Datrium provides convergence for the entire lifecycle of data in a datacenter: a high-performance primary storage system integrated with an efficient built-in backup system. Datrium is now extending that convergence model beyond the datacenter. Our first step in that direction is the Cloud DVX offering, with a specific focus on backup and restore to and from the public cloud.
Once backups are in the public cloud, Datrium will natively provide powerful built-in DR orchestration. It will be super simple to use, with no need to employ multiple solutions to protect your VMs. There will also be a global catalog view of all your data to make it easy to operate on.
That’s all cool, but how does one actually go about building this while retaining the simplicity of the platform? We first started with a wish list:
- Must be super easy to operate.
- Must be very budget friendly.
- Must have awesome SLAs.
- Must provide Tier-1 reliability.
Solving for two variables solves everything else
It turns out that solving the two important problems below solves all the others.
Problem #1. – AWS is hard to use
AWS has a lot of native services and resources, but they are like Lego building blocks that you will need to put together. One option is to have software that the customer downloads and runs on some compute in AWS – but that’s just too hard. Additionally, who will monitor this software? And who will upgrade it? Expecting the customer to do all this is too painful. Yes, you will have a product, but who really wants to deal with this complexity? So, what does it take to provide an experience that is actually painless, and possibly joyful?
Problem #2. – AWS is expensive
The AWS pricing guide book is 80 pages long—everything can get expensive, fast. Backup is the simplest of use-cases for public cloud, but it still gets very expensive if the design is not thoughtful. For example, if you do weekly fulls and daily incrementals to AWS for a 60-day retention or longer, your bills will explode. Not to mention the load on your WAN. Plus, if you are going to back up legacy LUNs, that’s just rubbing salt into the wounds. So, what does it truly take to provide cost efficiency for an enterprise scale cloud backup?
Solving for Cost Efficiency
Cost Efficiency #1. – LFS meets S3
It is useful to make data as cost efficient as possible, even more so in a public cloud. A low RTO is desired for a backup use case, and that makes the case for using either EBS or S3. EBS is easier to use, but is very expensive and can lose data. S3 is much cheaper and highly scalable, but it imposes certain consistency restrictions: the data needs to be packaged as S3 objects while avoiding partial overwrites.
In a previous article, we explained how a Log-Structured Filesystem (LFS) is the superpower behind the full stack of data services offered in Datrium’s DVX product. We said “the sky’s the limit”. Well, is that really true as we extend beyond the datacenter?
As weird as the S3 restrictions are, it turns out they are a match made in heaven (or rather, in the cloud) for LFS. The LFS design produces large sequential blobs that map directly to S3 objects and are never partially overwritten. Datrium’s software uses AWS S3 without needing any changes to the filesystem. Even we are a bit dazzled by how simple it was to implement our new Cloud DVX offering.
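The fit between LFS and S3 can be sketched in a few lines. This is a toy illustration, not Datrium's actual format: the class name, segment size, and dict-as-bucket are all assumptions made up for the example. The point it shows is that an LFS writer only ever appends, seals a segment, and writes it as one whole object, which is exactly S3's PUT-whole-object model with no partial overwrites.

```python
import io

class SegmentWriter:
    """Toy log-structured writer: buffers appends into a large sequential
    segment, then seals the segment as one immutable blob. Hypothetical
    sketch; names and sizes are illustrative, not Datrium's real format."""

    SEGMENT_SIZE = 4 * 1024 * 1024  # e.g. a few MB per S3 object

    def __init__(self, object_store):
        self.object_store = object_store  # dict standing in for an S3 bucket
        self.buffer = io.BytesIO()
        self.next_segment_id = 0

    def append(self, data: bytes):
        """LFS never overwrites in place; new data is appended to the log."""
        self.buffer.write(data)
        if self.buffer.tell() >= self.SEGMENT_SIZE:
            self.seal()

    def seal(self):
        """Write the current segment as one whole object and start a new one.
        In real life this would be a single s3.put_object() call."""
        if self.buffer.tell() == 0:
            return None
        key = f"segment-{self.next_segment_id:08d}"
        self.object_store[key] = self.buffer.getvalue()
        self.next_segment_id += 1
        self.buffer = io.BytesIO()
        return key
```

Because a sealed segment is immutable, there is never a read-modify-write against the object store, which is what makes S3's consistency restrictions a non-issue for an LFS.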
It’s amazing how many superpowers LFS has! The result of using S3 is that Datrium’s Cloud DVX offering has high performance, scales well, and is cost efficient.
Cost Efficiency #2. – Global Dedupe makes it all faster
How long does it take to move data around in a multi-cloud environment? And did you know that there is a per-GB egress cost? What if you could cut the time and cost by 10x, or even 100x? That’s only possible if there is global dedupe across multiple sites. And WAN optimization devices will not help.
Datrium provides end-to-end global dedupe, at rest and across clouds, no matter where the data is and how it got there. Not only does global dedupe save cost on the cloud storage, but also on WAN traffic because only new data is ever sent over the wire.
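The mechanics behind global dedupe can be sketched with content addressing: every chunk is named by a hash of its own bytes, so two sites can compare cheap chunk IDs instead of shipping data. A minimal illustration follows; the fixed chunk size and the function names are assumptions for the example, not Datrium's actual scheme.

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunking

def chunk_ids(data: bytes) -> dict:
    """Split data into chunks and name each by its SHA-256 digest.
    Identical chunks get identical IDs no matter where they live,
    which is what makes dedupe 'global'."""
    chunks = {}
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        chunks[hashlib.sha256(chunk).hexdigest()] = chunk
    return chunks

def bytes_to_send(local: dict, remote_ids: set) -> dict:
    """Only chunks the remote site does not already have cross the WAN."""
    return {cid: c for cid, c in local.items() if cid not in remote_ids}
```

Comparing IDs costs a few bytes per chunk, so the WAN only ever carries genuinely new data.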
Cost Efficiency #3. – Incrementals Forever
Cloud may be the new tape replacement. However, that does not mean we need to follow tape mechanics, like sending weekly fulls. Imagine sending 30TB over the WAN every week. That seems wrong. Is it even reasonable to ask that of our customers?
Datrium software has been designed to do incrementals forever. The only reasons to send fulls are initial seeding and newly created VMs. Even in those cases, global dedupe is employed to send just the missing data. The same logic applies when recovering data: only the missing data is sent back, making the experience faster and cheaper.
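The incrementals-forever flow reduces to "ask the cloud what it already has, then send the rest." Here is a self-contained sketch under that assumption; the `CloudStore` class and its methods are invented for illustration and do not reflect any real API.

```python
import hashlib

def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()

class CloudStore:
    """Stand-in for the cloud-side dedupe store (illustrative only)."""
    def __init__(self):
        self.chunks = {}

    def have(self, ids) -> set:
        """The cloud reports which chunk IDs it already holds."""
        return {cid for cid in ids if cid in self.chunks}

    def put(self, chunks: dict):
        self.chunks.update(chunks)

def backup(source_chunks, cloud: CloudStore) -> int:
    """'Incrementals forever': ask first, then send only missing chunks.
    Returns the number of chunks that actually crossed the WAN."""
    ids = {digest(c): c for c in source_chunks}
    missing = {cid: c for cid, c in ids.items()
               if cid not in cloud.have(ids)}
    cloud.put(missing)
    return len(missing)
```

Note there is no separate "full backup" code path: an initial seed is just a backup where every chunk happens to be missing, and every later backup naturally degenerates to the day's new data.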
Data correctness is verifiable
It is clear that many customers will choose a multi-cloud strategy to avoid vendor lock-in, and that implies that data needs to move efficiently across clouds. There is also a need to be assured that the data did get moved “correctly”. Datrium’s global dedupe comes with a content addressing scheme that reliably verifies correctness—kind of like blockchain. This is a lesser-known but very powerful attribute of global dedupe.
Do you really want to find out data is missing when it is time for recovery? No! It is better to find out about any issues sooner rather than later, so the data is periodically checked for integrity against software bugs, hardware bugs, etc.
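Content addressing makes such an integrity scrub almost free to express: since each chunk is stored under a hash of its own bytes, verification is just recomputing the hash and comparing. A minimal sketch, assuming the hash-keyed store from the dedupe discussion (the function name is hypothetical):

```python
import hashlib

def scrub(store: dict) -> list:
    """Periodic integrity check: recompute each chunk's content address and
    compare it with the ID the chunk is stored under. A mismatch means the
    bytes were corrupted somewhere along the way (software bug, hardware
    bug, bad transfer). Returns the IDs of corrupted chunks."""
    return [cid for cid, chunk in store.items()
            if hashlib.sha256(chunk).hexdigest() != cid]
```

Run the same check after a cross-cloud move and you have an end-to-end proof that the data arrived intact, without trusting any intermediate hop.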
Solving for Ease of Use
Ease of Use #1. – Please, no knobs in the cloud!
Some HCI vendors claim “1-click”. Dedupe is a click, compression is a click, and so on. And by the way, you need to read the manual before you perform the clicks. Who has the time to learn about all these knobs in today’s busy world? And as workloads transition back and forth in a multi-cloud environment, it is going to be a nightmare to sit and tweak these mundane settings. 1-click sounds good, but it turns out to be actual work for the end user. 0-click is way better. But that requires design thinking when building the system foundations. We’ve eschewed knobs.
Ease of Use #2. – SaaS experience
We wanted to provide a SaaS-like cloud experience. Customers don’t need to install software or manage the software or resources. They can just consume it, and we will take care of the rest. And that’s how Cloud DVX is offered: as-a-service. Datrium will be responsible for upgrading and monitoring the service. The service also sends home periodic deep telemetry heartbeats, which are then mined for proactive anomaly detection.
Ease of Use #3. – Lambda to the rescue
AWS services can sometimes go down (this is one of the things that makes the cloud hard to use). So, we use AWS Lambda (i.e., serverless, event-driven functions) to monitor the Cloud DVX service. The goal is to continually look for anomalies and issues, and then self-heal. Datrium’s Lambda monitoring software detects and rectifies issues by restarting our services on different resources. Masking AWS issues this way is part of why we call Datrium’s offering “Enterprise Tier-1” ready.
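The heart of such a self-healing loop is a small decision function that a scheduled Lambda invocation could call on each run. Datrium's actual monitor is not public, so everything below (service names, statuses, actions) is a hypothetical sketch of the pattern rather than the real implementation:

```python
def heal_action(health: dict) -> str:
    """Decide what one scheduled monitor invocation should do. `health`
    maps a service name to a status string as a hypothetical health probe
    might report it; thresholds and action names are illustrative only."""
    status = health.get("backup-service", "unknown")
    if status in ("down", "unknown"):
        # Restart on different resources to route around a bad
        # instance or availability zone, rather than retrying in place.
        return "restart-on-new-resources"
    if status == "degraded":
        return "raise-alert"
    return "no-op"
```

Keeping the decision logic pure (no AWS calls inside it) makes it trivially testable, while the surrounding Lambda handler does the actual probing and restarting.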
Ease of Use #4. – Fine Grained Recovery
Backup is all about recovery, and fine-grained recovery matters even more when doing it over the WAN with a per-GB transfer cost. Datrium’s offering enables recovery of individual VMs or virtual disks. In the near future, there will be logic to recover even finer-grained objects within the guest VMs. And all from a single pane of glass.
The logic behind Datrium’s Cloud DVX offering is to provide an auto-managed service for backup and recovery. For the user, it’s super simple to operate, and it carefully blurs the line between the local and public clouds. Global dedupe as a foundation makes this offering budget friendly, fast, and reliable.