What Separates the Storage Industry’s Men from Boys? Random Writes!

Every storage vendor cites read IOPS numbers, but the hard part is writes.

 

Anyone in the storage industry knows that there are many ways to game benchmarks. But ask any engineer or SE what a truly challenging workload looks like, and they will tell you "random writes over a large address space", especially on disk.

Say you have 120 HDDs, each capable of 100 IOPS, for a total of 12K IOPS. How much random write throughput could you sustain at a 32KB block size? Simple math says 12K * 32KB = 384 MB/s if you have no data protection. If you want data protection, the system consumes additional IOs for writing parity stripes (or for replication, as with HCI), and the available throughput drops further.
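Here is a minimal sketch of that back-of-the-envelope math. The write-amplification factor for data protection is an illustrative assumption, not a figure from any vendor:

```python
# Back-of-the-envelope random write throughput for an HDD pool.
# The write_amplification factor is illustrative, not taken from any vendor spec.

def random_write_throughput_mb_s(num_drives, iops_per_drive, block_kb,
                                 write_amplification=1.0):
    """Sustainable random write throughput in MB/s (decimal units, as in the text).

    write_amplification models the extra backend IOs spent on parity stripes
    or replication; 1.0 means no data protection.
    """
    raw_iops = num_drives * iops_per_drive
    usable_iops = raw_iops / write_amplification
    return usable_iops * block_kb / 1000  # KB/s -> MB/s

# 120 HDDs x 100 IOPS each, 32KB blocks, no protection: 384 MB/s
print(random_write_throughput_mb_s(120, 100, 32))        # 384.0
# Same pool with a hypothetical 2x write amplification (e.g. mirroring): 192 MB/s
print(random_write_throughput_mb_s(120, 100, 32, 2.0))   # 192.0
```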

There’s a good reason I cannot find pure 100% random write numbers published by any vendor. It’s a hard problem, even for AFAs. In fact, if you know of some vendor publishing 100% random write numbers, please send them to me, I’d love to see them.

By the numbers

So, how does Datrium DVX do with a 100% random write workload? In our performance labs, we decided to push an all-disk, 10-data-node DVX system as hard as we could. Here's the setup:

  • 480 VMs, each with 1TB logical span
  • 100% 32KB random writes
  • Undedupable data
  • 30 servers
  • 10 data nodes

We populated the system overnight with data of varying compressibility. At the time of measurement, all 480 VMs were doing 100% random writes, 32KB block size, 2X compressible, undedupable data (no games with dedupe), over a 70 TB logical address space.

You can see that the DVX sustains 8.2 GB/s of throughput and 263K IOPS at 1.5 ms latency. With 100% 32KB random writes.

Things to note:

  1. All the data is being compressed and erasure coded inline. No knobs or checkboxes.
  2. Deduplication is also turned on (there is no way to turn it off in a DVX), but the system won't find any duplicates because of the data pattern.
  3. The effective queue depth is around 500 when the screenshot is taken, and the 1.5 ms latency is as seen by the VM. All arrays report latency only at the array; the network can easily add 1 ms of latency. (A quick sanity check on these numbers follows this list.)
  4. Some vendors publish write benchmarks by overwriting the same tiny file that fits in NVRAM. We are not playing that game; we are writing randomly to 70TB of logical address space in a sustained write benchmark.
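As a rough sanity check on those figures: Little's Law says the number of outstanding IOs equals IOPS times latency, and IOPS times block size gives throughput. A minimal sketch using the numbers above, with decimal units assumed to match the article's arithmetic:

```python
# Rough sanity checks on the published figures, using only Little's Law and
# simple unit arithmetic (decimal units assumed).

iops = 263_000           # reported IOPS
latency_s = 0.0015       # 1.5 ms latency as seen by the VM
block_bytes = 32_000     # 32KB block size

# Little's Law: outstanding IOs = arrival rate x time in system.
# Roughly 400 here, in the same ballpark as the ~500 effective queue depth above.
print(f"implied concurrency: {iops * latency_s:.0f} outstanding IOs")

# IOPS x block size lands near the reported 8.2 GB/s (about 8.4 here).
print(f"implied throughput: {iops * block_bytes / 1e9:.1f} GB/s")
```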

The DVX can achieve this level of write performance because its foundation is a distributed log-structured filesystem. The roots of the DVX go all the way back to this seminal paper by Mendel Rosenblum (incidentally, a Datrium investor) and John Ousterhout, which won the SIGOPS Hall of Fame Award.
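To illustrate the core log-structured idea (a teaching sketch, not a description of the DVX implementation): random writes from the client become sequential appends to a log, and an index maps each logical block to its latest location in the log, so slow random seeks turn into fast sequential writes.

```python
# Minimal illustration of a log-structured store: random logical writes become
# sequential appends to a log, tracked by an in-memory index.
# A teaching sketch only, not how the DVX implements it.

class LogStructuredStore:
    def __init__(self):
        self.log = []      # append-only log of (logical_block, data) records
        self.index = {}    # logical block number -> position of latest record

    def write(self, logical_block, data):
        # Random from the client's point of view, but a pure sequential append
        # on the backend; only the small in-memory index is updated in place.
        self.log.append((logical_block, data))
        self.index[logical_block] = len(self.log) - 1

    def read(self, logical_block):
        pos = self.index.get(logical_block)
        return self.log[pos][1] if pos is not None else None

store = LogStructuredStore()
store.write(7_000_000, b"A")   # scattered logical addresses...
store.write(3, b"B")
store.write(7_000_000, b"C")   # an overwrite lands at the log tail, not in place
print(store.read(7_000_000))   # b'C'
```

In a real system the log is laid out as large sequential stripes across many drives, which is also where inline compression and erasure coding slot in, and a cleaner reclaims space from overwritten records; those parts are omitted here.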

I'd love to compare our write numbers against the competition, but vendors generally do not publish large-span random write numbers. They play it safe by citing read IOPS numbers served from a buffer cache (VMAX has 2TB of that). We can play that game too, scoring 18M IOPS on random 4K reads actually served from SSD devices, but this post is about writes; read on for an explanation of why.

So what, who cares about disks?

Some of you may be wondering, “So what? Why should we care about writes to spinning disks? AFAs must have better performance, right?”

Well, first off, AFAs are notably worse than a DVX, thanks to SSDs on the hosts and data locality (read on for a performance comparison with AFAs).

The main point I want to make is that disks will remain important in your infrastructure even if primary workloads run entirely out of flash, as they should. 

The reason is backups. SSDs are nowhere near HDD economics for backups and long-term archiving. Even if you deploy AFAs to satisfy your random-write performance needs, you will find yourself acquiring a complete disk-based backup infrastructure, often from several different vendors: backup software, data movers, and a dedicated disk-based backup appliance (a custom backup array).

If there is a requirement for a separate DR site, multiply this backup infrastructure by two and throw in WAN acceleration hardware. Then you will have to manage and administer interactions between primary and backup arrays on both sites, with many moving parts and things inevitably going wrong.

Often, the aggregate cost of secondary storage hardware and backup/recovery software, combined with the ongoing management of these disparate infrastructure silos, exceeds the cost of the primary storage AFA.

This is what the DVX is all about: a single system with high-performance, low-latency primary storage that outperforms AFAs thanks to flash on the hosts, combined with built-in backup & replication software and a disk-based, cost-optimized, scalable, highly durable data repository.

The DVX is all you need: it combines primary storage, backup & replication software, backup media, offsite archiving, WAN acceleration, and unified management into a single system.

For any system to achieve this, it has to handle disk writes extremely well, because you are going to run your big, random-IO-hungry Oracle database on the same system that is storing 15-minute backups!

So how about random reads & mixed read/write workloads?

I'm not writing about random writes because we're poor at read performance. The DVX has insane read performance on a single host, and performance scales linearly with the number of hosts: unlike HCI, there is no cross-server chatter, and unlike arrays, there is no controller bottleneck. Here's a summary of DVX performance:

The DVX can do 18 million read IOPS (4KB random reads) and 200 GB/s of bandwidth (32KB random reads). That is 3X more IOPS than the biggest all-flash VMAX 950F system (spec here) and 17 times more IOPS and bandwidth than the largest Pure Storage FlashArray system (spec here).

Note that the quoted DVX number comes from actual SSDs: we made sure the random-read span is so large that it does not hit the in-memory RAM cache. Many array vendors game this by reading a tiny file that fits in their expensive RAM caches; we did not.
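Why span size matters: for uniformly random reads, the expected RAM-cache hit ratio is roughly the cache size divided by the span being read, so a tiny benchmark file is served almost entirely from RAM while a large span forces reads down to the actual media. A rough sketch with illustrative sizes (the 2TB cache echoes the VMAX figure mentioned earlier; the spans are made up):

```python
# Expected cache hit ratio for uniformly random reads over a given span.
# Sizes are illustrative, not measurements of any particular system.

def expected_hit_ratio(cache_bytes, span_bytes):
    """Fraction of uniformly random reads expected to be served from cache."""
    return min(1.0, cache_bytes / span_bytes)

TB = 10**12
cache = 2 * TB   # a large array-style RAM cache

for span in (1 * TB, 2 * TB, 70 * TB):
    print(f"span {span / TB:>4.0f} TB -> hit ratio {expected_hit_ratio(cache, span):.0%}")
# span    1 TB -> hit ratio 100%
# span    2 TB -> hit ratio 100%
# span   70 TB -> hit ratio 3%
```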

How about a 70% read / 30% write mix?

A single data node in a DVX, with attached hosts, can do 405K IOPS. The largest XtremIO brick (spec here) can do 220K IOPS.

The largest XtremIO scale-out system can do 880K IOPS. The DVX can do 3.2 million IOPS, roughly 3.6 times faster than the biggest XtremIO cluster money can buy. Infinidat, Unity, and XtremIO all publish 8KB 70/30 numbers, and we blow past every single one of them by at least 3.6X and sometimes 10X.
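For the record, the ratios above are just the cited figures divided out; a quick calculation makes them easy to reproduce (using only the IOPS numbers quoted in this post):

```python
# Ratios between the 70/30 mixed-workload figures cited above.
dvx_cluster_iops = 3_200_000
dvx_node_iops = 405_000
xtremio_cluster_iops = 880_000
xtremio_brick_iops = 220_000

print(f"DVX cluster vs XtremIO cluster: {dvx_cluster_iops / xtremio_cluster_iops:.1f}x")  # ~3.6x
print(f"DVX node vs XtremIO brick:      {dvx_node_iops / xtremio_brick_iops:.1f}x")       # ~1.8x
```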

But make no mistake, what separates the men from the boys is random write throughput…especially on disk.