Hyperconverged Infrastructure (HCI) has been touted as the fix for traditional storage array problems, and it does have some very nice properties. With HCI, storage moves into the compute nodes, and if the storage is flash, that’s exactly where you want it to be. The filesystem can use spare CPU cycles on the compute node, and every server you add brings more compute with it. Because filesystem compute resources scale with the number of servers, you no longer have the traditional storage array worry of maxing out the storage controller. Ease of use improves because compute and storage are managed as one. All HCI systems support VM-based management, which is the right granularity. However, along with those desirable properties come some serious dark sides.
Safety doesn’t happen by accident
HCI systems typically mirror data across multiple participating nodes for data availability. At configuration time you choose between keeping 2 or 3 copies of data. The 2-copy option exists because storing 3 copies of the same data is expensive. But those savings come with a significant risk of data loss or data unavailability: with one node down (it could simply be offline for maintenance), a single sector read error on the remaining copy means data unavailability at best and data loss at worst. Adam Leventhal (Sun) has run the numbers on this: “RAID-1 can be viewed as a degenerate form of RAID-5, so even if bit error rates improve at the same rate as hard-drive capacities, the time to repair for RAID-1 could become debilitating. How secure would an administrator be running without redundancy for a week-long scrub?” Who wants RAID-5 anyway?
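To put a rough number on the "single sector read error" risk, here is a minimal sketch that estimates the chance of hitting at least one unrecoverable read error while reading the surviving copy. The bit error rates and the 10 TB figure are illustrative assumptions, not vendor specifications, and the model assumes bit errors are independent.

```python
import math

def p_read_error(bytes_read: float, ber: float) -> float:
    """Chance of at least one unrecoverable bit error while reading bytes_read bytes.

    Assumes independent bit errors at rate `ber` (a simplification).
    """
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed with log1p/expm1 to stay accurate for tiny ber
    return -math.expm1(bits * math.log1p(-ber))

TB = 1e12  # decimal terabyte

# Hypothetical case: one node is down and 10 TB must be read from the only remaining copy.
for ber in (1e-14, 1e-15, 1e-16):  # assumed error rates, for illustration only
    print(f"BER {ber:.0e}: {p_read_error(10 * TB, ber):.2%}")
```

With a third copy, a read error on one surviving replica can still be served from the other, which is the whole argument for paying for 3 copies.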
If you care about your data, you really need to configure for 3 copies, and now you’re paying through the nose, especially on flash. Most arrays support always-on Erasure Coding, but on HCI systems it is a checkbox. The checkbox defaults to off and comes with many ifs and buts attached. Even when it is enabled, data is still first written in mirrored form and only later post-processed and rewritten in Erasure Coded form, so there is a long window of vulnerability until the system deems the data “Write-Cold”. In addition, Erasure Coding in an HCI system sprays the data over all nodes, which increases latency and network load for writes and rebuilds. Worse yet, it can result in reads over the network, which is why there is an emphasis on “cold” data. Overwrite-intensive (read: database) workloads will suffer a performance hit in HCI systems if you enable Erasure Coding. Not so simple anymore, is it?
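The cost gap between mirroring and Erasure Coding is simple arithmetic. The sketch below compares the raw capacity needed for the same usable capacity; the 4+2 layout and the 100 TB figure are assumptions chosen for illustration, not any vendor's defaults.

```python
def raw_tb_mirrored(usable_tb: float, copies: int) -> float:
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies

def raw_tb_erasure_coded(usable_tb: float, k: int, m: int) -> float:
    """Raw capacity needed with k data fragments plus m parity fragments per stripe."""
    return usable_tb * (k + m) / k

usable = 100  # TB of usable capacity (hypothetical)
print("2 copies:", raw_tb_mirrored(usable, 2), "TB raw")
print("3 copies:", raw_tb_mirrored(usable, 3), "TB raw")
print("4+2 EC  :", raw_tb_erasure_coded(usable, 4, 2), "TB raw")
```

Both 3 copies and a layout with two parity fragments survive two failures, but the mirrored version buys that protection with far more flash.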
Some things are not meant to be combined
Back when storage and compute were separate, you could take down one or more servers for maintenance without storage implications for the remaining servers; servers were stateless. HCI combines servers and durable copies of data, which makes server downtime more problematic. When a server goes down (planned or unplanned), you don’t just lose compute resources, you lose one of the copies of your data. A single server loss is disruptive and requires re-mirroring of data to restore redundancy. With two servers down, you’ve eaten through 2 of your 3 copies and you’re running on the edge of disaster. If the downtime is planned, you can evict data before taking a server down (some HCI vendors recommend this), but that takes time, network bandwidth, and compute on the remaining servers.
Thinking of enabling Erasure Coding? Well, one single server down means you must reconstruct and rewrite all the data on the server that went down—usually that means 6 to 10 times the data on the missing node flowing through the network. Try dealing with that when a storage-only node goes down.
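A back-of-the-envelope sketch of that rebuild traffic follows. It assumes that reconstructing each lost fragment means reading k surviving fragments, so the "6 to 10 times" figure would correspond to k between 6 and 10 (an inference, not a vendor number). The node size and the aggregate rebuild bandwidth are hypothetical.

```python
def rebuild_read_tb(node_data_tb: float, k: int) -> float:
    """Data read over the network to rebuild one failed node.

    Assumes each lost fragment is reconstructed from k surviving fragments.
    With plain mirroring the equivalent factor would be 1.
    """
    return node_data_tb * k

def rebuild_hours(traffic_tb: float, gbit_per_s: float) -> float:
    """Hours to move traffic_tb at a sustained aggregate rate of gbit_per_s."""
    seconds = traffic_tb * 8e12 / (gbit_per_s * 1e9)
    return seconds / 3600

node_tb = 20  # data held by the failed node (hypothetical)
for k in (6, 10):
    traffic = rebuild_read_tb(node_tb, k)
    hours = rebuild_hours(traffic, gbit_per_s=10)  # assumed rebuild bandwidth
    print(f"k={k}: {traffic:.0f} TB read, about {hours:.1f} hours at 10 Gbit/s")
```

The real rate is usually lower because the rebuild competes with production traffic on the same network.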
Traffic is only one of the side effects of growth
Mirroring data results in network traffic to each of the mirror nodes. Mirrored writes generate “east-west” network traffic between the nodes, and the resulting “neighbor noise” creates network performance issues at scale. Guaranteeing performance under load becomes challenging because every node is communicating with every other node. And debugging performance can be extremely difficult.
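Two small calculations show why this gets worse with scale. The sketch assumes one copy of each write lands on the local node and the rest go over the network, and it counts every pair of nodes as a potential communication path; both are simplifications for illustration.

```python
def node_pairs(n: int) -> int:
    """Number of distinct node-to-node paths in a cluster of n nodes."""
    return n * (n - 1) // 2

def replication_bytes(write_bytes: int, copies: int) -> int:
    """Bytes sent over the network for one write.

    Assumes one copy is written locally and the other copies are remote.
    """
    return write_bytes * (copies - 1)

for n in (4, 8, 16, 32):
    print(f"{n} nodes: {node_pairs(n)} node pairs")

print("1 MiB write, 3 copies:", replication_bytes(1 << 20, 3), "bytes on the wire")
```

The number of paths grows with the square of the node count, which is part of why a slowdown is so hard to trace back to a single culprit.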
HCI vendors will claim they support heterogeneous clusters, but their support organizations will guide you toward homogeneity. Otherwise you will have hot spots, because the performance of any one node is gated by the performance of the slowest node with which it interacts. If you set up a cluster and grow it over a period of several years, you’re faced with the choice of buying out-of-date servers to match the existing cluster or buying new servers that potentially create an imbalance. This also becomes a problem when you have an application that needs a server with more memory or CPU than the existing servers.
Storage and compute are now tied together, so when you want to scale one you’re forced to scale the other. This means you are overpaying in one dimension or the other. Vendors work around this by supporting the addition of storage-only nodes. These storage-only nodes will typically hold more data than the compute-storage nodes, which means you need to buy two of them to handle the load when a storage-only node fails. Larger storage nodes also service more IO, which creates additional IO hot-spot challenges. And any data stored on a storage-only node is, by definition, remote from a VM’s compute server, so data locality and low-latency reads are lost. Think of storage-only nodes as arrays with single points of failure and no erasure coding support.
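The overpaying problem can be sketched as follows for a cluster of identical nodes, ignoring replication overhead. The workload and node sizes are hypothetical numbers picked to make the imbalance visible.

```python
import math

def hci_nodes_needed(cores_needed: int, tb_needed: float,
                     cores_per_node: int, tb_per_node: float):
    """Return (nodes, surplus_cores, surplus_tb) for a homogeneous HCI cluster.

    The node count is driven by whichever resource runs out first,
    so the other resource is over-provisioned.
    """
    for_compute = math.ceil(cores_needed / cores_per_node)
    for_storage = math.ceil(tb_needed / tb_per_node)
    nodes = max(for_compute, for_storage)
    surplus_cores = nodes * cores_per_node - cores_needed
    surplus_tb = nodes * tb_per_node - tb_needed
    return nodes, surplus_cores, surplus_tb

# Hypothetical storage-heavy workload: 200 cores and 1000 TB on 32-core, 20 TB nodes.
nodes, extra_cores, extra_tb = hci_nodes_needed(200, 1000, 32, 20)
print(f"{nodes} nodes, {extra_cores} idle cores, {extra_tb} TB spare")
```

In this example 7 nodes would cover the compute, but storage forces the purchase of 50.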
You can’t get there from here
When you look below the surface, it becomes clear that HCI is a Faustian bargain with some serious dark sides. In exchange for better alignment with virtualization and a simpler management interface, you must live with a bad data durability story, a bad scaling story, expensive storage, poor performance, and maintenance headaches. These problems aren’t a result of convergence being a bad idea. The trouble is that hyperconverged is over-converged.
Can you get the essential benefits of converged storage without throwing out all the benefits of traditional storage arrays? Yes, turns out you can!