What do thousands of metrics in the new rackscale infrastructure solution do for private cloud customers?

We live and breathe data. Every. Day. Even though we’ve engineered the industry’s first server-powered storage system, that’s not what I mean by living and breathing data.

So, what do I mean by living and breathing data? Our team has pioneered industry-changing technologies like the VMware hypervisor, the Java Virtual Machine, the Data Domain system, and the Android OS. Our collective engineering experience spans the hard-core file systems, virtualization, and systems management fields, and we all share one common trait: we are extremely skeptical of any engineering claim or result without hard data. And that includes our own code.

To put this into perspective, we’ve designed on the order of 6,000 metrics into our system, where a metric is any type of data that we track over time. We can measure anything from the IO latency of a single SSD to Data Node fan speed to numerous internal software metrics that tell us exactly what’s happening at any moment. For example, we track read/write IO latency for every vdisk in our system; with 2,500 VMs running 5 vdisks each, our system tracks 25,000 data points every few seconds just for read/write IO latency! And we add new metrics every day as new features are developed. Using hard data to pinpoint possible issues or justify a major change is baked into our engineering culture.

And these metrics are always on. Customer systems in the field generate data points for the same metrics we use to develop and debug our systems during feature development.  With automated support connectivity, we get extremely fine visibility into deployed systems. This depth of system data enhances our front-line support capabilities and is partly why our support team is so highly rated by our customers.

All of this means that we’ve invested heavily in the software infrastructure for defining, computing, collecting, and analyzing metrics. I routinely joke with Ganesh Venkitachalam, one of our founders, that we could spin out a new product to compete with companies like Splunk if we wanted to add a new line of business.

At the risk of revealing too much of our secret sauce, here are just a couple of unique aspects of how things are engineered.

Metric definition, computation and standardization

To begin with, we’ve developed a language for specifying metrics. By standardizing how we define metrics, we were able to build a shared software infrastructure that automates much of the collection, analysis, and display of computed data points.
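The specification language itself isn’t shown here, but to make the idea concrete, a purely hypothetical metric definition, sketched in Python, might carry a name, a unit, an optional weighting metric, and a set of rollup functions (all names below are illustrative, not Datrium’s actual syntax):

```python
# Hypothetical sketch of a declarative metric definition; the actual
# Datrium specification language is not shown in the post.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class MetricDef:
    name: str                          # unique name referenced by CLI/UI
    unit: str                          # e.g. "ms", "iops", "rpm"
    weight_by: Optional[str] = None    # metric used as a weighting factor
    rollups: Tuple[str, ...] = ("avg", "min", "max")  # rollup functions

# Example: host IO read latency, IOPS-weighted when averaged system-wide
host_read_latency = MetricDef(
    name="host.io.read_latency",
    unit="ms",
    weight_by="host.io.read_iops",
)
```

Because every metric is declared the same way, shared infrastructure can drive collection, rollup, and display from the definition alone.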

Metric definitions direct how the software infrastructure computes data points. For example, an engineer may specify that the DVX system-wide average host IO latency metric is weighted by IOPS. The shared software infrastructure takes this instruction into account when analyzing IO latency data points from each host: the average latency is computed automatically using each host’s IOPS as the weighting factor.
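The IOPS weighting described above amounts to a weighted mean. A minimal sketch (the function and field names are illustrative, not Datrium’s API):

```python
def weighted_avg_latency(samples):
    """Average per-host latency weighted by each host's IOPS.

    samples: list of (latency_ms, iops) pairs, one per host.
    """
    total_iops = sum(iops for _, iops in samples)
    if total_iops == 0:
        return 0.0
    return sum(lat * iops for lat, iops in samples) / total_iops

# A host doing most of the IOPS dominates the system-wide average:
# hosts at (1.0 ms, 900 IOPS) and (10.0 ms, 100 IOPS) average to
# (1.0*900 + 10.0*100) / 1000 = 1.9 ms, not the unweighted 5.5 ms.
```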

A metric definition may also specify how samples should be rolled up to coarser time intervals for long-term trend analysis. Engineers can specify average, min, max, and other functions for each metric, and the shared infrastructure automatically rolls up data points across time intervals using the specified functions.
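Rolling fine-grained samples up into coarser buckets with a per-metric function could be sketched like this (illustrative only, not the actual implementation):

```python
def rollup(samples, bucket_secs, fn):
    """Roll (timestamp, value) samples up into coarser time buckets.

    fn is the per-metric rollup function, e.g. min, max, or an average.
    Returns {bucket_start_ts: rolled_up_value}.
    """
    buckets = {}
    for ts, val in samples:
        buckets.setdefault(ts - ts % bucket_secs, []).append(val)
    return {start: fn(vals) for start, vals in buckets.items()}

avg = lambda vals: sum(vals) / len(vals)
samples = [(0, 4.0), (10, 6.0), (300, 2.0)]   # 10-second raw samples
rollup(samples, 300, avg)   # {0: 5.0, 300: 2.0}
rollup(samples, 300, max)   # {0: 6.0, 300: 2.0}
```

The same raw samples produce different 5-minute rollups depending on the function the metric’s definition names, which is exactly why the choice lives in the definition rather than in the collection code.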

In short, the shared software infrastructure automates much of the heavy lifting behind sampling raw values and computing derived data from them, so that we can focus on improving the product. The same shared software infrastructure also enables our CLI and UI to display metrics by simply referencing their names. This means we can iterate quickly on which metrics to bubble up to customers in our GUI. 

Server powered

To scale metric data processing as the number of hosts and VMs grows, we’ve distributed the real-time and 5-minute data computation to the DVX-enabled servers.

For example, the VM IO latency metric is computed by analyzing individual vdisk IO latency with IOPS weighting. As you can imagine, the CPU and memory requirements for computing such a metric increase linearly as the number of VMs grows. With just 2,500 VMs running 5 vdisks each, we need to analyze 25,000 individual vdisk read/write IO latency data points to compute just the read/write IO latency metrics for all VMs. All of this happens in real time, every few seconds, as IO is flowing, and we have close to 6,000 metrics and growing!

Some legacy systems in the market do not provide VM-level metrics at all. Others that offer such data have severe limitations on the number of metrics and/or time interval at which data points are computed. Some even require external database setup to track such metrics for large deployments or have to offload data processing to an externally hosted service outside your datacenter.

Datrium DVX provides extremely fine-grained vdisk, VM, host, and system-wide metrics out of the box. No need to set up any software or use a separate service!

Just as the DVX data path scales IO performance by adding more servers, our metric computation capacity also grows automatically as hosts are added to a DVX system. We collocate the computation of VM metrics with the servers where those VMs are actually running. As a VM performs IO on its vdisks, we measure the raw latency of each vdisk and compute VM-level latency immediately, right on the same host where the IO is happening. We avoid moving metric data around to prevent network chatter. These computed data points are pulled from hosts on demand when a user requests them through our UI, with near-instantaneous results.
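The scale-out pattern above can be sketched as follows: each host computes VM-level latency from its local vdisk samples, and a query merely merges the already-computed per-host results, so the raw vdisk data never leaves the host. Class and function names here are hypothetical:

```python
# Sketch of the scale-out pattern: per-host computation, on-demand pull.
class Host:
    def __init__(self, vdisk_stats):
        # vdisk_stats: {vm_name: [(latency_ms, iops), ...]} covering
        # only the vdisks of VMs running on *this* host.
        self.vdisk_stats = vdisk_stats

    def vm_latencies(self):
        """IOPS-weighted VM latency computed from local vdisk samples."""
        out = {}
        for vm, stats in self.vdisk_stats.items():
            total_iops = sum(iops for _, iops in stats)
            out[vm] = (sum(lat * iops for lat, iops in stats) / total_iops
                       if total_iops else 0.0)
        return out

def query_vm_latencies(hosts):
    """On-demand pull for a UI request: merge per-host results.

    Only one small computed value per VM crosses the network, not the
    25,000 raw vdisk data points behind them.
    """
    merged = {}
    for host in hosts:
        merged.update(host.vm_latencies())
    return merged
```

Adding a host adds both the VMs and the compute that analyzes them, which is why capacity scales with the cluster.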

Time series database

Spotting performance anomalies often requires analyzing various VM metrics over a wide time span. Long-term trend data is also crucial to server IO performance and capacity planning.

We’ve developed a time series database specifically for these use cases. The DVX UI can chart up to a year’s worth of metric data points for VMs and hosts. Most importantly, these charts are fully interactive: users can select individual time slices, zoom into specific intervals, and examine key metric hotspot rankings at each time slice. And all of the queries run in real time against the DVX time series database as users analyze the historical charts.

Such capabilities are often delegated to external monitoring solutions coupled with dedicated databases. Datrium DVX provides this functionality out of the box, without any additional software!

We built this database using our core file systems technology. Once servers compute 5-min rollup data points, the time series database persists and indexes them so that we can:

  • Maximize spatial locality by co-locating each VM’s data points for different metrics. This enables prefetching the large number of metrics users are most likely to query when analyzing a VM’s performance.
  • Maximize temporal locality by co-locating each VM’s data points across time. Performance analysis rarely involves a single time slice; users are normally interested in a time range and need to query for data points spanning that interval. By keeping data points that are adjacent in time together, we can prefetch nearby data in historical charts.
  • Pre-compute hotspot rankings for each time slice. And the ranking data exploits the same spatial and temporal locality to accelerate user queries.
  • Index each data point so that we can locate any specific metric quickly.
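One way to picture the first two locality points is in how data points are keyed. If each point is keyed by (vm_id, metric, timestamp), a plain lexicographic ordering co-locates all of a VM’s metrics and keeps each metric contiguous in time. This is a toy illustration; the post doesn’t describe the actual on-disk layout:

```python
# Toy illustration: keying each data point by (vm_id, metric, timestamp)
# gives both locality properties from a single sort order.
points = [
    ("vm2", "read_latency_ms", 300, 1.2),
    ("vm1", "read_latency_ms", 0,   0.9),
    ("vm1", "write_iops",      0,   450),
    ("vm1", "read_latency_ms", 300, 1.1),
]
ordered = sorted(points)  # lexicographic on (vm_id, metric, timestamp)

# All of vm1's points now sit together (spatial locality), and within
# each metric the timestamps ascend (temporal locality) -- a single
# range scan prefetches exactly what a historical chart for vm1 needs.
```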

Because we built this database technology on our file system, there is no separate system to maintain or troubleshoot. And the metrics data is as highly available and protected as your VMs!

VM analytics simplified

We constantly add new metrics to better measure how field-deployed systems are performing, and we never stop finding ways to simplify performance management for our customers. We’re well on our way to solving the end-to-end VM performance management complexity problem. Now you have a sense of how important VM performance analytics is to us, and why.