Dave McCrory ran a data strategy and consulting business before he joined Digital Realty. Over the last 25 years, McCrory led teams at GE Digital, Basho Technologies, Warner Music Group and others. He also co-founded Hyper9 (acquired by SolarWinds) and Surgient (acquired by Quest Software). McCrory holds more than nine technology patents in virtualization, cloud and systems management and created the concept of data gravity.

Published in Issue 5 | December 2020

How did you come up with the idea of data gravity?

I started working on virtualization almost two decades ago and realized its potential while I was at Surgient. We filed a number of patents, including the first patent on cloud computing: a logical virtualized server cloud. From 2000 to 2010 there was significant growth of the cloud, and by 2010 it had started to take off. Back then, I worked in Dell's Data Center Solutions group. While evaluating cloud providers, I noticed that the data was growing significantly. As the data grew, it attracted both services and applications closer to it. That reminded me of gravity and led me to write a blog post about 'data gravity in the clouds'. There was a virtuous cycle of data: the more data you have, the more data you create, and applications want to be closer to that data, because applications close to the data gain access to higher bandwidth and lower latencies. With this rationalization, I coined the term 'data gravity' to describe the concept.

What is data gravity’s role in cloud computing?

Cloud providers understand the importance of data gravity and its effects. Initially at least, it is usually less expensive for enterprises to store their data in cloud platforms. The cloud providers realize that enterprises are going to want to leverage these same platforms to run their applications, do analytics, and have their partners access and use this data. All of that means that increasing amounts of data are going to be attracted to their cloud. Therefore, the cloud providers benefit from the effects of data gravity, and they build ecosystems leveraging those effects.

How do you take advantage of data gravity in a colocation environment?

When it comes to data, it is important to understand what is creating and interacting with it, plus where it gets stored. Ideally, data analytics and data processing should be done where the data resides. However, if your consumers are distributed, then you also need to distribute the data as rapidly as possible. In the case of an enterprise working with large amounts of data, you find the most efficient ways to work with that data. Some enterprises work with all of their data in one place. If the data is not all being consumed in that one place, then the data architecture needs to change so that data is at the center. The emphasis is on a data-centric architecture instead of backhaul models, where processing is at the center. If there is a reason you need to move the data, then you look for the best ways and locations. Those locations include highly connected facilities where your company and business partners come together to exchange data in a low-latency, high-bandwidth environment.

I think of bandwidth as lanes on a highway: the more lanes you have, the more traffic the highway can support. Likewise, the more bandwidth you have, the better off you are in moving data around and getting it written to and read from storage.

What is the role of data gravity at the edge?

Data processing happens at the core, where the emphasis is on the low-latency delivery of applications. The edge is important for faster delivery of data, but it has its limitations: you have neither infinite storage nor infinite bandwidth at the edge. You may have edge locations globally, but each location is not going to communicate with all the others, so the data has to go to an intermediate or core facility to be aggregated. At each stage, data gravity has its effects. At the edge, you are bound by the physical limitations of the location, its network bandwidth, or its latency. It is important to get the data in one place, so that location has a high amount of gravity. That location then attracts partners and customers to connect to the facility, which enables them to access the data quickly and easily.

From an enterprise perspective, if you look at a SaaS application provider, they either have their own cloud or they are in a partner cloud. That is where you would want to be to interact with their applications and data. Generally, enterprises spread their workloads across different cloud providers. They want to access applications in a variety of clouds, so their data gravitates towards those. Hence, data gravity is a driver of the multi-cloud concept. As more and more data is generated and distributed, data gravity comes even more to the forefront, because it affects more things, people, businesses, and industries in a much deeper way.

What is the methodology behind the Data Gravity Index DGx™?

The components of data gravity are included in the formula: data mass × data activity × bandwidth, all divided by latency squared. Data mass is data at rest: all of the stored data, including data stored in the network and in traditional storage drives. This is considered a potential data mass variable. Data activity is data in motion, either moving across the network, being processed, or being sent to storage. Next, you have the amount of bandwidth, which, as I said, works like lanes on a highway: the more bandwidth you have, the better off you are in moving data around and getting it written to and read from storage. These components are brought together because they encompass a set of variables that amplify the gravity intensity.

Everything is divided by latency squared because latency is effectively the speed limit of the highway. A high speed limit would be good if you were trying to move along the highway quickly, but because latency is time-based, it is actually better to have lower latency. In fact, in the formula, if you had a latency of one millisecond and squared it, the result would still be one. For the vast majority of cases, one millisecond or less of latency is optimal.
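The formula described above can be expressed as a small calculation. The sketch below is illustrative only, not Digital Realty's actual DGx methodology: the function name, input values, and units are assumptions, since the index is meant for relative comparison rather than absolute measurement.

```python
def data_gravity_index(data_mass, data_activity, bandwidth, latency):
    """Illustrative index: (data mass * data activity * bandwidth) / latency^2.

    Units are arbitrary; only relative comparisons between locations matter.
    """
    return (data_mass * data_activity * bandwidth) / (latency ** 2)

# Because latency is squared in the denominator, halving latency
# quadruples the index, all other variables held equal.
base = data_gravity_index(100, 10, 40, latency=4.0)
fast = data_gravity_index(100, 10, 40, latency=2.0)
print(fast / base)  # → 4.0

# At one millisecond, the denominator is 1, so the index is just
# mass x activity x bandwidth.
print(data_gravity_index(5, 5, 5, latency=1.0))  # → 125.0
```

Note how the squared latency term dominates: improving latency pays off faster than adding an equivalent amount of bandwidth, which is the point of placing it in the denominator.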

At higher levels of latency, the impact becomes fairly profound. When you move into hundreds of milliseconds, systems begin to be more dramatically affected by latency, and beyond half a second, people start to notice. Hence, the squaring is meant to amplify that effect. Because latency has such an extreme impact, there have been studies of latency affecting human behavior. At the extreme, with enough latency there is no difference between latency and downtime. Imagine if you were attempting to use a website and the latency was an hour; you would assume the system was down. It might still be working, but it is so incredibly slow that you assume it is broken. By contrast, if the latency were two or ten milliseconds, it would seem instant. Therefore, it is important to be as close to your data as possible, as that offers the greatest opportunity to get the lowest latency and the maximum bandwidth.