How data is stored using Cloud Storage

Once people understand the basics of cloud storage, a common question is “which part of the cloud has my data, and how does it get to me?” There is not necessarily one answer. Factors such as the size of your data and the technology used to manage the cloud’s logical storage pool determine exactly where your data is stored. Your data may sit on a single storage device somewhere. It is almost equally possible that it is split into multiple pieces across multiple storage devices around the world.

The short answer is that your data is all over the cloud. To understand this, you need to know that cloud storage systems operate quite differently from a normal computer. Cloud storage typically treats data as objects and organizes those objects in a very specific fashion. Two of the primary goals of these storage systems are to work as quickly as possible and to make sure the data is always available. To accomplish these goals, the data is split into objects, the objects are grouped into clusters, and each object is replicated across multiple storage devices. Splitting the data into objects and assigning each one to a cluster is an organizational system that, combined with a master that acts as an index, makes finding and reading data very fast. The second part is replication. Because copies are kept at multiple locations, the data stays accessible even if the original is unavailable. In some systems this also means that if one location is faster than another, the system may choose to read from the faster replica.
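The idea of a master index plus replicated pieces can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular provider works: the names `Device` and `Master`, the four-byte chunk size, the three copies per chunk, and the round-robin placement are all made up for illustration.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes per piece; tiny on purpose so the split is visible (assumed)
REPLICAS = 3    # copies kept of every piece (assumed)


@dataclass
class Device:
    """A hypothetical storage device holding pieces of objects."""
    name: str
    online: bool = True
    chunks: dict = field(default_factory=dict)  # piece id -> bytes


@dataclass
class Master:
    """Index only: maps an object name to its pieces and where they live."""
    devices: list
    index: dict = field(default_factory=dict)  # name -> [(piece id, [Device, ...])]

    def put(self, name: str, data: bytes) -> None:
        entries = []
        for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
            chunk_id = f"{name}#{n}"
            # Round-robin placement: a stand-in for a real placement policy.
            homes = [self.devices[(n + r) % len(self.devices)] for r in range(REPLICAS)]
            for device in homes:
                device.chunks[chunk_id] = data[start:start + CHUNK_SIZE]
            entries.append((chunk_id, homes))
        self.index[name] = entries

    def get(self, name: str) -> bytes:
        parts = []
        for chunk_id, homes in self.index[name]:
            # Any online replica will do; the others are there as backups.
            device = next(d for d in homes if d.online)
            parts.append(device.chunks[chunk_id])
        return b"".join(parts)


devices = [Device(f"dev{i}") for i in range(4)]
master = Master(devices)
master.put("photo", b"hello cloud storage")
devices[0].online = False   # simulate one device becoming unavailable
print(master.get("photo"))  # b'hello cloud storage'
```

Even with one device offline, the read still succeeds because the master knows where the other copies of each piece are kept.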

By Robert.Harker (Own work) [CC BY-SA 3.0]

Each system makes different trade-offs. Google’s “Google File System”, or “GFS”, is highly focused on handling large numbers of requests and returning the requested data as fast as possible. What is sacrificed to achieve this delivery speed is what could be called data housekeeping. Because GFS is focused on data delivery, not all replicas of a piece of data may be current at a given moment. Each replica will eventually become current, but that is treated as a secondary objective. GFS will return whichever replica is fastest, regardless of whether it is the most up to date.
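A speed-first read, as this paragraph describes it, might look like the sketch below. It is an illustration of the trade-off only, not GFS code: the `Replica` class, the fixed `latency_ms` values, and the integer `version` field are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A hypothetical copy of one piece of data."""
    name: str
    latency_ms: float  # assumed, fixed response time for this sketch
    version: int       # higher means more recently updated
    value: str


def read_fastest(replicas):
    """Return the copy held by the quickest replica, without comparing versions."""
    return min(replicas, key=lambda r: r.latency_ms)


replicas = [
    Replica("near", latency_ms=5, version=1, value="draft"),
    Replica("far", latency_ms=80, version=2, value="final"),
]
chosen = read_fastest(replicas)
print(chosen.name, chosen.value)  # near draft
```

The nearby replica wins even though it holds the older version, which is exactly the housekeeping that is given up in exchange for speed.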

 

At the other end of the spectrum is Amazon’s Dynamo system. Dynamo also keeps replicas of data and aims to answer requests quickly, but it also focuses on ensuring that only the most up-to-date data is returned. To accomplish this, when data is requested, all of the replicas respond with when their copy was last modified. The most recent copy is returned, and any out-of-date replicas are updated after the request has been fulfilled. This means you can be fairly certain the data you get is the most current version, but it takes a little extra time to process.
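The freshness-first read described here can be sketched the same way. Again this is a simplified illustration rather than Dynamo's actual implementation: the `Replica` class and the single integer `modified` timestamp are assumptions, and a real system would carry richer version information.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A hypothetical copy of one piece of data."""
    name: str
    modified: int  # logical timestamp of the last write this replica saw (assumed)
    value: str


def read_latest(replicas):
    """Ask every replica, return the newest value, then repair the stale ones."""
    newest = max(replicas, key=lambda r: r.modified)
    result = newest.value  # the caller's answer is decided here
    for replica in replicas:
        if replica.modified < newest.modified:
            # Repair happens only after the answer has been chosen.
            replica.value = newest.value
            replica.modified = newest.modified
    return result


replicas = [
    Replica("a", modified=1, value="draft"),
    Replica("b", modified=3, value="final"),
    Replica("c", modified=2, value="review"),
]
print(read_latest(replicas))          # final
print([r.value for r in replicas])    # ['final', 'final', 'final']
```

Checking every replica before answering is where the little bit of extra time goes.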

 

It can’t be stressed enough that using a GFS system will not leave you working with old data, and that a Dynamo system will not leave you with slow requests. Both GFS and Dynamo are very fast and rarely return out-of-date information. The systems were simply designed to align with the different priorities of the companies that made them. The differences between these systems show up when you start dealing with the millions or even billions of requests they handle each day. All cloud storage systems focus above all on speed, reliability, and elasticity (the ability to expand or contract). All the major cloud storage systems work very well, and they all borrow ideas from each other. For the end user, that means a number of options with similar features, which drives competition in the market.

These are very advanced systems, much more complicated than what is presented here. Some people are OK just letting the magic occur behind the scenes, and that’s fine. Other people want to peek behind the curtain and see the inner workings, and this is that peek at the workings of cloud storage. If you want a much more in-depth and technical look at cloud storage systems, take a look at this great article by Ars Technica.