How the giants of the web store giant amounts of data

An Ars Technica article has revealed details of the systems used by Google, Microsoft, Amazon, and Yahoo to store huge amounts of data.

Google Data Center

Our increasing shift to cloud storage has required the giants of the web to rethink the file systems used to read and write the mountains of data that are accessed every second. An Ars Technica article reveals how these distributed file systems (DFS) are used today in Google, Yahoo, Microsoft, and Amazon's data centers. It also explains why this new paradigm in file handling is necessary, and how the systems differ from traditional desktop file systems. The databases that power products like Google Search can include files many hundreds of megabytes (if not gigabytes) in size that need to be readable and writable by many users at once, something conventional file systems, which lock files to prevent corruption, cannot manage.
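The core idea can be sketched in miniature: a DFS splits each huge file into fixed-size chunks and stores several copies of every chunk on different cheap servers, so many clients can read different pieces at once and a dead disk loses nothing. The chunk size, replication factor, and server names below are illustrative assumptions, not details from any of these systems (though GFS is known to use 64 MB chunks).

```python
# Toy sketch of DFS-style chunk placement. Not any real system's algorithm:
# CHUNK_SIZE, REPLICATION, and SERVERS are made-up parameters for illustration.

CHUNK_SIZE = 8           # bytes here, tiny for demonstration (GFS uses 64 MB)
REPLICATION = 3          # copies kept of each chunk
SERVERS = ["server-a", "server-b", "server-c", "server-d", "server-e"]

def place_chunks(data: bytes):
    """Split data into chunks and assign each chunk to REPLICATION servers."""
    placements = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        # Round-robin placement: each chunk starts on the next server along.
        start = (i // CHUNK_SIZE) % len(SERVERS)
        replicas = [SERVERS[(start + r) % len(SERVERS)]
                    for r in range(REPLICATION)]
        placements.append((chunk, replicas))
    return placements

layout = place_chunks(b"the quick brown fox jumps")
for chunk, replicas in layout:
    print(chunk, replicas)
```

Because every chunk lives on three machines, three different clients can read three different copies simultaneously, and no single file-level lock is ever needed.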

The lengths each company has gone to in overcoming these challenges are astonishing. Google has developed its own file system, GFS, designed to turn many low-cost servers and hard drives into reliable storage for the masses of data it uses. Unlike GFS, whose secrets Google keeps close to its chest, the other solutions are more widely available. Amazon's Dynamo has recently been made available to developers via Amazon Web Services; while it has many similarities to GFS, it also performs time-based consistency checking, so that when copies of the data disagree, only the latest changes are saved. Microsoft's Azure is designed for cloud use and uses a similar consistency check to Amazon's, though it is far stricter about how data writes are enforced. Hadoop, developed in large part at Yahoo, is freely available and shares many of GFS's benefits, and it can run on a wide variety of platforms or even be mounted on a regular PC via FUSE.
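One simple reading of that time-based consistency checking is "last write wins": each replica reports a timestamped value, and the merge keeps the newest one. The sketch below shows only that idea under that assumption; Dynamo-style stores actually use richer machinery (such as vector clocks) to spot conflicting writes, so this is not Amazon's algorithm.

```python
# Toy "last write wins" reconciliation: a hedged illustration of time-based
# consistency checking, NOT Dynamo's actual conflict-resolution mechanism.

def reconcile(replicas):
    """replicas: list of (timestamp, value) pairs reported by different
    servers for the same key. Returns the value of the newest write."""
    return max(replicas, key=lambda tv: tv[0])[1]

# Three replicas disagree after concurrent writes; the latest change survives.
print(reconcile([(100, "v1"), (250, "v3"), (180, "v2")]))  # prints v3
```

The trade-off is visible even in the toy: picking the newest timestamp silently discards the two older writes, which is why stricter systems like Azure enforce more on the write path instead of reconciling after the fact.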