Dropbox's Distributed Storage System

Magic Pocket is Dropbox’s distributed storage system. It prioritizes simplicity, durability, and availability, with a multi-level design that uses centralized coordination where possible. Data is replicated at every level: between zones (each of which spans an entire region, such as the Eastern US) and also within each zone. In a Hacker News (HN) comment, the author notes high-level similarities with GFS (the Google File System), which may warrant a closer look.

Below the zone is the cell, the largest storage unit with central coordination. A cell contains volumes, each of which is replicated over several physical storage units, called OSDs (object storage devices). The cell’s master controls tasks such as creating and deleting buckets (the logical containers that group blocks and are placed on volumes) and repairing volumes when an OSD fails. The cell’s data path works independently of the master, though: a separate replication table maps buckets to physical storage locations, so data access keeps working if the master fails. The master’s protocol ensures consistency when any component fails or restarts, notably by storing the generation of each volume-to-OSD mapping both with the volume and in the replication table. For example, if the master hangs in the middle of a repair and later comes back to life, it will not try to continue that repair (or will at least not leave the system inconsistent; the exact mechanism that prevents this is unclear).
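
Since the exact mechanism is unclear, the following is only a guess at how a generation number could stop a stalled repair from overwriting newer state. It is a minimal Python sketch, not Magic Pocket’s actual protocol: the names ReplicationTable, VolumeEntry, commit_repair and StaleGeneration are hypothetical, and the compare-and-set check is an assumed technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VolumeEntry:
    """One row of a hypothetical replication table."""
    generation: int   # bumped on every change of the volume-to-OSD mapping
    osds: tuple       # identifiers of the OSDs holding the volume


class StaleGeneration(Exception):
    """Raised when a repair was planned against an outdated mapping."""


class ReplicationTable:
    """Assumed in-memory stand-in for the cell's replication table."""

    def __init__(self):
        self._volumes = {}

    def create(self, volume_id, osds):
        self._volumes[volume_id] = VolumeEntry(1, tuple(osds))

    def lookup(self, volume_id):
        return self._volumes[volume_id]

    def commit_repair(self, volume_id, expected_generation, new_osds):
        """Install a new mapping only if nobody changed it in the meantime."""
        current = self._volumes[volume_id]
        if current.generation != expected_generation:
            # The mapping moved on while this repair was in flight: abandon it.
            raise StaleGeneration(
                f"{volume_id}: expected generation {expected_generation}, "
                f"found {current.generation}"
            )
        updated = VolumeEntry(expected_generation + 1, tuple(new_osds))
        self._volumes[volume_id] = updated
        return updated


table = ReplicationTable()
table.create("vol-1", ["osd-a", "osd-b", "osd-c"])

# A master reads the mapping, starts a repair, and then hangs.
planned_against = table.lookup("vol-1").generation

# Meanwhile a restarted master completes its own repair.
table.commit_repair("vol-1", planned_against, ["osd-a", "osd-b", "osd-d"])

# The first master wakes up; its commit is rejected instead of applied.
try:
    table.commit_repair("vol-1", planned_against, ["osd-a", "osd-b", "osd-e"])
except StaleGeneration as err:
    print("repair abandoned:", err)
```

Under these assumptions the late repair fails cleanly, and the table keeps the mapping written by the master that finished first.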

Some other characteristics. The smallest units of file storage are immutable, like objects in git or IPFS: changes are written to a journal, and each change creates a new block. This prevents the inconsistency that could result from updating some copies of a file but not others. The system also optimizes for temporal locality (files that have just changed are the most likely to change again soon) while making sure that even old files can be accessed quickly. Per the author’s comment on HN, it does this by applying the replication scheme described above for the first 24 hours after a volume is created. After that, the data is re-encoded with a more space-efficient erasure-coding scheme.
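
To make the two ideas above concrete, here is a small Python sketch of an immutable block store with a journal, followed by an age-based choice between replication and erasure coding. Everything in it is an assumption made for illustration: the names BlockStore and encoding_for are hypothetical, keying blocks by their SHA-256 hash is borrowed from the git/IPFS analogy, and treating exactly 24 hours as the cutoff is a guess.

```python
import hashlib
from datetime import timedelta


class BlockStore:
    """Toy store in which blocks are never modified once written."""

    def __init__(self):
        self._blocks = {}   # block key -> bytes, written once
        self.journal = []   # append-only list of (file_name, block_key)

    def put(self, file_name, data):
        """Record a change by adding a new block and a journal entry."""
        key = hashlib.sha256(data).hexdigest()   # assumed keying scheme
        self._blocks.setdefault(key, data)       # existing blocks are untouched
        self.journal.append((file_name, key))
        return key

    def get(self, key):
        return self._blocks[key]

    def latest(self, file_name):
        """Return the key of the most recent block journaled for a file."""
        for name, key in reversed(self.journal):
            if name == file_name:
                return key
        raise KeyError(file_name)


REPLICATION_WINDOW = timedelta(hours=24)   # taken from the HN comment


def encoding_for(volume_age):
    """Pick the storage scheme for a volume from its age (assumed policy)."""
    if volume_age < REPLICATION_WINDOW:
        return "replicated"
    return "erasure-coded"


store = BlockStore()
old_key = store.put("notes.txt", b"first draft")
new_key = store.put("notes.txt", b"second draft")

# The old block is still readable; the edit produced a new block instead.
print(store.get(old_key), store.get(store.latest("notes.txt")))
print(encoding_for(timedelta(hours=3)), encoding_for(timedelta(days=30)))
```

Because a block never changes after it is written, a reader can never observe a half-updated copy: it sees either the old block or the new one.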

In another interesting HN comment, the author points out that the blocks are actual blocks on the storage device, directly managed without any intermediate filesystem.