Magic Pocket is Dropbox’s distributed storage system. It prioritizes simplicity, durability and availability, with a multi-level design that uses centralized coordination where possible. Data is replicated at every level: between zones (each of which spans an entire region, like the Eastern US) but also within each zone. The author notes in an HN comment the high-level similarities with GFS, which may warrant a closer look.
Below the level of the zone is the cell, the largest storage unit with central coordination. The cell contains volumes, which are replicated over several physical storage units, or OSDs. The cell’s master controls tasks like the creation and deletion of buckets, and the repair of volumes in case of OSD failure. The data flow of the cell works independently of the master, though: a separate replication table maps buckets to physical storage locations, so the cell keeps serving data if the master fails. The master’s protocol ensures consistency in the event of failure or restart of any component, notably by storing the generation of each volume-to-OSD mapping with the volume and in the replication table. For example, if the master hangs in the middle of a repair process and later comes back to life, it will not try to continue that repair (or will at least not leave the system inconsistent; the exact mechanism that prevents that is unclear).
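Since the exact mechanism is unclear, the following is only a sketch of one common way a generation number can fence off a stale master: every remapping is a compare-and-swap against the generation the writer last saw. The names `ReplicationTable`, `VolumeEntry`, `remap` and `StaleGeneration` are hypothetical and are not taken from Magic Pocket.

```python
from dataclasses import dataclass, field


@dataclass
class VolumeEntry:
    """One row of a hypothetical replication table."""
    osds: tuple
    generation: int = 0


class StaleGeneration(Exception):
    """Raised when a writer acts on an outdated view of a volume."""


@dataclass
class ReplicationTable:
    volumes: dict = field(default_factory=dict)

    def lookup(self, volume_id):
        entry = self.volumes[volume_id]
        return entry.osds, entry.generation

    def remap(self, volume_id, expected_generation, new_osds):
        """Compare-and-swap: apply the mapping only if the generation is unchanged."""
        entry = self.volumes[volume_id]
        if entry.generation != expected_generation:
            raise StaleGeneration(
                f"{volume_id}: expected generation {expected_generation}, "
                f"found {entry.generation}"
            )
        new_generation = expected_generation + 1
        self.volumes[volume_id] = VolumeEntry(tuple(new_osds), new_generation)
        return new_generation


table = ReplicationTable({"vol-1": VolumeEntry(("osd-a", "osd-b", "osd-c"))})

# Master instance A reads the mapping, then hangs in the middle of a repair.
_, gen_seen_by_a = table.lookup("vol-1")

# A replacement master B repairs the volume after osd-c fails.
_, gen_seen_by_b = table.lookup("vol-1")
table.remap("vol-1", gen_seen_by_b, ("osd-a", "osd-b", "osd-d"))

# A comes back to life and tries to finish its repair with the stale generation.
try:
    table.remap("vol-1", gen_seen_by_a, ("osd-a", "osd-b", "osd-e"))
    stale_write_applied = True
except StaleGeneration:
    stale_write_applied = False
```

In this sketch the revived master's write is rejected, so the mapping installed by its replacement survives. A real system would also need the OSDs themselves to check the generation stored with the volume, which is omitted here.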
Some other characteristics: the smallest units of file storage are immutable, like git’s or IPFS’s: changes are written to a journal and each change creates a new block. This prevents the inconsistency that could result from updating some copies of a file but not others. The system is also optimized for temporal locality (files that get changed are the most likely to be changed again soon) while making sure that even old files can be accessed quickly. It does this, per the author’s comment on HN, by applying the above replication scheme for 24 hours after a volume is created. After that, data is stored using a more space-efficient erasure-coding scheme.
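To make the two ideas in this paragraph concrete, here is a minimal sketch under stated assumptions: blocks are keyed by a SHA-256 hash of their content (as in git or IPFS; whether Magic Pocket keys blocks this way is not stated here), and a volume's storage scheme is chosen purely from its age. `BlockStore` and `storage_scheme` are hypothetical names, and no actual erasure coding is performed.

```python
import hashlib

REPLICATION_WINDOW_SECONDS = 24 * 60 * 60  # the 24 hours mentioned above


class BlockStore:
    """Append-only store: blocks are keyed by content hash and never modified."""

    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Storing identical content again is a no-op; nothing is overwritten.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]


def storage_scheme(volume_age_seconds: float) -> str:
    """Pick the scheme by volume age (assumed policy: a hard 24-hour cutoff)."""
    if volume_age_seconds < REPLICATION_WINDOW_SECONDS:
        return "replicated"
    return "erasure-coded"


store = BlockStore()
v1 = store.put(b"hello")
v2 = store.put(b"hello, world")  # an "edit" yields a new block under a new key
```

Because an edit never touches the old block, every copy of `v1` stays valid no matter how far the write of `v2` has propagated, which is the consistency property described above.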
In another interesting HN comment, the author points out that the blocks are actual blocks on the storage device, directly managed without any intermediate filesystem.