Log Structured Merge Trees

The log-structured merge tree is used in the storage engine of many databases, including Cassandra1. It describes a data structure, often contrasted with B-trees, that’s mainly designed as a strategy for managing writes to disk.

The basic approach is this: Maintain a collection of stores, increasing geometrically in size from the top down. Store records in the top level first. When level L_i fills up, merge it into L_i+1, emptying L_i.

Each level is itself a key-value store. Lower levels are implemented as B+ trees, filled in breadth-first order to avoid wasting space and allow for more sequential reads. Lookups proceed by checking for the key in L_0, then proceeding to the next level if the key isn’t found.

        [][][]               <--- L_0: memory (updated in place)
      [][][][][][]           <--- L_1: disk (updated with a merge)
[][][][][][][][][][][][]     <--- L_2: disk (updated with a merge)

The log-structured approach contrasts with the update-in-place approach: while the former does sequential writes, relying on additional sequential IO to manage the structure asynchronously, the latter does random reads – relying on a single seek for each read.

The bLSM2 authors note the choice that web services face when storing small objects: optimize for low-latency reads, or for high-throughput writes. B-trees meet the former requirement, and LSMTs meet the latter by making sure that all writes to disk are batched.