# CS595 - Storage - LSM-Trees **Lecturer**: [Boris Glavic](http://www.cs.iit.edu/~glavic/) **Semester**: Fall 2021
# 2. Distributed Storage ## LSM-Trees
## Motivation - We will discuss an index structure called **LSM-trees** - This index structure is **write optimized**, i.e., provides high write throughput - **range-query performance** is sacrificed to some degree - LSM-trees are used in most implementations of key-value stores
## The History of LSM - LSM-trees were first proposed by O'Neil et al. in the 90s as a write-optimized index structure > The log-structured merge-tree (LSM-tree). Patrick O'Neil, Edward Cheng, Dieter Gawlick, Elizabeth O'Neil. Acta Informatica, 1996. - With the emergence of key-value stores (NoSQL databases) in the late 2000s, LSM-trees became widely deployed in real systems and their optimization was studied intensively in academia
## Survey of LSM Techniques - This paper provides a good overview of LSM-tree techniques: > LSM-based Storage Techniques: A Survey. Chen Luo, Michael J. Carey. VLDB J., 2020.
## LSM-trees - An LSM-tree is a data structure that indexes **key-value pairs** - **Supported operations** - **insert** `(key, value)` - insert a new key-value pair into the index - **update** `(key, value)` - set the value of `key` to be `value` - **point query** `key` - find the value `value` associated with `key` (or indicate that `key` does not exist in the index) - **range query** `[k1,k2]` - return all key-value pairs `(k,v)` such that $k \in [k1,k2]$ - **delete** `key` - delete `key` from the index
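A minimal sketch of this interface in Python (the class and method names are illustrative, not taken from any particular system):

```python
from abc import ABC, abstractmethod
from typing import Iterator, Optional, Tuple

class KVIndex(ABC):
    """The five operations listed above (hypothetical interface)."""

    @abstractmethod
    def insert(self, key: str, value: str) -> None: ...

    @abstractmethod
    def update(self, key: str, value: str) -> None: ...

    @abstractmethod
    def point_query(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def range_query(self, k1: str, k2: str) -> Iterator[Tuple[str, str]]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...
```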
## LSM-tree Concepts - LSM-trees are **write-optimized** - LSM-trees use **out-of-place updates** - instead of updating values of existing keys in place, LSM-trees create a new entry on every update - LSM-trees consist of multiple **components** - each component is some regular index structure (we discuss two typical choices in the following) - not all components have to use the same type of index structure
## LSM-tree Structure - LSM-trees organize components into **levels** - the component at the top level $L_0$ lives in memory - this component is sometimes called a **memtable** - the **memtable** is subject to a size limit $\mathcal{S}$ - components at the lower levels $L_1, L_2, \ldots$ are stored on disk and are **immutable** - components at level $L_{i+1}$ are a factor $T$ (the **size ratio**) larger than components at level $L_i$
## Leveling Merge Policy - Each level has a single component - When the component at level $l$ reaches its size limit $\mathcal{S} \cdot T^{l}$, it is merged into the component at level $l+1$, creating a new component at level $l+1$ that replaces the old one - A merge from level $l$ to level $l+1$ may cause the component at level $l+1$ to also overflow (reach its size limit) - that is, a merge may trigger further merges at lower levels - A common size ratio is $T=10$, i.e., the size limit for the component at level $l+1$ is $10$ times the size limit for the component at level $l$
## LSM-tree Modification Operations - **memtable** - all insert/update/delete operations are applied to the memtable - when the **memtable** is full, it is merged into the disk-resident component at level `L1` and a **new** empty memtable is created - **disk components** - components at level $l > 0$ are stored on disk - these components are immutable, i.e., not updated in place - when such a component is full, it is merged into the component at the next level
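To make the flush / cascading-merge mechanics concrete, here is a toy sketch under the leveling policy (a simplification: plain Python dicts stand in for the real component index structures):

```python
class LSMTree:
    """Toy LSM-tree: memtable limit S, size ratio T (leveling policy)."""

    def __init__(self, S=2, T=10):
        self.S, self.T = S, T
        self.memtable = {}        # level L0, lives in memory
        self.levels = []          # levels L1, L2, ... (disk-resident in reality)

    def put(self, key, value):
        self.memtable[key] = value           # all inserts/updates go to L0
        if len(self.memtable) > self.S:      # memtable full: flush it
            self._merge_down(0, self.memtable)
            self.memtable = {}               # fresh empty memtable

    def _merge_down(self, i, component):
        # merge `component` into self.levels[i], which models level L_{i+1}
        if i == len(self.levels):
            self.levels.append({})
        merged = {**self.levels[i], **component}   # pushed-down entries are newer
        self.levels[i] = merged
        if len(merged) > self.S * self.T ** (i + 1):   # overflow cascades down
            self._merge_down(i + 1, merged)
            self.levels[i] = {}
```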
## Existence of Keys at Multiple Levels - A key $k$ may exist in more than one component with different values - This happens when a component containing the key is merged into the next level - Because merging always pushes entries from a level to the next-deeper level, the following invariant holds: - if a key exists at levels $l$ and $l' > l$, then the key-value pair at level $l$ is newer than the one at level $l'$
## Merging Components - To merge a component $C\_i$ at level $l$ into the component $C\_{i+1}$ at level $l+1$: - we scan through both $C\_i$ and $C\_{i+1}$ in key order and output a merged component $C\_{new}$ - for keys that appear in both components, we use the value from $C\_{i}$ because it is guaranteed to be newer (see previous slide) - This is essentially the merge phase of an external merge sort!
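The merge step itself, sketched for two components represented as key-sorted lists of `(key, value)` pairs (with $C\_i$ passed as `newer` and $C\_{i+1}$ as `older`):

```python
def merge_components(newer, older):
    """Merge two key-sorted lists of (key, value) pairs; newer values win."""
    out, i, j = [], 0, 0
    while i < len(newer) and j < len(older):
        if newer[i][0] < older[j][0]:
            out.append(newer[i]); i += 1
        elif newer[i][0] > older[j][0]:
            out.append(older[j]); j += 1
        else:                        # key present in both: keep the newer value
            out.append(newer[i]); i += 1; j += 1
    out.extend(newer[i:])            # append whatever remains of either input
    out.extend(older[j:])
    return out
```

On the upcoming insert example, `merge_components([("k1","v7"), ("k2","v8")], [("k1","v5"), ("k8","v6")])` returns `[("k1","v7"), ("k2","v8"), ("k8","v6")]`.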
## LSM-tree Inserts - `Insert(k,v)` - to insert `k:v`, insert it into the memtable - how this works precisely depends on the type of index used as a memtable
## Insert / Update Example - Let's consider $T=2$ and $\mathcal{S} = 2$ (up to two entries in the memtable) - On the next slide we insert a new key-value pair `k3:v9`, causing the memtable (`L0`) to exceed its size limit (2), which triggers: - creation of a new empty memtable `L0'` - a merge of the current memtable `L0` with the current component at level `L1`, forming the new component at level `L1`
*Figure (insert example): before the insert, `L0` = {`k1:v7`, `k2:v8`}, `L1` = {`k1:v5`, `k8:v6`}, `L2` = {`k1:v1`, `k4:v2`, `k5:v3`, `k7:v4`}. Inserting `k3:v9` creates a fresh memtable `L0'` = {`k3:v9`} and merges the old `L0` into `L1`, producing `L1'` = {`k1:v7`, `k2:v8`, `k8:v6`} (for `k1`, the newer `v7` replaces `v5`); `L2` is unchanged.*
## Point Queries - As discussed [here](#/keys-on-multi-levels): - a key may exist at multiple levels of an LSM-tree - if key $k$ exists at levels $l$ and $l' > l$, then the version at $l$ is newer - A point query for key $k$ starts at level `L0` and proceeds level by level until either - we have searched the deepest level without finding $k$ => the key is not in the index - we have found $k$ at the current level => return its value - This guarantees that we always return the latest version of the entry for $k$! (see the sketch after the example below)
## Point Query Example - On the next slide we are searching for `k1` - The search starts at the memtable `L0'` and probes one level at a time until either - the key is not found at the deepest level => the key is not in the index - the key is found => return its value
*Figure (point query for `k1`): step 1 probes `L0'` = {`k3:v9`}: have `k1`? no. Step 2 probes `L1` = {`k1:v7`, `k2:v8`, `k8:v6`}: have `k1`? yes, return `v7`. The older entry `k1:v1` in `L2` = {`k1:v1`, `k4:v2`, `k5:v3`, `k7:v4`} is never reached.*
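A sketch of this search over the toy `LSMTree` from earlier (dict components; tombstones are ignored here, they are discussed later):

```python
def point_query(lsm, key):
    # probe components from newest (memtable) to oldest; because of the
    # invariant above, the first hit is the latest version of the key
    for component in [lsm.memtable] + lsm.levels:    # L0, L1, L2, ...
        if key in component:
            return component[key]
    return None                                      # key not in the index
```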
## Range Queries - Given a range $[k\_1,k\_2]$ a range query returns all key-value pairs for keys $k \in [k\_1,k\_2]$ - Recall that a key may exist at multiple levels $l\_{i\_1}$, ..., $l\_{i\_k}$ - we only need to return the newest value (the one from $l\_{i\_1}$) - The brute-force solution is to scan the components one by one and then merge the resulting sequences of key-value pairs (as in merging components) - Possible optimizations (see the sketch after the example below): - we can scan all components in parallel - we can use a heap to merge the streams produced by the component scans
*Figure (range query `range(k2,k6)`): scanning `L0` = {`k3:v9`} yields `k3:v9`; scanning `L1` = {`k1:v7`, `k2:v8`, `k5:v6`} yields `k2:v8`, `k5:v6`; scanning `L2` = {`k1:v1`, `k4:v2`, `k5:v3`, `k7:v4`} yields `k4:v2`, `k5:v3`. Merging the three streams (the newest value per key wins, so `k5:v6` from `L1` shadows `k5:v3` from `L2`) gives the final result `k2:v8`, `k3:v9`, `k4:v2`, `k5:v6`.*
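A sketch of the heap-based merge using Python's `heapq.merge` (streams are assumed to be key-sorted `(key, value)` iterables, listed from newest, `L0`, to oldest):

```python
import heapq

def _tagged(stream, lvl, k1, k2):
    # tag entries with their level so that, on equal keys, the heap yields
    # the entry from the newer (smaller-numbered) level first
    for k, v in stream:
        if k1 <= k <= k2:
            yield (k, lvl, v)

def range_query(streams, k1, k2):
    merged = heapq.merge(*(_tagged(s, lvl, k1, k2)
                           for lvl, s in enumerate(streams)))
    result, last_key = [], None
    for k, lvl, v in merged:
        if k != last_key:         # first occurrence of a key = newest version
            result.append((k, v))
            last_key = k
    return result
```

On the example above, `range_query([[("k3","v9")], [("k1","v7"),("k2","v8"),("k5","v6")], [("k1","v1"),("k4","v2"),("k5","v3"),("k7","v4")]], "k2", "k6")` returns `[("k2","v8"), ("k3","v9"), ("k4","v2"), ("k5","v6")]`.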
## LSM-tree Deletion - deleting a key just from the memtable is not enough, because previous versions of the key may still exist at lower levels - **solution**: insert the key with a **tombstone**, a special value that indicates that the key has been deleted - changes to look-ups: when the search encounters a tombstone for the key, it stops and reports that the key does not exist - changes to merging: merging a tombstone for key $k$ from $C\_i$ with an entry for $k$ from $C\_{i+1}$ excludes $k$'s old value from the merged component - note that the tombstone itself has to be kept unless we merge into the deepest level, since even older versions of $k$ may survive further down (see the sketch after the example below)
## Delete Example - assume $\mathcal{S} = 3$ (up to three entries in the memtable) - we delete key `k3` (which is present in the memtable `L0`) and key `k2` (which is not in `L0`, but exists in `L1`) - instead of removing `k3` and doing nothing about `k2` (which already does not exist in the memtable), we have to record both with a tombstone
*Figure (delete example): `L0` (before) = {`k3:v9`}; `L0` (after) = {`k3:Tombstone`, `k2:Tombstone`}. The disk components `L1` = {`k1:v7`, `k2:v8`, `k8:v6`} and `L2` = {`k1:v1`, `k4:v2`, `k5:v3`, `k7:v4`} are untouched.*
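Tombstone handling, sketched on top of the earlier toy `LSMTree` (the sentinel name is illustrative):

```python
TOMBSTONE = object()    # special value marking a deleted key

def delete(lsm, key):
    # never remove entries in place; record the deletion in the memtable instead
    lsm.memtable[key] = TOMBSTONE

def point_query(lsm, key):
    for component in [lsm.memtable] + lsm.levels:
        if key in component:
            value = component[key]
            # a tombstone means the key was deleted: stop, report "not found"
            return None if value is TOMBSTONE else value
    return None
```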
## Requirements for Indexes for Components - In principle, any index structure that supports range-based access can be used to implement the components of an LSM-tree - The memtable and the disk components have different access patterns - **Memtable**: mostly point queries and in-place updates, plus occasional sorted access to all entries (during a merge) - **Disk-resident components**: only require fast queries and sorted access to all entries during a merge - Advantage of LSM: we can choose different data structures for the memtable and the disk-based components
## Index Structures for Memtables - Since memtables live only in memory, we are not restricted to index structures that operate efficiently from secondary storage - e.g., [skip-lists](https://en.wikipedia.org/wiki/Skip_list) are a common choice - For disk components we can exploit the immutability of the data - storing the data sorted on the key with some additional lightweight index is feasible - common choices are **SSTables** (introduced next) and **B-trees**
## SSTables - An **SSTable** is a file that stores a set of key-value pairs on disk, sorted by key - key-value pairs are stored in blocks whose size is a multiple of the disk block size - In addition, the SSTable contains a lightweight index (stored at the end) which records which blocks store which range of keys - this is used to speed up queries (both point and range) - this index is small enough to be kept in memory
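A toy model of the lightweight block index (here a sparse index mapping each block to its smallest key; names are illustrative):

```python
import bisect

class SSTableIndex:
    """Sparse index over an SSTable: the first key of every block, in order."""

    def __init__(self, first_key_per_block):
        self.first_keys = first_key_per_block     # e.g. ["k1", "k4", "k7"]

    def block_for(self, key):
        # the only block that can contain `key` is the last block whose
        # first key is <= key
        i = bisect.bisect_right(self.first_keys, key) - 1
        return i if i >= 0 else None              # None: key precedes the table

# SSTableIndex(["k1", "k4", "k7"]).block_for("k5") == 1, so a point query
# reads only block 1 from disk
```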
## Runtime Complexity of Operations
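A back-of-envelope sketch for the leveling policy (assuming $N$ total entries, memtable limit $\mathcal{S}$, size ratio $T$, and disk pages holding $B$ entries each):

$$L \simeq \log_T \frac{N}{\mathcal{S}} \text{ levels}, \qquad \text{point query: } O(L) \text{ components probed}, \qquad \text{insert: } O\!\left(\frac{T \cdot L}{B}\right) \text{ amortized I/Os}$$

(each entry is rewritten up to $T$ times per level before that level overflows, and every I/O moves $B$ entries at once)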
## Bloom Filters - A [bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is an over-approximation of a set - It supports two operations: - `insert(e)`: inserts the element $e$ into the bloom filter - `contains(e)`: determines whether element $e$ is in the bloom filter - Given any set of values $\mathcal{S} = \{ e_1, \ldots, e_n \}$ inserted into a bloom filter: - `contains(e)` for $e \in \mathcal{S}$ is guaranteed to return true **(no false negatives)** - `contains(e)` for $e \not\in \mathcal{S}$ returns true with a probability $\epsilon$ (called the **false positive rate**) and false otherwise **(false positives are possible)**
## How do Bloom Filters Work? - A bloom filter consists of a bit array $B$ of $m$ bits and uses $k$ hash functions $h_1, \ldots, h_k$ - initially, all bits of the bloom filter are set to $0$ - `insert(B,e)` - for each $i \in [1,k]$ set $B[h_i(e)] = 1$ - `contains(B,e)` - return true if $\bigwedge_{i=1}^{k} B[h_i(e)] = 1$, and false otherwise
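A minimal bloom filter sketch (deriving the $k$ hash functions by salting a single cryptographic hash is just one convenient choice):

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m                 # all bits start at 0

    def _positions(self, e):
        # h_1 .. h_k, derived by salting SHA-256 with the index i
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{e}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, e):
        for p in self._positions(e):
            self.bits[p] = 1                # set B[h_i(e)] = 1

    def contains(self, e):
        # true iff all k bits are set; may yield false positives
        return all(self.bits[p] for p in self._positions(e))
```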
## Bloom Filter Example - Consider $k = 2$ and $m = 6$ - $h_1(x) = (x+1) \bmod 6$ - $h_2(x) = (2 \cdot x) \bmod 6$
*Figure (bloom filter example): inserting `7` sets bit $h_1(7) = 8 \bmod 6 = 2$ and bit $h_2(7) = 14 \bmod 6 = 2$; inserting `15` sets bit $h_1(15) = 16 \bmod 6 = 4$ and bit $h_2(15) = 30 \bmod 6 = 0$. Resulting bit array (positions 0 to 5): `1 0 1 0 1 0`.*
## Optimizing Bloom Filters - The $m$ and $k$ parameters of a bloom filter can be chosen to achieve a particular false positive rate $\epsilon$ for a given number of elements $n$ inserted into the filter - increasing $m$ reduces $\epsilon$, but increases the storage requirements of the bloom filter - increasing $k$ can reduce $\epsilon$ (up to an optimal point), but increases the computational cost of the `contains` and `insert` operations - it can be shown that the optimal number of bits per element is $m/n \simeq -1.44 \log_2 \epsilon$
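For example, targeting $\epsilon = 1\%$ (and using the standard optimum $k \simeq (m/n) \ln 2$ for the number of hash functions):

$$m/n \simeq -1.44 \log_2 0.01 \simeq 9.6 \text{ bits per element}, \qquad k \simeq 9.6 \cdot \ln 2 \simeq 7$$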
## Using Bloom Filters to Speed-up Reads in LSM-trees - We can maintain for each component a bloom filter that over-approximates the set of keys stored in the component - These filters are typically small enough to live in memory - To determine whether a component contains a key $k$ - first check whether the bloom filter contains $k$ - if not, then $k$ is guaranteed to not be in the component and we can skip it - otherwise we proceed to search for $k$ in the component - With a low false positive rate, e.g., $\epsilon = 1\%$, the expected look-up cost is $\simeq 1$ component search: each component that does not hold the key is searched only with probability $\epsilon$
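The guarded look-up, sketched (assuming one bloom filter per component, `filters[i]` belonging to `components[i]`):

```python
def point_query(components, filters, key):
    for component, bf in zip(components, filters):
        if not bf.contains(key):
            continue                  # guaranteed absent: skip, no disk access
        if key in component:          # a filter hit may be a false positive
            return component[key]
    return None
```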
## Leveled vs. Tiered Design - The design we have discussed so far is called the **leveling** strategy / design - An alternative is the so-called **tiering** design, which allows for multiple components at each level - tiering reduces merge cost (and thus write amplification), but reads have to examine more components
## Summary - **LSM-trees** are write-optimized indexes used in many implementations of key-value stores - memtable + one or more disk components - Entries move over time from the memtable down through the levels of disk components - leveled vs. tiered strategy - Writes are always sequential - merging of components is akin to the merge phase of an external merge sort - Reads may require examining multiple components - => reads are slower (mitigated by bloom filters) - Disk components are immutable - e.g., can be supported in append-only file systems like HDFS