# CS595 - Storage - Data Placement **Lecturer**: [Boris Glavic](http://www.cs.iit.edu/~glavic/) **Semester**: fall 2021
# 2. Distributed Storage ## Data Placement
## What is Data Placement? - **Decide what data to store on which node of a cluster**
## Considerations - **Balanced storage load** - place data on nodes to achieve (roughly) even storage utilization - **Enable local computations** - can schedule computation on the node storing the data (reduced network traffic) - **Data skipping** - avoid reading data that can be shown to be irrelevant for the computation
## "Semantic" Data Placement - **How can we determine upfront which computation needs what data?** - **Use *semantic* properties for data placement** - place data based on its characteristics - *e.g., place all customers from a region (e.g., US) on the same node* - **Data skipping** - *e.g., do not read US customer data when computing revenue for region Europe*
## Data Partitioning - For storing (semi-)structured data - *e.g., relations or key-value pairs* - Place the data by partitioning collections of data items into subsets called **fragments** - *e.g., one subset of a table per node in the cluster*
## Type of Partitioning - **Horizontal Partitioning** - split a relation into **fragments** each storing a subset of the relation's rows - **Vertical Partitioning** - split a relation into **fragments** each storing a subset of the relation's columns
## Horizontal Partitioning - Given relation $R$, a **horizontal partition** of $R$ is a set of relations called **fragments** $R_1$, ..., $R_n$ such that: - the fragments are disjoint: $\forall i \neq j \in [1,n]: R_i \cap R_j = \emptyset$ - all rows are accounted for: $R_1 \cup \ldots \cup R_n = R$ - $\forall i \in [1,n]$: $R_i$ has the same schema as $R$
## Horizontal Partitioning - Example - `Items` | Item | Category | Price | |:----------|:----------|:------| | Lawnmover | Gardening | 1300 | | Shovel | Gardening | 14 | | Microwave | Kitchen | 40 | | Spoon | Kitchen | 3 |
## Horizontal Partitioning - Example - `Item_1` | Item | Category | Price | |:----------|:----------|:------| | Lawnmover | Gardening | 1300 | | Shovel | Gardening | 14 | | Microwave | Kitchen | 40 |
## Horizontal Partitioning - Example - `Item_2` | Item | Category | Price | |:----------|:----------|:------| | Spoon | Kitchen | 3 |
## Types of Horizontal Partitioning - **Hash Partitioning** - compute a hash function $h$ over a (set of) attribute(s) to map rows to fragments: - fragment $f_i = \\{ t \mid t \in R \land h(t.A) = i \\}$ - **Range Partitioning** - split the domain $\mathcal{D}(A)$ of an attribute $A$ into a set of ranges $\mathcal{R}_1$, ,..., $\mathcal{R}_n$ - fragment $f_i = \\{ t \mid t \in R \land t.A \in \mathcal{R}_i \\}$
## Hash Partitioning - Example - hash partition on `Category` with $h(Gardening) = 0$ and $h(Kitchen) = 1$ | Item | Category | Price | Fragment | |:----------|:----------|:------|---------:| | Lawnmover | Gardening | 1300 | $f_0$ | | Shovel | Gardening | 14 | $f_0$ | | Microwave | Kitchen | 40 | $f_1$ | | Spoon | Kitchen | 3 | $f_1$ |
## Range Partitioning - Example - range partition on `Price` using - $\mathcal{R}_1 = [1,20], \mathcal{R}_2 = [21,100], \mathcal{R}_3 = [101,3000]$ | Item | Category | Price | Fragment | |:----------|:----------|:------|---------:| | Lawnmover | Gardening | 1300 | $f_3$ | | Shovel | Gardening | 14 | $f_1$ | | Microwave | Kitchen | 40 | $f_2$ | | Spoon | Kitchen | 3 | $f_1$ |
## Vertical Partitioning - let $PK(R)$ be the primary key of $R$ - split the columns in ${Schema}(R)$ into subsets $S_1, \ldots, S_n$ such that: - all columns are accounted for: $\bigcup_{i=1}^{n} S_i = Schema(R)$ - each set of columns contains the primary key $\forall i \in [1,n]: PK(R) \in S_i$ - fragment $f_i = \\{ t.S_i \mid t \in R \\}$ - Note that $R_1 \bowtie \ldots \bowtie R_n = R$
## Vertical Partitioning - Example - assume that `Item` is the primary key - $S_1 = \\{Item, Category\\}, S_2 = \\{ Item, Category \\}$
## Vertical Partitioning - Example - fragment $f_1$ | Item | Category | |:----------|:----------| | Lawnmover | Gardening | | Shovel | Gardening | | Microwave | Kitchen | | Spoon | Kitchen |
## Vertical Partitioning - Example - fragment $f_2$ | Item | Price | |:----------|:------| | Lawnmover | 1300 | | Shovel | 14 | | Microwave | 40 | | Spoon | 3 |
## Balancing Storage Load - Do these partitioning methods provide any balancing of storage load? - **Vertical partitioning** - each node gets the same number of rows - column values may be different sizes though - coarse-grained partitioning, e.g., number of nodes > number of columns - **Hash partitioning** - "good" hash functions lead to an even distribution of data - the chance that there is a big imbalance is neglectable for large enough datasets - **skew**: when a single value is repeated many times then an large fraction of the data will end up on one node
## Example Skew - consider hash partitioning on `Author` for 100 nodes | Author | Tweet | node | |:--------------|:---------------------|------| | Famous Person | My new movie | 0 | | Famous Person | Eating something | 0 | | Famous Person | Hello | 0 | | Famous Person | Goodbye | 0 | | Pete Nobody | My vacation: ... | 1 | | Alice Nobody | Tips for travel, ... | 3 |
## Balancing Storage Load (cont.) - **Range partitioning** - selection of ranges determines distribution of storage load - how to determine ranges that lead to a good distribution of load? - still same problem with skew - even if we partition data perfectly, imbalance can result if data distribution changes over time
## Equi-depth vs. Equi-width Partitioning - Two common methods for range partitioning are: - **equi-width**: divide the domain of partition attribute into ranges with equal number of values - **equi-depth**: decide ranges such that each range covers (roughly) the same amount of data
## Computing Equi-width Partitions - to generate an **equi-depth partition** with `n` fragments of a relation on attribute `A` do: - compute `min(A)` and `max(A)` (requires one scan over the data) - create ranges by dividing `[min(A),max(A)]` into $n$ fragments $f_i$: $f_i = [min(A) + \frac{(max(A) + 1 - min(A)) \cdot (i-1)}{n}, min(A) + \frac{(max(A) +1 - min(A)) \cdot i}{n} - 1]$ - can apply sampling to speed this up (at loss of accuracy)
## Equi-width Partitioning Example - partition on `Salary` with `3` fragments - `min(Salary) = 50, max(Salary) = 650` - fragments: $ \{ [50,249], [250,449], [450,650] \} $ | Name | Salary | |:--------|--------| | Peter | 100 | | Bob | 170 | | Alice | 50 | | Moon | 150 | | Gertrud | 450 | | Joe | 650 |
## Computing Equi-depth Partitions - To generate an **equi-width partition** of relation `R` with `n` fragments on attribute `A` do: - Sort `R` on `A` and compute $|R|$ - Scan through the sorted data and create a range based on the `A` value at $\frac{|R| \cdot (i-1)}{n}$ and $\frac{|R| \cdot i}{n}$
## Equi-depth Partitioning Example - Partition on `Salary` with `3` fragments - Sort table on `A`, determine ranges based on each 2 consecutive tuples - Fragments: $\{ [50,149], [150,449], [450,650] \}$ | Name | Salary | | :-------- | :------- | | Alice | 50 | | Peter | 100 | | Moon | 150 | | Bob | 170 | | Gertrud | 450 | | Joe | 650 |
## Equi-depth vs. Equi-width Partitioning - Equi-depth partitioning is more expensive (sort vs. scan) - Equi-depth partitioning is better if data is not evenly distributed - each fragment gets the same amount of data (unless there is skew)
## Exploiting Partitioning in Query Processing - **Skipping fragments based on selection conditions** - hash partitioning on $A$ with hash function $h$ - `A = const` => only fragment for $h(const)$ contains rows where $A= const$ - range partitioning on $A$ - `A = const` => only fragment for the range containing `const` is needed - `c1 <= A <=c2` => only fragments whose range overlaps with $[c1,c2]$
## Exploiting Partitioning in Query Processing - **Skip fragments with columns that are not accessed by a query** - this query only needs the `Item` and `Price` attributes ```sql SELECT Item FROM Items WHERE Price < 20 ```
## Exploiting Partitioning in Query Processing - **Enable operator implementations that require certain data placement** - **aggregation** - all rows from one group on a single node - **join** - data with same join keys on the same node - **we will discuss this when discussing distributed query processing / compute**
## Shuffle - Repartitioning Data across a Cluster - As we will learn later in the course, certain distributed algorithms require data to be distributed in specific way - If the data is currently not partitioned as needed, it needs to be redistributed - this operations is often referred to as a **Shuffle** - We will discuss implementations of shuffle later in [MapReduce](/~glavic/cs595/2021-fall/lectures/map_reduce.html) and [MapReduce](/~glavic/cs595/2021-fall/lectures/distributed_op_implementations.html).
## Example Shuffle - Schema: `R(A,B)` - The table is hash partitioned across 4 nodes on attribute `A` using `h(x) = x` - Repartition the table on attribute `B` using the same hash function
G
n1
Node 1
A
B
1
2
1
3
n2
Node 2
A
B
2
2
2
4
n1->n2
(1,2)
n3
Node 3
A
B
3
1
3
1
n1->n3
(1,3)
n4
Node 4
A
B
4
2
4
4
n2->n4
(2,4)
n3->n1
(3,1),(3,1)
n4->n2
(4,2)
G
n1
Node 1
A
B
3
1
3
1
n3
Node 3
A
B
1
3
n2
Node 2
A
B
1
2
2
2
4
2
n4
Node 4
A
B
2
4
4
4
## Summary - Horizontal (hash or range) and vertical partitioning - Exploiting partitioning for computation