Lecture Notes for CS525
Lecture
Added: [2020-08-24 Mon 17:38]
-
hash table
GPA -> students
-
binary tree
GPA -> students
1
SELECT * FROM students WHERE GPA > 3.5
-
can't be used
-
hash table
name > students
|
|
-
insert new student "Peter Smith"
Name | GPA | Major |
---|---|---|
Bob Bobsen | 3.5 | CS |
Alice Chenjie | 4.0 | CS |
Astrid Walsen | 2.3 | CS |
Algorithm 1
-
append to file
Name | GPA | Major |
---|---|---|
Bob Bobsen | 3.5 | CS |
Alice Chenjie | 4.0 | CS |
Astrid Walsen | 2.3 | CS |
Peter Smith | 1.0 | CS |
-
now sort
Name | GPA | Major | Bytes |
---|---|---|---|
Bob Bobsen | 3.5 | CS | 15 |
Alice Chenjie | 4.0 | CS | 18 |
Peter Smith | 1.0 | CS | 16 |
Astrid Walsen | 2.3 | CS | 18 |
-
delete Peter, change position of all of the rows after Peter $$O(n)$$
-
table
R
hasn
rows and tableS
hasm
rows. -
tables
R
,S
, andT
and each has 1000 rows
|
|
FOR EACH Row in R FOR EACH Row in S FOR EACH Row in T combine the row
$$O(n^3)$$ complexity
generalizing this: for an m-way join we have $$O(n^m)$$
misleading complexity
$$O(n) = 1 \cdots n$$ $$O(log n) = 10^9 \cdots n$$ O-notation ignores constant factors
Lecture
Lecture
Alignment and packing
-
C
adds padding to align data types to avoid having to move values at runtime when accessing them1 2 3 4 5 6 7 8 9 10 11 12 13 14
#include <stdio.h> typedef struct A { char a; // 1 byte -- at Memory location 0 // inject padding 7 byte long b; // 8 bytes -- at memory location 8 } A; int main(int argc, char *argv[]) { A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long) printf("the size of A is %u", sizeof(A)); return 0; }
the size of A is 16
-
you can force the compiler to not align to save space, e.g., in
gcc
use__attribute__((packed))
1 2 3 4 5 6 7 8 9 10 11 12 13
#include <stdio.h> typedef struct A { char a; // 1 byte -- at Memory location 0 long b; // 8 bytes -- at memory location 1 (not aligned to 8 byte boundry) } __attribute__((packed)) A; int main(int argc, char *argv[]) { A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long) printf("the size of A is %u", sizeof(A)); return 0; }
the size of A is 9
Row versus Column
A (2 byte int) | B (1 byte char) |
---|---|
13 | a |
15 | b |
16 | c |
Row-oriented storage
-
assume that 6 bytes stored on each pages (no page headers)
Block 1 [13,a] [15,b] Block 2 [16,c]
Column-oriented storage
Block 1 [13,15,16] Block 2 [a,b,c]
|
|
Sorting and binary association tables
resort A: [16,15,13]
-
loose the value-row association!
-
solution: binary tables
RID | A |
---|---|
1 | 13 |
2 | 15 |
3 | 16 |
-
sorting can now be done without loosing the association between values and rows.
RID | A |
---|---|
3 | 16 |
2 | 15 |
1 | 13 |
Hybrid formats
-
e.g., store data column within the page, but still each record is stored on a single page (or use chunks that are multiple pages each)
Block 1 [13,15] [a,b] Block 2 [16],[c]
Lecture
Lecture
Tree depth calculation
Binary trees
-
each node has two children
1024
entries -> depth
$$\lceil log_2(1024)\rceil = 10$$
B+-trees
-
each node is one block size
-
e.g., 256 children per node
1024
entries -> depth
$$\lceil log_{256}(1024) \rceil = 2$$
Lecture
Lecture
Multi-key indexes
-
table
R(A,B,C,D)
-
#rows = 10,000,000
-
#rows where A = 3: 1,000,000
-
#rows where B = 4: 2,000,000
-
#rows where A = 3 AND B = 4: 300
-
-
index on A and an index on B
|
|
-
index A and B
Multi-key B-tree?
-
B-tree on (A,B)
A | B | C | D |
---|---|---|---|
1 | 1 | a | b |
1 | 2 | a | c |
3 | 4 | a | d |
-
Index 1: Keys (A,B):
-
"1,1"
-
"1,2"
-
"3,4"
-
-
Keys (B,A):
-
"1,1"
-
"2,1"
-
"4,3"
-
-
Can we use index 1 to answer a query
-
Queries on A,B: YES
-
Queries on A: YES, example
A=3
-
Queries on B: NO, example
B=2
-
Grid files
-
2 dimensions with 100 values
-
number of directory entries?
-
-
10 dimensions with 100 values
-
number of directory entries?
Lecture
|
|
Lecture
Order-by
-
relations (bags and sets are unordered)
-
order-by cannot exist as an operator
-
-
order as data
A | B |
---|---|
3 | 4 |
1 | 5 |
7 | 2 |
|
|
-
query result
A | B |
---|---|
1 | 5 |
3 | 4 |
7 | 2 |
-
algebra result (order by operator) $$\omega_{A \rightarrow position}(R)$$
A | B | position |
---|---|---|
3 | 4 | 2 |
1 | 5 | 1 |
7 | 2 | 3 |
|
|
$$\sigma_{position \leq 3}(\omega_{salary \to position}(Employee)$$
|
|
$$q_{max} = \pi_{name, salary}(\alpha_{max(salary) \to X}(Employee) \bowtie_{X = salary} (Employee)))$$ $$q_{top-2} = \pi_{name, salary}(\alpha_{max(salary) \to X}(\pi_{salary}(Employee) - \alpha_{max(salary)}(Employee)) \bowtie_{X = salary} (Employee)))$$ $$q = q_{max} \cup q_{top-2}$$
|
|
-
unnesting
-
decorrelation
|
|
$$\pi_{A}(\sigma_{A = 3}(R \times S))$$
$$(\pi_{A}(\sigma_{A=3}(R)) \times \pi_{}(S))$$
Lecture
A | B | size |
---|---|---|
1 | asdasd | 5 |
2 | asdasdsadasdasd | 10 |
3 | asdasda | 6 |
$$S(R) = 7$$
$$\sigma_{A = 1}(R)$$
A | B | size |
---|---|---|
1 | asdasd | 5 |
A | B |
---|---|
1 | 2 |
1 | 2 |
3 | 4 |
4 | 5 |
V(R,A) = 3
$$\sigma_{A = 4}(R)$$
A | B |
---|---|
4 | 5 |
multi-dimensional histogram
-
10 buckets per attribute
-
$$n$$ and $$B$$ buckets, then n-dimensional histogram $$B^n$$ buckets.
Lecture
Lecture
Lecture
Question 1.1.6 (transitive closure with recursion)
digraph {
rankdir=LR;
a1 -> a2;
a1 -> a3;
a2 -> a4;
a4 -> a5;
}
author | coauthor |
---|---|
a1 | a2 |
a1 | a3 |
a2 | a4 |
a4 | a5 |
-
direct co-authors
|
|
-
direct one step indirect co-authors
|
|
-
indirect up to two step indirect co-authors
|
|
recursive queries
-
user provides $$Q_{init} \cup Q_{step}$$
-
Initialization $$D_0 = Q_{init}(D)$$
-
Iteration 1: $$D_1 = Q_{step}(D_0)$$
-
…
-
Iteration $i$: $$D_{i+1} = Q_{step}(D_i)$$
-
Fix point computation: stop when $$D_{i} = D_{i-1}$$
Quiz 1 - 1.4.3 - result size estimation
-
query $$q = (student \bowtie_{name=student} \sigma_{hoursAttended =15}(attendsLecture) \bowtie_{lecture= title} lecture$$
subqueries
-
query Q1 $$q_1 = \sigma_{hoursAttended =15}(attendsLecture)$$
-
query Q2 $$q_2 = student \bowtie_{name=student} q_1$$
-
query Q3 $$q = q_2 \bowtie_{lecture=title} lecture$$
-
every subquery is just one operator
-
which might access other subqueries' results
-
ok, as long as we can estimate statistics for the results
Solve Subquery Q1
$$T(q_1) =\frac{T(attendsLecture)}{V(attendsLecture,hoursAttended)} = \frac{40,000}{100} = 400 $$ $$V(q_1, student) = min(V(attendsLecture,student),T(q_1)) = 400$$ $$V(q_1, lecture) = V(attendsLecture,lecture) = 8$$
Solve Subquery Q2
$$T(q_2) = \frac{T(students) \times T(q_1)}{max(V(student,name), V(q_1,student))} = \frac{8000 \times 400}{max(7500,400)} = \frac{8000 \times 400}{7500}$$ $$V(q_2,lecture) = V(q_1,lecture) = 8$$
Solve Final query Q
$$T(q) = \frac{T(q_2) \times T(lecture)}{max(V(q_2,lecture), V(lecture,title))} = \frac{8000 \times 400 \times 10}{7500 \times max(8,10)} = \frac{320000}{75} = 426.666666667 $$
Lecture
Exam Questions
-
exam duration: lecture duration
-
exam handed out at lecture start through blackboard
-
exam will be submitted to blackboard (send an email also just in case)
I/O estimation
$$R \bowtie S$$ $$B(R) = 10,000$$ $$B(S) = 30$$ $$M = 101$$
Block nested loop
$$ \lceil \frac{B(S)}{M-1} \rceil \times (min(B(S),M-1) + B(R))$$
$$ = \lceil \frac{30}{100} \rceil \times (30 + 10,000)$$ $$ = 1 \times 10,030 = 10,030$$
Merge join
-
are inputs sorted?
-
if yes
-
$$B(R) + B(S) = 10,000 + 30 = 10,030$$
-
if no
-
cost sorting R + cost of sorting S + B(R) + B(S)
-
sorting R $$2 \times B(R) \times (1 + \lceil \log_{M-1}(\frac{B(R)}{M}))$$ $$2 \times 10,000 \times (1 + \lceil \log_{100}(\frac{10,000}{101}))$$ $$ 20,000 \times (1 + 1) = 40,000$$
-
sorting S $$2 \times B(S) \times (1 + \lceil \log_{M-1}(\frac{B(S)}{M}))$$ $$2 \times 30 \times (1 + 0) = 60$$
-
total cost $$40,000 + 60 + 10,000 + 30 = 50,090$$
-
trick is if last merge phases can be done in memory in parallel, then we can save: $$2 \times (B(R) + B(S))$$
-
trick does not apply
101
for last merge ofR
and30
for last merge ofS
Hash join
-
if one of the tables fits into memory then (true for this example)
$$B(S) + B(R) = 10,000 + 30 = 10,030$$
-
if not, then we need
$$(B(R) + B(S)) (1 + 2 \times (\lceil \log_{M-1}(min(B(R),B(S))) \rceil - 1)$$
Lecture
why not use crossproduct always to avoid join reordering?
-
counter example
-
10 tables $$R_i$$ each has 1000 rows
-
size of the cross produce $$1000^10 = 10^30$$ with using $$10^6$$ cores, $$10^24$$
Lecture
Lecture
Lecture
|
|