Lecture Notes for CS525

Lecture <2020-08-24 Mon>

Added: [2020-08-24 Mon 17:38]

hash table GPA -> students

binary tree GPA -> students

1

SELECT * FROM students WHERE GPA > 3.5

can't be used
hash table name > students

1

SELECT * FROM students WHERE name = 'Jones'

insert new student "Peter Smith"

Name	GPA	Major
Bob Bobsen	3.5	CS
Alice Chenjie	4.0	CS
Astrid Walsen	2.3	CS

Algorithm 1

append to file

Name	GPA	Major
Bob Bobsen	3.5	CS
Alice Chenjie	4.0	CS
Astrid Walsen	2.3	CS
Peter Smith	1.0	CS

now sort

Name	GPA	Major	Bytes
Bob Bobsen	3.5	CS	15
Alice Chenjie	4.0	CS	18
Peter Smith	1.0	CS	16
Astrid Walsen	2.3	CS	18

delete Peter, change position of all of the rows after Peter $$O(n)$$
table R has n rows and table S has m rows.
tables R, S, and T and each has 1000 rows

1
2
3


SELECT *
FROM R, S, T
WHERE R.A = S.B AND S.C = T.D;

FOR EACH Row in R
   FOR EACH Row in S
       FOR EACH Row in T
           combine the row

$$O(n^3)$$ complexity

generalizing this: for an m-way join we have $$O(n^m)$$

misleading complexity

$$O(n) = 1 \cdots n$$ $$O(log n) = 10^9 \cdots n$$ O-notation ignores constant factors

Lecture <2020-08-26 Wed>

Lecture <2020-08-31 Mon>

Alignment and packing

C adds padding to align data types to avoid having to move values at runtime when accessing them

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


  #include <stdio.h>

  typedef struct A {
    char a;  // 1 byte -- at Memory location 0
    // inject padding 7 byte
    long b;  // 8 bytes -- at memory location 8
  } A;

  int main(int argc, char *argv[])
  {
      A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long)
      printf("the size of A is %u", sizeof(A));
      return 0;
  }

the size of A is 16

you can force the compiler to not align to save space, e.g., in gcc use __attribute__((packed))

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13


  #include <stdio.h>

  typedef struct A {
      char a;  // 1 byte -- at Memory location 0
      long b;  // 8 bytes -- at memory location 1 (not aligned to 8 byte boundry)
  } __attribute__((packed)) A;

  int main(int argc, char *argv[])
  {
      A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long)
      printf("the size of A is %u", sizeof(A));
      return 0;
  }

the size of A is 9

Row versus Column

A (2 byte int)	B (1 byte char)
13	a
15	b
16	c

Row-oriented storage

assume that 6 bytes stored on each pages (no page headers)

Block 1
[13,a]
[15,b]
Block 2
[16,c]

Column-oriented storage

Block 1
[13,15,16]

Block 2
[a,b,c]

1

SELECT A FROM R;

Sorting and binary association tables

resort A: [16,15,13]

loose the value-row association!
solution: binary tables

RID	A
1	13
2	15
3	16

sorting can now be done without loosing the association between values and rows.

RID	A
3	16
2	15
1	13

Hybrid formats

e.g., store data column within the page, but still each record is stored on a single page (or use chunks that are multiple pages each)

Block 1
[13,15]
[a,b]
Block 2
[16],[c]

Lecture <2020-09-02 Wed>

Lecture <2020-09-09 Wed>

Tree depth calculation

Binary trees

each node has two children

1024 entries -> depth $$\lceil log_2(1024)\rceil = 10$$

B+-trees

each node is one block size
e.g., 256 children per node

1024 entries -> depth $$\lceil log_{256}(1024) \rceil = 2$$

Lecture <2020-09-14 Mon>

Lecture <2020-09-16 Wed>

Multi-key indexes

table R(A,B,C,D)
- #rows = 10,000,000
- #rows where A = 3: 1,000,000
- #rows where B = 4: 2,000,000
- #rows where A = 3 AND B = 4: 300
index on A and an index on B

1
2
3


SELECT *
FROM R
WHERE A = 3 AND B = 4

index A and B

Multi-key B-tree?

B-tree on (A,B)

A	B	C	D
1	1	a	b
1	2	a	c
3	4	a	d

Index 1: Keys (A,B):
- "1,1"
- "1,2"
- "3,4"
Keys (B,A):
- "1,1"
- "2,1"
- "4,3"
Can we use index 1 to answer a query
- Queries on A,B: YES
- Queries on A: YES, example A=3
- Queries on B: NO, example B=2

Grid files

2 dimensions with 100 values
- number of directory entries?
10 dimensions with 100 values
number of directory entries?

Lecture <2020-09-23 Wed>

1
2
3
4
5


  SELECT a
    FROM
        (SELECT b
           FROM R
          WHERE a < 15) sub

Lecture <2020-09-28 Mon>

Order-by

relations (bags and sets are unordered)
- order-by cannot exist as an operator
order as data

A	B
3	4
1	5
7	2

1

SELECT * FROM R ORDER BY A ASC;

query result

A	B
1	5
3	4
7	2

algebra result (order by operator) $$\omega_{A \rightarrow position}(R)$$

A	B	position
3	4	2
1	5	1
7	2	3

1
2
3
4


  SELECT *
    FROM Employee
   ORDER BY salary
   LIMIT 3

$$\sigma_{position \leq 3}(\omega_{salary \to position}(Employee)$$

1
2
3
4


  SELECT *
    FROM Employee
   ORDER BY salary
  LIMIT 1;

$$q_{max} = \pi_{name, salary}(\alpha_{max(salary) \to X}(Employee) \bowtie_{X = salary} (Employee)))$$ $$q_{top-2} = \pi_{name, salary}(\alpha_{max(salary) \to X}(\pi_{salary}(Employee) - \alpha_{max(salary)}(Employee)) \bowtie_{X = salary} (Employee)))$$ $$q = q_{max} \cup q_{top-2}$$

1
2
3


  SELECT *
    FROM R
   WHERE EXISTS (SELECT * FROM S)

unnesting
decorrelation

1
2
3
4
5
6


  SELECT R.*
    FROM R, (SELECT DISTINCT 1 FROM S)

  SELECT R.*
    FROM R, (SELECT count(*) AS cnt FROM S) AS sub
  WHERE cnt > 0

$$\pi_{A}(\sigma_{A = 3}(R \times S))$$

$$(\pi_{A}(\sigma_{A=3}(R)) \times \pi_{}(S))$$

Lecture <2020-09-30 Wed>

A	B	size
1	asdasd	5
2	asdasdsadasdasd	10
3	asdasda	6

$$S(R) = 7$$

$$\sigma_{A = 1}(R)$$

A	B	size
1	asdasd	5

A	B
1	2
1	2
3	4
4	5

V(R,A) = 3

$$\sigma_{A = 4}(R)$$

A	B
4	5

multi-dimensional histogram

10 buckets per attribute
$$n$$ and $$B$$ buckets, then n-dimensional histogram $$B^n$$ buckets.

Lecture <2020-10-07 Wed>

Lecture <2020-10-12 Mon>

Lecture <2020-10-14 Wed>

Question 1.1.6 (transitive closure with recursion)

  digraph {
  rankdir=LR;

  a1 -> a2;
  a1 -> a3;
  a2 -> a4;
  a4 -> a5;

  }

author	coauthor
a1	a2
a1	a3
a2	a4
a4	a5

direct co-authors

1

  SELECT * FROM published;

direct one step indirect co-authors

1
2
3
4
5


  SELECT * FROM published
  UNION
  SELECT p1.author, p2.coauthor
    FROM published p1, published p2
   WHERE p1.coauthor = p2.author;

indirect up to two step indirect co-authors

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


  WITH oneStep AS (
    SELECT * FROM published
  ),
  twoStep AS (
  SELECT p1.author, p2.coauthor
    FROM oneStep p1, published p2
   WHERE p1.coauthor = p2.author
  ),
  threeStep AS (
  SELECT p1.author, p2.coauthor
    FROM twoStep p1, published p2
   WHERE p1.coauthor = p2.author
  )
  SELECT * FROM oneStep
   UNION
  SELECT * FROM twoStep
   UNION
  SELECT * FROM threeStep;

recursive queries

user provides $$Q_{init} \cup Q_{step}$$
- Initialization $$D_0 = Q_{init}(D)$$
- Iteration 1: $$D_1 = Q_{step}(D_0)$$

…

Iteration $i$: $$D_{i+1} = Q_{step}(D_i)$$

Fix point computation: stop when $$D_{i} = D_{i-1}$$

Quiz 1 - 1.4.3 - result size estimation

query $$q = (student \bowtie_{name=student} \sigma_{hoursAttended =15}(attendsLecture) \bowtie_{lecture= title} lecture$$

subqueries

query Q1 $$q_1 = \sigma_{hoursAttended =15}(attendsLecture)$$
query Q2 $$q_2 = student \bowtie_{name=student} q_1$$
query Q3 $$q = q_2 \bowtie_{lecture=title} lecture$$

every subquery is just one operator
which might access other subqueries' results
ok, as long as we can estimate statistics for the results

Solve Subquery Q1

$$T(q_1) =\frac{T(attendsLecture)}{V(attendsLecture,hoursAttended)} = \frac{40,000}{100} = 400 $$ $$V(q_1, student) = min(V(attendsLecture,student),T(q_1)) = 400$$ $$V(q_1, lecture) = V(attendsLecture,lecture) = 8$$

Solve Subquery Q2

$$T(q_2) = \frac{T(students) \times T(q_1)}{max(V(student,name), V(q_1,student))} = \frac{8000 \times 400}{max(7500,400)} = \frac{8000 \times 400}{7500}$$ $$V(q_2,lecture) = V(q_1,lecture) = 8$$

Solve Final query Q

$$T(q) = \frac{T(q_2) \times T(lecture)}{max(V(q_2,lecture), V(lecture,title))} = \frac{8000 \times 400 \times 10}{7500 \times max(8,10)} = \frac{320000}{75} = 426.666666667 $$

Lecture <2020-10-19 Mon>

Exam Questions

exam duration: lecture duration
exam handed out at lecture start through blackboard
exam will be submitted to blackboard (send an email also just in case)

I/O estimation

$$R \bowtie S$$ $$B(R) = 10,000$$ $$B(S) = 30$$ $$M = 101$$

Block nested loop

$$ \lceil \frac{B(S)}{M-1} \rceil \times (min(B(S),M-1) + B(R))$$

$$ = \lceil \frac{30}{100} \rceil \times (30 + 10,000)$$ $$ = 1 \times 10,030 = 10,030$$

Merge join

are inputs sorted?
- if yes

$$B(R) + B(S) = 10,000 + 30 = 10,030$$

if no

cost sorting R + cost of sorting S + B(R) + B(S)
sorting R $$2 \times B(R) \times (1 + \lceil \log_{M-1}(\frac{B(R)}{M}))$$ $$2 \times 10,000 \times (1 + \lceil \log_{100}(\frac{10,000}{101}))$$ $$ 20,000 \times (1 + 1) = 40,000$$
sorting S $$2 \times B(S) \times (1 + \lceil \log_{M-1}(\frac{B(S)}{M}))$$ $$2 \times 30 \times (1 + 0) = 60$$
total cost $$40,000 + 60 + 10,000 + 30 = 50,090$$
trick is if last merge phases can be done in memory in parallel, then we can save: $$2 \times (B(R) + B(S))$$
trick does not apply 101 for last merge of R and 30 for last merge of S

Hash join

if one of the tables fits into memory then (true for this example)

$$B(S) + B(R) = 10,000 + 30 = 10,030$$

if not, then we need

$$(B(R) + B(S)) (1 + 2 \times (\lceil \log_{M-1}(min(B(R),B(S))) \rceil - 1)$$

Lecture <2020-10-26 Mon>

why not use crossproduct always to avoid join reordering?

counter example
10 tables $$R_i$$ each has 1000 rows
size of the cross produce $$1000^10 = 10^30$$ with using $$10^6$$ cores, $$10^24$$

Lecture <2020-10-28 Wed>

Lecture <2020-11-02 Mon>

Lecture <2020-11-04 Wed>

1
2


  T1: UPDATE EMP SET job = 'system analyst' WHERE job = 'programmer'
  T2: UPDATE EMP SET job = 'programmer' WHERE job = 'system analyst'