CS525 - Advanced Database Organization - 2020 Fall

Course webpage for CS525 - 2020 Fall taught by Boris Glavic

Lecture Notes for CS525

Lecture <2020-08-24 Mon>

Added: [2020-08-24 Mon 17:38]

  • hash table GPA -> students

  • binary tree GPA -> students

    1
    
    SELECT * FROM students WHERE GPA > 3.5
  • can't be used

  • hash table name > students

1
SELECT * FROM students WHERE name = 'Jones'
  • insert new student "Peter Smith"

Name GPA Major
Bob Bobsen 3.5 CS
Alice Chenjie 4.0 CS
Astrid Walsen 2.3 CS

Algorithm 1

  • append to file

Name GPA Major
Bob Bobsen 3.5 CS
Alice Chenjie 4.0 CS
Astrid Walsen 2.3 CS
Peter Smith 1.0 CS
  • now sort

Name GPA Major Bytes
Bob Bobsen 3.5 CS 15
Alice Chenjie 4.0 CS 18
Peter Smith 1.0 CS 16
Astrid Walsen 2.3 CS 18
  • delete Peter, change position of all of the rows after Peter $$O(n)$$

  • table R has n rows and table S has m rows.

  • tables R, S, and T and each has 1000 rows

1
2
3
SELECT *
FROM R, S, T
WHERE R.A = S.B AND S.C = T.D;
FOR EACH Row in R
   FOR EACH Row in S
       FOR EACH Row in T
           combine the row

$$O(n^3)$$ complexity

generalizing this: for an m-way join we have $$O(n^m)$$

misleading complexity

$$O(n) = 1 \cdots n$$ $$O(log n) = 10^9 \cdots n$$ O-notation ignores constant factors

Lecture <2020-08-26 Wed>

Lecture <2020-08-31 Mon>

Alignment and packing

  • C adds padding to align data types to avoid having to move values at runtime when accessing them

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    
      #include <stdio.h>
    
      typedef struct A {
        char a;  // 1 byte -- at Memory location 0
        // inject padding 7 byte
        long b;  // 8 bytes -- at memory location 8
      } A;
    
      int main(int argc, char *argv[])
      {
          A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long)
          printf("the size of A is %u", sizeof(A));
          return 0;
      }
    the size of A is 16
    
  • you can force the compiler to not align to save space, e.g., in gcc use __attribute__((packed))

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    
      #include <stdio.h>
    
      typedef struct A {
          char a;  // 1 byte -- at Memory location 0
          long b;  // 8 bytes -- at memory location 1 (not aligned to 8 byte boundry)
      } __attribute__((packed)) A;
    
      int main(int argc, char *argv[])
      {
          A *x = malloc(sizeof(A)); // = sizeof(char) + sizeof(long) - 1 + sizeof(long)
          printf("the size of A is %u", sizeof(A));
          return 0;
      }
    the size of A is 9
    

Row versus Column

A (2 byte int) B (1 byte char)
13 a
15 b
16 c

Row-oriented storage

  • assume that 6 bytes stored on each pages (no page headers)

Block 1
[13,a]
[15,b]
Block 2
[16,c]

Column-oriented storage

Block 1
[13,15,16]

Block 2
[a,b,c]
1
SELECT A FROM R;

Sorting and binary association tables

resort A: [16,15,13]

  • loose the value-row association!

  • solution: binary tables

RID A
1 13
2 15
3 16
  • sorting can now be done without loosing the association between values and rows.

RID A
3 16
2 15
1 13

Hybrid formats

  • e.g., store data column within the page, but still each record is stored on a single page (or use chunks that are multiple pages each)

Block 1
[13,15]
[a,b]
Block 2
[16],[c]

Lecture <2020-09-02 Wed>

Lecture <2020-09-09 Wed>

Tree depth calculation

Binary trees

  • each node has two children

1024 entries -> depth $$\lceil log_2(1024)\rceil = 10$$

B+-trees

  • each node is one block size

  • e.g., 256 children per node

1024 entries -> depth $$\lceil log_{256}(1024) \rceil = 2$$

Lecture <2020-09-14 Mon>

Lecture <2020-09-16 Wed>

Multi-key indexes

  • table R(A,B,C,D)

    • #rows = 10,000,000

    • #rows where A = 3: 1,000,000

    • #rows where B = 4: 2,000,000

    • #rows where A = 3 AND B = 4: 300

  • index on A and an index on B

1
2
3
SELECT *
FROM R
WHERE A = 3 AND B = 4
  • index A and B

Multi-key B-tree?

  • B-tree on (A,B)

A B C D
1 1 a b
1 2 a c
3 4 a d
  • Index 1: Keys (A,B):

    • "1,1"

    • "1,2"

    • "3,4"

  • Keys (B,A):

    • "1,1"

    • "2,1"

    • "4,3"

  • Can we use index 1 to answer a query

    • Queries on A,B: YES

    • Queries on A: YES, example A=3

    • Queries on B: NO, example B=2

Grid files

  • 2 dimensions with 100 values

    • number of directory entries?

  • 10 dimensions with 100 values

  • number of directory entries?

Lecture <2020-09-23 Wed>

1
2
3
4
5
  SELECT a
    FROM
        (SELECT b
           FROM R
          WHERE a < 15) sub

Lecture <2020-09-28 Mon>

Order-by

  • relations (bags and sets are unordered)

    • order-by cannot exist as an operator

  • order as data

A B
3 4
1 5
7 2
1
SELECT * FROM R ORDER BY A ASC;
  • query result

A B
1 5
3 4
7 2
  • algebra result (order by operator) $$\omega_{A \rightarrow position}(R)$$

A B position
3 4 2
1 5 1
7 2 3
1
2
3
4
  SELECT *
    FROM Employee
   ORDER BY salary
   LIMIT 3

$$\sigma_{position \leq 3}(\omega_{salary \to position}(Employee)$$

1
2
3
4
  SELECT *
    FROM Employee
   ORDER BY salary
  LIMIT 1;

$$q_{max} = \pi_{name, salary}(\alpha_{max(salary) \to X}(Employee) \bowtie_{X = salary} (Employee)))$$ $$q_{top-2} = \pi_{name, salary}(\alpha_{max(salary) \to X}(\pi_{salary}(Employee) - \alpha_{max(salary)}(Employee)) \bowtie_{X = salary} (Employee)))$$ $$q = q_{max} \cup q_{top-2}$$

1
2
3
  SELECT *
    FROM R
   WHERE EXISTS (SELECT * FROM S)
  • unnesting

  • decorrelation

1
2
3
4
5
6
  SELECT R.*
    FROM R, (SELECT DISTINCT 1 FROM S)

  SELECT R.*
    FROM R, (SELECT count(*) AS cnt FROM S) AS sub
  WHERE cnt > 0

$$\pi_{A}(\sigma_{A = 3}(R \times S))$$

$$(\pi_{A}(\sigma_{A=3}(R)) \times \pi_{}(S))$$

Lecture <2020-09-30 Wed>

A B size
1 asdasd 5
2 asdasdsadasdasd 10
3 asdasda 6

$$S(R) = 7$$

$$\sigma_{A = 1}(R)$$

A B size
1 asdasd 5
A B
1 2
1 2
3 4
4 5

V(R,A) = 3

$$\sigma_{A = 4}(R)$$

A B
4 5

multi-dimensional histogram

  • 10 buckets per attribute

  • $$n$$ and $$B$$ buckets, then n-dimensional histogram $$B^n$$ buckets.

Lecture <2020-10-07 Wed>

Lecture <2020-10-12 Mon>

Lecture <2020-10-14 Wed>

Question 1.1.6 (transitive closure with recursion)

  digraph {
  rankdir=LR;

  a1 -> a2;
  a1 -> a3;
  a2 -> a4;
  a4 -> a5;

  }

authorpub.jpg

author coauthor
a1 a2
a1 a3
a2 a4
a4 a5
  • direct co-authors

1
  SELECT * FROM published;
  • direct one step indirect co-authors

1
2
3
4
5
  SELECT * FROM published
  UNION
  SELECT p1.author, p2.coauthor
    FROM published p1, published p2
   WHERE p1.coauthor = p2.author;
  • indirect up to two step indirect co-authors

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
  WITH oneStep AS (
    SELECT * FROM published
  ),
  twoStep AS (
  SELECT p1.author, p2.coauthor
    FROM oneStep p1, published p2
   WHERE p1.coauthor = p2.author
  ),
  threeStep AS (
  SELECT p1.author, p2.coauthor
    FROM twoStep p1, published p2
   WHERE p1.coauthor = p2.author
  )
  SELECT * FROM oneStep
   UNION
  SELECT * FROM twoStep
   UNION
  SELECT * FROM threeStep;

recursive queries

  • user provides $$Q_{init} \cup Q_{step}$$

    • Initialization $$D_0 = Q_{init}(D)$$

    • Iteration 1: $$D_1 = Q_{step}(D_0)$$

  • Iteration $i$: $$D_{i+1} = Q_{step}(D_i)$$

  • Fix point computation: stop when $$D_{i} = D_{i-1}$$

Quiz 1 - 1.4.3 - result size estimation

  • query $$q = (student \bowtie_{name=student} \sigma_{hoursAttended =15}(attendsLecture) \bowtie_{lecture= title} lecture$$

subqueries

  • query Q1 $$q_1 = \sigma_{hoursAttended =15}(attendsLecture)$$

  • query Q2 $$q_2 = student \bowtie_{name=student} q_1$$

  • query Q3 $$q = q_2 \bowtie_{lecture=title} lecture$$

  • every subquery is just one operator

  • which might access other subqueries' results

  • ok, as long as we can estimate statistics for the results

Solve Subquery Q1

$$T(q_1) =\frac{T(attendsLecture)}{V(attendsLecture,hoursAttended)} = \frac{40,000}{100} = 400 $$ $$V(q_1, student) = min(V(attendsLecture,student),T(q_1)) = 400$$ $$V(q_1, lecture) = V(attendsLecture,lecture) = 8$$

Solve Subquery Q2

$$T(q_2) = \frac{T(students) \times T(q_1)}{max(V(student,name), V(q_1,student))} = \frac{8000 \times 400}{max(7500,400)} = \frac{8000 \times 400}{7500}$$ $$V(q_2,lecture) = V(q_1,lecture) = 8$$

Solve Final query Q

$$T(q) = \frac{T(q_2) \times T(lecture)}{max(V(q_2,lecture), V(lecture,title))} = \frac{8000 \times 400 \times 10}{7500 \times max(8,10)} = \frac{320000}{75} = 426.666666667 $$

Lecture <2020-10-19 Mon>

Exam Questions

  • exam duration: lecture duration

  • exam handed out at lecture start through blackboard

  • exam will be submitted to blackboard (send an email also just in case)

I/O estimation

$$R \bowtie S$$ $$B(R) = 10,000$$ $$B(S) = 30$$ $$M = 101$$

Block nested loop

$$ \lceil \frac{B(S)}{M-1} \rceil \times (min(B(S),M-1) + B(R))$$

$$ = \lceil \frac{30}{100} \rceil \times (30 + 10,000)$$ $$ = 1 \times 10,030 = 10,030$$

Merge join

  • are inputs sorted?

    • if yes

$$B(R) + B(S) = 10,000 + 30 = 10,030$$

  • if no

  • cost sorting R + cost of sorting S + B(R) + B(S)

  • sorting R $$2 \times B(R) \times (1 + \lceil \log_{M-1}(\frac{B(R)}{M}))$$ $$2 \times 10,000 \times (1 + \lceil \log_{100}(\frac{10,000}{101}))$$ $$ 20,000 \times (1 + 1) = 40,000$$

  • sorting S $$2 \times B(S) \times (1 + \lceil \log_{M-1}(\frac{B(S)}{M}))$$ $$2 \times 30 \times (1 + 0) = 60$$

  • total cost $$40,000 + 60 + 10,000 + 30 = 50,090$$

  • trick is if last merge phases can be done in memory in parallel, then we can save: $$2 \times (B(R) + B(S))$$

  • trick does not apply 101 for last merge of R and 30 for last merge of S

Hash join

  • if one of the tables fits into memory then (true for this example)

$$B(S) + B(R) = 10,000 + 30 = 10,030$$

  • if not, then we need

$$(B(R) + B(S)) (1 + 2 \times (\lceil \log_{M-1}(min(B(R),B(S))) \rceil - 1)$$

Lecture <2020-10-26 Mon>

why not use crossproduct always to avoid join reordering?

  • counter example

  • 10 tables $$R_i$$ each has 1000 rows

  • size of the cross produce $$1000^10 = 10^30$$ with using $$10^6$$ cores, $$10^24$$

Lecture <2020-10-28 Wed>

Lecture <2020-11-02 Mon>

Lecture <2020-11-04 Wed>

1
2
  T1: UPDATE EMP SET job = 'system analyst' WHERE job = 'programmer'
  T2: UPDATE EMP SET job = 'programmer' WHERE job = 'system analyst'

Lecture <2020-11-16 Mon>

Lecture <2020-11-23 Mon>

Last updated on 24 Aug 2020
Published on 24 Aug 2020