CS520 - Data Integration, Warehousing, and Provenance - 2022 Spring

Course Webpage for CS520 - 2022 Spring taught by Boris Glavic

Course Overview

The class takes place: Monday + Wednesday, 10:00am - 11:15am, Wishnick Hall 116

This course introduces the basic concepts of data integration, data warehousing, and provenance. We will learn how to resolve structural heterogeneity through schema matching and mapping. The course introduces techniques for querying several heterogeneous data sources at once (data integration) and for translating data between databases with different data representations (data exchange). Furthermore, we will cover the data warehouse paradigm, including the Extract-Transform-Load (ETL) process, the data cube model and its relational representations (such as the snowflake and star schemas), and efficient processing of analytical queries. This will be contrasted with Big Data analytics approaches that (among other differences) significantly reduce the upfront cost of analytics. When feeding data through complex processing pipelines such as data exchange transformations or ETL workflows, it is easy to lose track of the origin of data. In the last part of the course we therefore cover techniques for representing and keeping track of the origin and creation process of data, also known as its provenance.

The course emphasizes practical skills through a series of homework assignments that help students develop a strong background in data integration systems and techniques. At the same time, it also addresses the underlying formalisms. For example, we will discuss the logic-based languages used for schema mapping and the dimensional data model, as well as their practical application (e.g., developing an ETL workflow with RapidMiner and creating a mapping between two example schemata). The literature reviews will familiarize students with data integration and provenance research.

Syllabus

The syllabus is available online: syllabus

Instructor

Boris Glavic

TAs

Pengyuan Li

  • Email: pli26@hawk.iit.edu
  • Phone: online only
  • Office: https://us02web.zoom.us/j/2391995075?pwd=WUx3eGNHVWpXWElhMUg5bERiNWw4UT09
  • Office Hours: Tuesdays + Thursdays, 1:00 pm - 2:00 pm

Supplementary material

There is a GitHub repository for the course that will be updated with links and example code. The results of the final curation project will be uploaded to this repository.

More info will be added during the course.

Prerequisites

  • Courses: CS425 or CS525

Grading Policies

Weighting of Deliverables

  • data curation project: 20%
  • literature review: 20%
  • midterm exam: 30%
  • final exam: 30%

Grading Scheme

Your final course grade is determined by your total score, which is calculated as the weighted sum of the points for each of the deliverables. The weights are as shown above. For each deliverable you will receive between 0 and 100 points. For some deliverables, I give additional bonus points. These are not considered for the grade cutoffs. As an example, suppose a deliverable is weighted 10% and you receive 110 points for it (full points + bonus points); this deliverable would then contribute \(0.1 \times 110 = 11\) points to your final score.

  • A: > 80
  • B: > 60
  • C: > 50
  • E: < 50
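
The calculation described above can be sketched in a few lines of Python. This is only an illustration, not an official grading tool: the function names and the example points are hypothetical, and the handling of a score of exactly 50 (which the cutoff list does not cover) is an assumption.

```python
# Weights from the "Weighting of Deliverables" list, as fractions of the final score.
WEIGHTS = {
    "data curation project": 0.20,
    "literature review": 0.20,
    "midterm exam": 0.30,
    "final exam": 0.30,
}

# Cutoffs from the list above, checked from highest to lowest.
CUTOFFS = [("A", 80), ("B", 60), ("C", 50)]


def total_score(points):
    """Weighted sum of the points per deliverable (may exceed 100 with bonus points)."""
    return sum(WEIGHTS[name] * points[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a total score to a letter grade using strict '>' comparisons.

    Assumption: a score that clears no cutoff, including exactly 50,
    is reported as E.
    """
    for letter, cutoff in CUTOFFS:
        if score > cutoff:
            return letter
    return "E"


# Hypothetical points for one student; 110 means full points plus 10 bonus points.
example_points = {
    "data curation project": 110,
    "literature review": 85,
    "midterm exam": 70,
    "final exam": 90,
}
example_total = total_score(example_points)
print(round(example_total, 2), letter_grade(example_total))
```

With these hypothetical points, the curation project alone contributes 0.2 × 110 = 22 points to the total.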

Reading Material

The book Principles of Data Integration is required reading. One of the general database textbooks listed below is also useful. All four of these textbooks have their merits, but any one of them should be sufficient as reading material if you are lacking background. Foundations of Databases can be useful for comprehending the more theoretical aspects of this course and is freely available online.

  • Doan, Halevy, and Ives, Principles of Data Integration, 1st Edition, Morgan Kaufmann, 2012
  • Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2003
  • Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition, McGraw-Hill, 2002
  • Silberschatz, Korth, and Sudarshan, Database System Concepts, 6th Edition, McGraw Hill, 2010
  • Garcia-Molina, Ullman, and Widom, Database Systems: The Complete Book, 2nd Edition, Prentice Hall,
  • Abiteboul, Hull, and Vianu, Foundations of Databases, Addison-Wesley, 1995

The slides for the course will be made available here.

Last updated on 28 Dec 2021
Published on 28 Dec 2021