CS520 - Data Integration, Warehousing, and Provenance

Course Overview

The class takes place: Monday + Wednesday, 10:00am - 11:15am, Wishnick Hall 116

This course introduces the basic concepts of data integration, data warehousing, and provenance. We will learn how to resolve structural heterogeneity through schema matching and mapping. The course introduces techniques for querying several heterogeneous data sources at once (data integration) and translating data between databases with different data representations (data exchange). Furthermore, we will cover the data-warehouse paradigm, including the Extract-Transform-Load (ETL) process, the data cube model and its relational representations (such as snowflake and star schema), and efficient processing of analytical queries. This will be contrasted with Big Data analytics approaches that (among other differences) significantly reduce the upfront cost of analytics. When feeding data through complex processing pipelines such as data exchange transformations or ETL workflows, it is easy to lose track of the origin of data. In the last part of the course we therefore cover techniques for representing and keeping track of the origin and creation process of data, also known as its provenance.

The course emphasizes practical skills through a series of homework assignments that help students develop a strong background in data integration systems and techniques. At the same time, it also addresses the underlying formalisms. For example, we will discuss the logic-based languages used for schema mapping and the dimensional data model, as well as their practical application (e.g., developing an ETL workflow with RapidMiner and creating a mapping between two example schemata). The literature reviews will familiarize students with data integration and provenance research.

Syllabus

The syllabus is available online: syllabus

Important Dates

01/24 - Deadline: Form groups
02/02 - Literature review - Select literature review paper
03/02 - Literature review - Read the paper and determine the structure of the report and meet with Prof./TA to discuss structure
03/02 - Data Curation Project - Select what dataset (sources to use) and discuss with Prof./TA
03/07 - Midterm exam
03/28 - Data Curation Project - Present the data quality problems you have identified and explain how you plan to solve them (meet with Prof./TA)
04/13 - Literature review and Data Curation Project - First draft of slides due for both and slide review meeting with Prof./TA. Show how you solved the data quality problems.
04/29 - Literature Review and Data Curation Project - In-class Presentations
05/04 - Literature review and Data Curation Project - Final versions of reports due
05/04 - Final exam

Instructor

Boris Glavic

Email: bglavic@iit.edu
Phone: 312 567 5205
Office: Stuart Building, room 206b (online at the beginning of the semester)
Office Hours: Wednesdays, 11:45 am - 12:45 pm
Webpage: http://www.cs.iit.edu/~dbgroup/members/bglavic.html

TAs

Pengyuan Li

Email: pli26@hawk.iit.edu
Phone: online only
Office: https://us02web.zoom.us/j/2391995075?pwd=WUx3eGNHVWpXWElhMUg5bERiNWw4UT09
Office Hours: Tuesdays + Thursdays, 1:00 pm - 2:00 pm

Supplementary material

There is a GitHub repository for the course that will be updated with links and example code. The final curation project results will be uploaded to this repository.

https://github.com/IITDBGroup/cs520

More info will be added during the course.

Workload

Prerequisites

Courses: CS425 or CS525

Grading Policies

Weighting of Deliverables

data curation project: 20%
literature review: 20%
midterm exam: 30%
final exam: 30%

Grading Scheme

Your final course grade is determined by your total score, which is calculated as the weighted sum of the points for each of the deliverables. The weights are as shown above. For each deliverable you will receive between 0 and 100 points. For some deliverables, I give additional bonus points. These are not considered for the grade cutoffs. For instance, suppose a programming assignment is weighted 10%. For the sake of the example, assume that you get 110 points in this assignment (full points + bonus points); then this assignment would contribute \(0.1 * 110 = 11\) points to your final score.

A: > 80
B: > 60
C: > 50
E: < 50
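
The weighted sum described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not an official grade calculator: the function names and the example point values are hypothetical, and only the weights are taken from the "Weighting of Deliverables" list. The mapping from total score to letter grade is left out on purpose.

```python
# Weights in percent, copied from "Weighting of Deliverables".
WEIGHTS_PCT = {
    "data curation project": 20,
    "literature review": 20,
    "midterm exam": 30,
    "final exam": 30,
}


def contribution(weight_pct, points):
    """Points a single deliverable adds to the total score (hypothetical helper)."""
    return weight_pct * points / 100


def total_score(points_by_deliverable):
    """Weighted sum over all deliverables (hypothetical helper)."""
    return sum(
        contribution(weight, points_by_deliverable[name])
        for name, weight in WEIGHTS_PCT.items()
    )


# The example from the text: a 10% deliverable with 110 points (100 + bonus).
print(contribution(10, 110))  # 11.0

# Made-up points for one student, purely for illustration.
example_points = {
    "data curation project": 90,
    "literature review": 85,
    "midterm exam": 70,
    "final exam": 80,
}
print(total_score(example_points))  # 80.0
```

Working with integer percentages and dividing by 100 at the end avoids most floating-point rounding surprises in the individual products.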

Reading Material

The book Principles of Data Integration is required reading. One of the database textbooks listed below is also useful. All four database textbooks have their merits, but any one of them should be sufficient as reading material if you are lacking background. Foundations of Databases can be useful for understanding the more theoretical aspects of this course and is freely available online.

Doan, Halevy, and Ives, Principles of Data Integration, 1st Edition, Morgan Kaufmann, 2012
Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2003
Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition, McGraw-Hill, 2002
Silberschatz, Korth, and Sudarshan, Database System Concepts, 6th Edition, McGraw-Hill, 2010
Garcia-Molina, Ullman, and Widom, Database Systems: The Complete Book, 2nd Edition, Prentice Hall
Abiteboul, Hull, and Vianu, Foundations of Databases, Addison-Wesley, 1995

The slides for the course will be made available here.

Last updated on 28 Dec 2021
Published on 28 Dec 2021