Course Overview
The class takes place Tuesdays and Thursdays, 10:00 am - 11:15 am, IIT Tower 1F6-1
This course introduces the basic concepts of data integration, data warehousing, and provenance. We will learn how to resolve structural heterogeneity through schema matching and mapping. The course introduces techniques for querying several heterogeneous data sources at once (data integration) and translating data between databases with different data representations (data exchange). Furthermore, we will cover the data-warehouse paradigm, including the Extract-Transform-Load (ETL) process, the data cube model and its relational representations (such as snowflake and star schema), and efficient processing of analytical queries. This will be contrasted with Big Data analytics approaches that (among other differences) significantly reduce the upfront cost of analytics. When feeding data through complex processing pipelines such as data exchange transformations or ETL workflows, it is easy to lose track of the origin of data. In the last part of the course we therefore cover techniques for representing and keeping track of the origin and creation process of data, also known as its provenance.
The course emphasizes practical skills through a series of homework assignments that help students develop a strong background in data integration systems and techniques. At the same time, it also addresses the underlying formalisms. For example, we will discuss the logic-based languages used for schema mapping and the dimensional data model, as well as their practical application (e.g., developing an ETL workflow with RapidMiner and creating a mapping between two example schemata). The literature reviews will familiarize students with data integration and provenance research.
Syllabus
The syllabus is available online: syllabus
Important Dates
- 08/31 - Deadline: Form groups
- 09/12 - Literature review - Select literature review paper
- 09/28 - Data Curation Project - Finish the Vizier First Steps Assignment
- 10/10 - Literature review - Read the paper and determine the structure of the report and meet with Prof./TA to discuss structure
- 10/10 - Data Curation Project - Select what dataset (sources to use) and discuss with Prof./TA
- 10/12 - Midterm exam
- 10/26 - Data Curation Project - Present the data quality problems you have identified and explain how you plan to solve them (meet with Prof./TA)
- 11/14 - Literature review and Data Curation Project - First draft of slides due for both and slide review meeting with Prof./TA. Show how you solved the data quality problems.
- 11/30 - Literature Review and Data Curation Project - In-class Presentations
- 11/30 - Literature review and Data Curation Project - Final versions of reports due
- 12/05 - Final exam
Instructor
Boris Glavic
- Email: bglavic@iit.edu
- Phone: 312 567 5205
- Office: Stuart Building, room 206b
- Office Hours: Tuesdays, 3:00 pm - 4:00 pm
- Webpage: http://www.cs.iit.edu/~dbgroup/members/bglavic.html
TAs
Pengyuan Li
- Email: pli26@hawk.iit.edu
- Phone: online only
- Office: SB012 and zoom: https://iit-edu.zoom.us/j/2391995075?pwd=WUx3eGNHVWpXWElhMUg5bERiNWw4UT09
- Office Hours: Tuesdays + Thursdays: 11:30 am - 12:30 pm
Chenjie Li
- Email: cli112@hawk.iit.edu
- Phone: online only
- Office: SB012 and zoom: https://iit-edu.zoom.us/j/89344897742?pwd=MllIZmJVZDNaSDIvQTZ1ZzFrVFoyQT09
- Office Hours: Mondays + Wednesdays: 11:30 am - 12:30 pm
Shubham Tiwari
- Email: stiwari10@hawk.iit.edu
- Phone: online only
- Office: SB 019 and https://meet.google.com/qfy-gmnz-qxx
- Office Hours: Tuesday + Thursday: 3:00 pm - 4:00 pm
Supplementary material
There is a GitHub repository for the course that will be updated with links and example code. The final curation project results will be uploaded to this repository.
More info will be added during the course.
Prerequisites
- Courses: CS425 or CS525
Grading Policies
Weighting of Deliverables
- data curation project: 20%
- literature review: 20%
- midterm exam: 30%
- final exam: 30%
Grading Scheme
Your final course grade is determined by your total score, which is calculated as the weighted sum of the points for each deliverable. The weights are shown above. For each deliverable you will receive between 0 and 100 points. For some deliverables, I give additional bonus points. These are not considered for the grade cutoffs. For instance, the first programming assignment is weighted 10%. For the sake of the example, assume that you get 110 points in this assignment (full points + bonus points); this assignment would then contribute \(0.1 \cdot 110 = 11\) points to your final score.
- A: > 80
- B: > 60
- C: > 50
- E: < 50
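The weighted-sum calculation can be sketched in a few lines of Python. This is only an illustration, not an official grade calculator: the function names and the example points are hypothetical, the weights are taken from the "Weighting of Deliverables" list, and treating a score of exactly 50 as an E is an assumption, since the cutoffs above do not cover that case.

```python
# Weights copied from the "Weighting of Deliverables" list.
WEIGHTS = {
    "data curation project": 0.20,
    "literature review": 0.20,
    "midterm exam": 0.30,
    "final exam": 0.30,
}


def total_score(points):
    """Weighted sum of the points per deliverable (0-100, more with bonus)."""
    return sum(WEIGHTS[name] * points[name] for name in WEIGHTS)


def letter_grade(score):
    """Map a total score to a letter using the cutoffs listed above."""
    if score > 80:
        return "A"
    if score > 60:
        return "B"
    if score > 50:
        return "C"
    # Assumption: a score of exactly 50 falls into E.
    return "E"


# Hypothetical points; 110 stands for full points plus bonus points.
example = {
    "data curation project": 110,
    "literature review": 70,
    "midterm exam": 85,
    "final exam": 75,
}

score = total_score(example)
print(round(score, 2), letter_grade(score))
```

Here the bonus points simply raise the contribution of the data curation project to 0.2 * 110 = 22 points, mirroring the worked example in the text.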
Reading Material
The book Principles of Data Integration is required reading. One of the database textbooks is also useful. All four textbooks have their merits, but any one should be sufficient as reading material if you are lacking background. Foundations of Databases can be useful for comprehending the more theoretical aspects of this course and is freely available online.
- Doan, Halevy, and Ives, Principles of Data Integration, 1st Edition, Morgan Kaufmann, 2012
- Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2003
- Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition, McGraw-Hill, 2002
- Silberschatz, Korth, and Sudarshan, Database System Concepts, 6th Edition, McGraw Hill, 2010
- Garcia-Molina, Ullman, and Widom, Database Systems: The Complete Book, 2nd Edition, Prentice Hall,
- Abiteboul, Hull, and Vianu, Foundations of Databases, Addison-Wesley, 1995
The slides for the course will be made available here.