CS520 - Data Integration, Warehousing, and Provenance
The class takes place: Monday + Wednesday, 10:00am - 11:15am, Wishnick Hall 116
- Email: firstname.lastname@example.org
- Phone: 312 567 5205
- Office: Stuart Building, room 206b (online at the beginning of the semester)
- Office Hours: Wednesdays, 11:45 am - 12:45 pm
- Webpage: http://www.cs.iit.edu/~dbgroup/members/bglavic.html
This course introduces the basic concepts of data integration, data warehousing, and provenance . We will learn how to resolve structural heterogeneity through schema matching and mapping. The course introduces techniques for querying several heterogeneous data sources at once ( data integration ) and translating data between databases with different data representations ( data exchange). Furthermore, we will cover the data-warehouse paradigm including the Extract-Transform-Load (ETL) process, the data cube model and its relational representations (such as snowflake and star schema), and efficient processing of analytical queries. This will be contrasted with Big Data analytics approaches that (besides other differences) significantly reduce the upfront cost of analytics. When feeding data through complex processing pipelines such as data exchange transformations or ETL workflows, it is easy to loose track of the origin of data. In the last part of the course we therefore cover techniques for representing and keeping track of the origin and creation process of data - aka its provenance .
The course is emphasizing practical skills through a series of homework assignments that help students develop a strong background in data integration systems and techniques. At the same time, it also addresses the underlying formalisms. For example, we will discuss the logic based languages used for schema mapping and the dimensional data model as well as their practical application (e.g., developing an ETL workflow with rapid miner and creating a mapping between two example schemata). The literature reviews will familiarize students with data integration and provenance research.
- CS 425 with minimum grade of C
The following text book will be helpful for following the course and studying the presented material.
- Doan, Halevy, and Ives, Principles of Data Integration, 1th Edition, Morgan Kaufmann, 2012
- Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2003
- Ramakrishnan and Gehrke, Database Management Systems, 3nd Edition, McGraw-Hill, 2002
- Silberschatz, Korth, and Sudarshan, Database System Concepts, 6th Edition, McGraw Hill, 2010
- Garcia-Molina, Ullman, and Widom, Database Systems: The Complete Book, 2nd Edition, Prentice Hall,
- Abiteboul, Hull, and Vianu, Foundations of Databases, , Addison-Wesley, 1995
Slides are available here.
The following topics will be covered in the course:
Introduction (3 hours)
- Heterogeneity of data
- Uncertainty and incompleteness
- Autonomous and distributed data sources – Structured vs. unstructured data
Preprocessing and Cleaning (4 hours)
- Entity resolution – Data fusion
Data Integration (10 hours)
- Mediated schemata and query rewrite – Schema matching
- Schema mappings
Data Exchange (5 hours)
- Data exchange transformations
- Universal solutions
Data Warehousing (10 hours)
- Extract-transform-load (ETL)
- Data cubes
- Star- and snowflake schemas
- Efficient analytics (OLAP) and relationship to transactional relational systems (OLTP)
Big Data Analytics (3 hours)
- Big Data analytics platforms and programming models
- Differences between Big Data analytics and traditional warehousing approaches – Big Data integration
Data Provenance (10 hours)
- Why- and Where-provenance
- Provenance polynomials
- Provenance in data integration – Provenance for missing answers
After attending the course students should:
- Understand the differences and similarities between the data integration/exchange, data warehouse, and Big Data analytics approaches
- Be able to build parts of a small data integration pipeline by “glueing” existing systems with new code
- Have learned formal languages for expressing schema mappings
- Understand the difference between virtual and materialized integration (data integration vs. data exchange)
- Understand the concept of data provenance and know how to compute provenance
The workload will consist of
- A midterm and a final exam covering the topics of the course.
- Several homework assignments. Over the course of the assignments the students will build a small data integration pipeline - starting from preprocessing, over schema matching and mapping, to transformation and query rewrite. These homework assignments are not graded.
- Review a research paper related to data integration, big data, or provenance.
- Curate a real world dataset and present the result.
Weighting of Deliverables
- data curation project: 20%
- literature review: 20%
- midterm exam: 30%
- final exam: 30%
Your final course grade is determined based on your total score which is calculated as the weighted sum of the points for each of the deliverables. The weights are as shown above. For each deliverable you will receive between
100 points. For some deliverables, I am giving additional bonus points. These are not considered for the grade cutoffs. For instance, the first programming assignment is weighted 10%. For sake of the example assume that you get 110 points in this assignment (full points + bonus points), then this assignment would contribute \(0.1 * 110 = 11\) points to your final score.
- A: > 80
- B: > 60
- C: > 50
- E: < 50
You are welcome to discuss assignments with classmates, but all final work must be your own. For details on what constitutes academic dishonesty, consult the university's Code of Academic Honesty at https://web.iit.edu/student-affairs/handbook/fine-print/code-academic-honesty. Any confirmed cases of academic dishonesty will be reported to email@example.com, and any work involved will, at the very least, will receive a reduction in grade deemed appropriate by the instructor.
Reasonable accommodations will be made for students with documented disabilities. In order to receive accommodations, students must obtain a letter of accommodation from the Center for Disability Resources and make an appointment to speak with the instructor as soon as possible. The Center for Disability Resources is located in the Life Sciences Building, room 218, tel:312-567-5744 or firstname.lastname@example.org.
IllinoisTech's Sexual Harassment and Discrimination Information
- Sexual harassment, sexual misconduct, and gender discrimination by any member of the Illinois Tech community is prohibited. This includes harassment among students, staff, or faculty. Sexual harassment by a faculty member or teaching assistant of a student over whom they have authority or by a supervisor of a member of the faculty or staff is particularly serious. Such conduct may easily create an intimidating, hostile, or offensive environment.
- Illinois Tech encourages anyone experiencing sexual harassment or sexual misconduct to speak with the Title IX Office for information on the resolution process and support options.
- You can file a complaint electronically at http://iit.edu/incidentreport, which may be completed anonymously. You may also file a complaint in-person by contacting the Title IX Coordinator, Virginia Foster at tel:312-567-5725 / mailto:email@example.com or the Deputy Title IX Coordinator tel:312-567-5726 / mailto:firstname.lastname@example.org.
- If you are not ready to file a formal complaint but wish to learn about your rights and options, you may contact Illinois Tech's Confidential Advisor service at tel:773-907-1062. You can also contact a licensed practitioner in Illinois Tech's Student Health and Wellness Center at tel:312-567-7550
- For a comprehensive list of resources regarding counseling services, medical assistance, legal assistance and visa and immigration services, you can visit the Title IX Office's website at https://web.iit.edu/hea/resources