Instructor
- Boris Glavic: http://cs.iit.edu/~glavic
- Email: bglavic@iit.edu
- Phone: 312.567.5205
- Office: Stuart Building, room 226C
- Office Hours: Wednesdays, 12:00-1:00pm
- Course Webpage: http://cs.iit.edu/~cs520
- DBGroup Webpage: http://cs.iit.edu/~dbgroup
- Google Group: https://groups.google.com/d/forum/cs520-2015-spring-group
TAs
- Eduardo Berrocal
- Email: eberroca@hawk.iit.edu
- Phone: 312.567.7518
- Office: Stuart Building, room 012
- Office Hours: Mondays and Wednesdays, 1:00pm
- Xi Zhang
- Email: xzhang22@hawk.iit.edu
- Phone:
- Office: Stuart Building, room 115
- Office Hours: Mondays and Wednesdays, 2:00pm - 4:00pm
Syllabus
- syllabus.pdf
Course Overview
3:15pm - 4:30pm, Mondays and Wednesdays, SB 104
This course introduces the basic concepts of data integration, data warehousing, and provenance. We will learn how to resolve structural heterogeneity through schema matching and mapping. The course introduces techniques for querying several heterogeneous data sources at once (data integration) and for translating data between databases with different data representations (data exchange). Furthermore, we will cover the data warehouse paradigm, including the extract-transform-load (ETL) process, the data cube model and its relational representations (such as the snowflake and star schemas), and efficient processing of analytical queries. This will be contrasted with Big Data analytics approaches that, among other differences, significantly reduce the upfront cost of analytics. When data is fed through complex processing pipelines such as data exchange transformations or ETL workflows, it is easy to lose track of its origin. In the last part of the course we therefore cover techniques for representing and keeping track of the origin and creation process of data, that is, its provenance.
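To make the cube model concrete: a minimal star schema stores measures in a fact table that references dimension tables, and analytical queries aggregate measures along dimension attributes. The following Python sketch illustrates this idea; all table and attribute names (sales, store_dim, date_dim, amount) are invented for this illustration and are not part of the course materials.

    # A minimal star schema in plain Python. All names are invented
    # for this illustration.
    store_dim = {1: {"city": "Chicago", "state": "IL"},
                 2: {"city": "Peoria", "state": "IL"}}
    date_dim = {10: {"month": "2015-01"}, 11: {"month": "2015-02"}}

    # Fact table: foreign keys into both dimensions plus one measure.
    sales = [(1, 10, 100.0), (2, 10, 40.0), (1, 11, 75.0)]

    # OLAP-style roll-up: total sales per (state, month).
    totals = {}
    for store_fk, date_fk, amount in sales:
        key = (store_dim[store_fk]["state"], date_dim[date_fk]["month"])
        totals[key] = totals.get(key, 0.0) + amount

    print(totals)  # {('IL', '2015-01'): 140.0, ('IL', '2015-02'): 75.0}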
The course emphasizes practical skills through a series of homework assignments that help students develop a strong background in data integration systems and techniques. At the same time, it also addresses the underlying formalisms. For example, we will discuss the logic-based languages used for schema mapping and the dimensional data model, as well as their practical application (e.g., developing an ETL workflow with RapidMiner and creating a mapping between two example schemata). The literature reviews will familiarize students with data integration and provenance research.
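As a preview of these mapping formalisms, a schema mapping is often written as a logical rule such as Person(n, a) -> exists i: Employee(i, n), stating that every source person must appear as a target employee with some (unknown) id. The sketch below applies this rule to a toy source instance, inventing labeled nulls for the missing ids; the schemas and the null-naming convention are assumptions made for this illustration.

    from itertools import count

    # Toy source instance for a hypothetical relation Person(name, age).
    person = [("Alice", 34), ("Bob", 27)]

    # Apply the mapping Person(n, a) -> exists i: Employee(i, n):
    # every source tuple yields a target tuple whose id is a fresh
    # labeled null N1, N2, ... (an unknown but fixed value).
    fresh = count(1)
    employee = [("N%d" % next(fresh), name) for name, _age in person]

    print(employee)  # [('N1', 'Alice'), ('N2', 'Bob')]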
Workload
Prerequisites
- Courses: CS 425 or CS 525
Reading Material
- The following textbook is required reading material for the course.
- Doan, Halevy, and Ives. Principles of Data Integration, 1st Edition, Morgan Kaufmann, 2012
- Having an introductory database book at hand will be helpful. Here are some suggestions:
- Elmasri and Navathe, Fundamentals of Database Systems, 6th Edition, Addison-Wesley, 2010
- Ramakrishnan and Gehrke, Database Management Systems, 3rd Edition, McGraw-Hill, 2002
- Silberschatz, Korth, and Sudarshan, Database System Concepts, 6th Edition, McGraw-Hill, 2010
- Garcia-Molina, Ullman, and Widom, Database Systems: The Complete Book, 2nd Edition, Prentice Hall, 2008
- The slides for the course will be made available on the course schedule page
Grading Policies
- Literature Review: 40%
- Final Exam: 60%
Detailed Course Topics
- Introduction
- Heterogeneity of data
- Uncertainty and incompleteness
- Autonomous and distributed data sources
- Structured vs. unstructured data
- Preprocessing and Cleaning
- Entity resolution
- Data fusion
- Cleaning
- Data Integration
- Mediated schemata and query rewrite
- Schema matching
- Schema mappings
- Data Exchange
- Data exchange transformations
- Universal solutions
- Data Warehousing
- Extract-transform-load (ETL)
- Data cubes
- Star- and snowflake schemas
- Efficient analytics (OLAP) and relationship to transactional relational systems (OLTP)
- Big Data Analytics
- Big Data analytics platforms and programming models
- Differences between Big Data analytics and traditional warehousing approaches
- Big Data integration
- Data Provenance
- Why- and Where-provenance
- Provenance polynomials (a small worked example follows this list)
- Provenance in data integration
- Provenance for missing answers
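As a small worked example for the provenance polynomials topic above: annotate every input tuple with a variable; the polynomial of a query result multiplies the annotations used jointly in one derivation and adds the polynomials of alternative derivations. The sketch below does this for the join Q(x, z) :- R(x, y), S(y, z), with invented relations and annotations.

    # Provenance polynomials for the join Q(x, z) :- R(x, y), S(y, z).
    # Each input tuple carries a provenance variable (r1, r2, s1, s2).
    R = [(("a", "b"), "r1"), (("a", "c"), "r2")]
    S = [(("b", "d"), "s1"), (("c", "d"), "s2")]

    # Multiplication = joint use in one derivation; addition = alternatives.
    prov = {}
    for (x, y), r_ann in R:
        for (y2, z), s_ann in S:
            if y == y2:
                term = r_ann + "*" + s_ann
                if (x, z) in prov:
                    prov[(x, z)] += " + " + term
                else:
                    prov[(x, z)] = term

    print(prov)  # {('a', 'd'): 'r1*s1 + r2*s2'}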
Course Objectives
After attending the course, students should be able to:
- Understand the problems that arise when querying heterogeneous and autonomous data sources
- Understand the differences and similarities between the data integration/exchange, data warehousing, and Big Data analytics approaches
- Build parts of a small data integration pipeline by "gluing" existing systems together with new code
- Apply formal languages for expressing schema mappings
- Understand the difference between virtual and materialized integration (data integration vs. data exchange)
- Understand the concept of data provenance and know how to compute provenance