CS 595 - info

Syllabus

Course Description

With the ever-increasing amount of digital information comes a growing need to understand "where" a piece of data (a data item) comes from, "why" it appears in the result of a data transformation, and "how" the transformation produced it. For example, biologists use complex digital workflows and simulations to gain new insights from measurement data and derived data. The result of a complex workflow is meaningless without information about how it was produced and from which input data. This type of information, i.e., information about the creation process and origin of data, is called data provenance.

Systems that automatically track provenance information for data produced by, e.g., workflows or SQL queries are becoming increasingly important. Data provenance is an emerging technology that is used, e.g., to trace errors in transformed data back to their origin or to gain additional insights about the data. This course introduces several models of provenance developed for domains such as databases and workflow systems. We will cover approaches for automatically tracking provenance and study query languages and storage mechanisms for provenance information. Furthermore, we will discuss real systems that generate provenance data. The course gives students the opportunity to learn about a hot topic in database research and to work with novel research prototype provenance systems.
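
To make the "where" and "why" questions above concrete, the following is a minimal sketch of lineage-style provenance tracking for a selection and a join. It is an illustration only, not the method of any system covered in the course: the helper names (scan, select, join), the Annotated type, and the sample employee and department rows are all hypothetical.

```python
from typing import NamedTuple


class Annotated(NamedTuple):
    row: tuple            # the data values of the tuple
    lineage: frozenset    # identifiers of the input tuples that contributed


def scan(name, rows):
    """Tag each input row with its own identifier, e.g. ('emp', 0)."""
    return [Annotated(r, frozenset({(name, i)})) for i, r in enumerate(rows)]


def select(rel, pred):
    """Selection keeps a tuple's lineage unchanged."""
    return [t for t in rel if pred(t.row)]


def join(left, right, on):
    """A joined tuple depends on both inputs, so the lineages are unioned."""
    return [Annotated(l.row + r.row, l.lineage | r.lineage)
            for l in left for r in right if on(l.row, r.row)]


# Hypothetical sample data, made up for this sketch.
employees = scan("emp", [("alice", "db"), ("bob", "os")])
depts = scan("dept", [("db", "Chicago"), ("os", "Boston")])

joined = join(employees, depts, on=lambda e, d: e[1] == d[0])
chicago = select(joined, lambda row: row[3] == "Chicago")

for t in chicago:
    print(t.row, "was derived from", sorted(t.lineage))
```

Each result tuple carries the identifiers of the input tuples it was derived from, so an error in a result can be traced back to specific inputs. The provenance models listed in the course topics refine this basic idea in different ways.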

Textbooks

No textbook is required. Required reading will consist of research publications that are available online and will be linked on the course schedule page.

Detailed Course Topics

Introduction to data provenance

What is data provenance?
Why do we need it?
Understanding different types of data provenance

Database Provenance

Approaches for automatically tracking provenance
Provenance models and systems

Why-provenance
Where-provenance and the DBNotes system
Lineage and the WHIPS prototype
Witness-list semantics and Perm
Provenance semirings and Orchestra
Causality and responsibility models

Storage mechanisms
Query languages

Extensions of the provenance concept

Provenance for missing answers
Provenance for past queries
Provenance for updates

Beyond database provenance

Scientific workflows
Provenance in the operating system context
Connection with dataflow analysis in programming languages

Grading Policies

Course Project (Implementation, written report, oral presentation): 60%
Paper reviews: written review (15%) and oral presentation (15%)
Participation in the paper discussions: 10%

Course Objectives

After attending the course, students should be able to:

Understand the concept of data provenance
Understand different types of data provenance, their application domains, and their relationship to each other
Understand data provenance models; know their advantages and limitations
Understand generation and storage mechanisms for data provenance
Build or extend a system for tracking data provenance
Read, understand, and summarize a research paper