Syllabus
Course Description
With the ever-increasing amount of digital information comes an increasing need to understand "where" a piece of data (a data item) comes from, "why" it is in the result of a data transformation, and "how" the transformation produced it. For example, biologists use complex digital workflows and simulations to gain new insights from measured and derived data. The result of a complex workflow is meaningless without information about how it was produced and from which input data. This type of information, i.e., information about the creation process and origin of data, is called data provenance.
Systems that automatically track provenance information for data produced by, e.g., workflows or SQL queries are becoming more and more important. Data provenance is an emerging technology that is used, e.g., to trace errors in transformed data back to their origin or to gain additional insights about the data. This course introduces several models of provenance developed for domains such as databases and workflow systems. We will cover approaches for automatically tracking provenance and study query languages and storage mechanisms for provenance information. Furthermore, we will discuss real systems that generate provenance data. This course gives students the opportunity to learn about a hot topic in database research and to work with novel research prototype provenance systems.
Textbooks
No textbook is required. Required reading will consist of research publications that are available online and will be linked on the course schedule page.
Detailed Course Topics
- Introduction to data provenance
  - What is data provenance?
  - Why do we need it?
  - Understanding different types of data provenance
- Database provenance
  - Approaches for automatically tracking provenance
  - Provenance models and systems
    - Why-provenance
    - Where-provenance and the DBNotes system
    - Lineage and the WHIPS prototype
    - Witness-list semantics and Perm
    - Provenance semirings and Orchestra
    - Causality and responsibility models
  - Storage mechanisms
  - Query languages
- Extensions of the provenance concept
  - Provenance for missing answers
  - Provenance for past queries
  - Provenance for updates
- Beyond database provenance
  - Scientific workflows
  - Provenance in the operating system context
  - Connection with dataflow analysis in programming languages
Grading Policies
- Course project (implementation, written report, oral presentation): 60%
- Paper reviews: written review (15%) and oral presentation (15%)
- Participation in the paper discussions: 10%
Course Objectives
After attending the course, students should be able to:
- Understand the concept of data provenance
- Understand different types of data provenance, their application domains, and their relationships to each other
- Understand data provenance models; know their advantages and limitations
- Understand generation and storage mechanisms for data provenance
- Build or extend a system for tracking data provenance
- Read, understand, and summarize a research paper