CS 520: Data Integration, Warehousing, and Provenance - 2016 Spring

Organization

Students have to form groups of TBA, and each group has to read a research paper, write a report, and give a 20-minute presentation about the paper.

Presentation

The presentations will be given in a single block session on April 23 in room SB 111. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. Please read the links below on how to give a good presentation.

Schedule:

Detailed schedule:

Time   Group  Title
Data cleaning and preprocessing
09:00  1      Automating the approximate record-matching process
09:20  8      Text classification from labeled and unlabeled documents using EM
09:40  11     NADEEF: a commodity data cleaning system
10:00  9      Combining quantitative and logical data cleaning
10:20  18     Declarative data cleaning: language, model, and algorithms
10:40  19     Descriptive and prescriptive data cleaning
11:00  23     Declarative support for sensor data cleaning
Integration, Matching, and Mappings
11:20  14     Efficiently executing information-gathering plans
11:40  7      Clio: Schema Mapping Creation and Data Exchange
12:00         Lunch break
01:00  5      Integrating conflicting data: the role of source dependence
01:20  4      Generic Schema Matching with Cupid
Data Warehousing
01:40  10     Lenses: an on-demand approach to ETL
02:00  16     Algorithms for materialized view design in data warehousing environment
02:20  3      On-line warehouse view maintenance
Big Data
02:40  6      AsterixDB: a scalable, open source BDMS
03:00  12     All roads lead to Rome: optimistic recovery for distributed iterative data processing
03:20  15     On the design and scalability of distributed shared-data databases
03:40  17     Spinning fast iterative data flows
04:00  21     A practical scalable distributed B-tree
04:20  20     Hyracks: a flexible and extensible foundation for data-intensive computing
04:40  2      Spark SQL: relational data processing in Spark
Provenance
05:00  22     Looking inside the black-box: capturing data provenance using dynamic instrumentation
05:20  13     Trio: A System for Managing Data, Uncertainty, and Lineage

Report

You will have to write a report that summarizes the content of the paper, explains its main ideas in a way the other students in the course can understand (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but avoid lengthy, verbose writing as well as reports that are too short to be comprehensible. Again, read some of the links below to get ideas about how to write a good paper or report.

Late policies:

Time schedule

We expect the deliverables according to the following deadlines:

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

How to write a scientific article:

Literature Review Papers

The papers for the literature review part of the course are listed below. You have until 01/30 to form groups. You will be able to vote on papers from 01/30 10:00am until 02/01 1:00pm. We will send you a link to a form.

Data cleaning and preparation

Entity resolution

Data fusion

Schema matching and mapping

Query answering with views and virtual data integration

Data exchange

Data warehousing

Big Data analytics

Provenance