CS520 - Data Integration, Warehousing, and Provenance - 2022 Spring

Course Webpage for CS520 - 2022 Spring taught by Boris Glavic

Literature Review

Organization

Students have to form groups of 3. Each group will read a research paper, write a report, and give a 20-minute presentation about the paper.

Presentation

The presentations will be given in a single block session on 04/29 on Zoom. You can find the detailed presentation schedule here: Seminar Schedule. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be announced once talks are assigned. Please read the links below on how to give a good presentation.

Report

You will have to write a report that summarizes the content of the paper, explains its main ideas in a way understandable by the other students in the course (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but avoid lengthy, verbose writing as well as reports that are too short to be comprehensible. Again, read some of the links below to get ideas about how to write a good paper or report.

Late policies:

  • 1-3 days late: -10% points
  • 4-7 days late: -20% points
  • more than 7 days late: 0 points

Schedule

We expect the deliverables according to the following deadlines:

  • 01/24 - Deadline: Form groups
  • 02/02 - Literature review - Select literature review paper
  • 03/02 - Literature review - Read the paper and determine the structure of the report and meet with Prof./TA to discuss structure
  • 04/13 - Literature review and Data Curation Project - First draft of slides due for both and slide review meeting with Prof./TA. Show how you solved the data quality problems.
  • 04/29 - Literature Review and Data Curation Project - In-class Presentations
  • 05/04 - Literature review and Data Curation Project - Final versions of reports due

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

How to write a scientific article:

  • Page on how to write a CS article. Also comments on some general writing rules.
  • Simon Peyton Jones's slides and video on how to write a great research paper

Literature Review Papers

You will have until 01/24 to form groups and until 02/02 to select which paper you want to review. We will send you a link to a form for voting on papers. You can access the PDFs of the papers on Google Drive (you need to log in with your IIT Google account): https://drive.google.com/drive/folders/14mFrJDCge_JTxj56d-dJdCFD7vXQ-4yh?usp=sharing. This semester you can select from the following papers:

Data Cleaning and Curation

  • Properties of Inconsistency Measures for Databases, Ester Livshits, Rina Kochirgan, Segev Tsur, Ihab F. Ilyas, Benny Kimelfeld, Sudeepa Roy, SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, 2021.
  • FastVer: Making Data Integrity a Commodity, Arvind Arasu, Badrish Chandramouli, Johannes Gehrke, Esha Ghosh, Donald Kossmann, Jonathan Protzenko, Ravi Ramamurthy, Tahina Ramananandro, Aseem Rastogi, Srinath T. V. Setty, Nikhil Swamy, Alexander van Renen, Min Xu, SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, 2021.
  • A SAT-Based System for Consistent Query Answering, Akhil A. Dixit, Phokion G. Kolaitis, Theory and Applications of Satisfiability Testing - SAT 2019 - 22nd International Conference, SAT 2019, Lisbon, Portugal, July 9-12, 2019, Proceedings, 2019.
  • A Hybrid Approach to Functional Dependency Discovery, Thorsten Papenbrock, Felix Naumann, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, 2016.
  • Lenses: An On-Demand Approach to ETL, Ying Yang, Niccolo Meneghetti, Ronny Fehling, Zhen Hua Liu, Oliver Kennedy, Proceedings of the VLDB Endowment, 2015.
  • AlphaClean: Automatic Generation of Data Cleaning Pipelines, Sanjay Krishnan, Eugene Wu, CoRR, 2019.
  • ABC of Order Dependencies, Pei Li, Jaroslaw Szlichta, Michael Böhlen, Divesh Srivastava, The VLDB Journal, 2021.

Integration, Matching, and Mappings

  • Reducing Ambiguity in JSON Schema Discovery, William Spoth, Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, Zhen Hua Liu, SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, 2021.
  • BEER: Blocking for Effective Entity Resolution, Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava, SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, 2021.
  • Finding Related Tables in Data Lakes for Interactive Data Science, Yi Zhang, Zachary G. Ives, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, 2020.
  • Dataset Discovery in Data Lakes, Alex Bogatu, Alvaro A.A. Fernandes, Norman W. Paton, Nikolaos Konstantinou, 36th IEEE International Conference on Data Engineering, ICDE 2020, 2020.
  • Aurum: A Data Discovery System, Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, Michael Stonebraker, 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, 2018.

Data Provenance

  • LIMA: Fine-Grained Lineage Tracing and Reuse in Machine Learning Systems, Arnab Phani, Benjamin Rath, Matthias Boehm, SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, 2021.
  • Equivalence-Invariant Algebraic Provenance for Hyperplane Update Queries, Pierre Bourhis, Daniel Deutch, Yuval Moskovitch, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, 2020.
  • PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models, Yinjun Wu, Val Tannen, Susan B. Davidson, Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, 2020.
  • Summarizing Provenance of Aggregate Query Results in Relational Databases, Omar AlOmeir, Eugenie Yujing Lai, Mostafa Milani, Rachel Pottinger, 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19-22, 2021, 2021.
  • Fine-Grained Lineage for Safer Notebook Interactions, Stephen Macke, Aditya G. Parameswaran, Hongpu Gong, Doris Jung Lin Lee, Doris Xin, Andrew Head, Proceedings of the VLDB Endowment, 2021.
  • Explaining Natural Language Query Results, Daniel Deutch, Nave Frost, Amir Gilad, The VLDB Journal, 2020.