CS520 - Data Integration, Warehousing, and Provenance - 2023 Fall

Course Webpage for CS520 - 2023 Fall taught by Boris Glavic

Data Curation Project

Organization

You will apply the techniques learned in class to clean and integrate one or more real-world datasets. The data curation project will be done in the same groups as the paper review. You will have to:

  • Acquire or extract one or more real-world datasets for a domain of your choice.
  • Gain an understanding of the data and identify data quality issues.
  • Research tools that are suited for the data cleaning, integration, and extraction tasks that you need to apply.
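
As a starting point for identifying data quality issues, a first pass can simply count missing values, non-numeric entries, and duplicate rows per column. The following is a minimal Python sketch of that idea, not a prescribed method for the project: the function name `profile_rows`, the `MISSING_TOKENS` list, and the sample parking-ticket data are all hypothetical and would need to be adapted to your dataset. It assumes well-formed rows (no row has more fields than the header).

```python
import csv
import io
from collections import Counter

# Tokens treated as "missing". This list is an assumption; adapt it per dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "?"}


def profile_rows(rows):
    """Count missing values and non-numeric values per column, plus duplicate rows."""
    rows = list(rows)
    missing = Counter()
    non_numeric = Counter()
    seen = Counter()
    for row in rows:
        # Use the full row content as a key to detect exact duplicates.
        seen[tuple(sorted(row.items()))] += 1
        for column, value in row.items():
            text = (value or "").strip()
            if text.lower() in MISSING_TOKENS:
                missing[column] += 1
                continue
            try:
                float(text)
            except ValueError:
                non_numeric[column] += 1
    # Every occurrence beyond the first counts as one duplicate.
    duplicates = sum(count - 1 for count in seen.values())
    return {
        "rows": len(rows),
        "missing": dict(missing),
        "non_numeric": dict(non_numeric),
        "duplicate_rows": duplicates,
    }


# Hypothetical sample data, made up purely for illustration.
SAMPLE = """ticket_id,fine,ward
1,50,12
2,,N/A
3,fifty,7
1,50,12
"""

report = profile_rows(csv.DictReader(io.StringIO(SAMPLE)))
print(report)
```

A non-numeric count is only a problem for columns that are supposed to be numeric, so the report is a prompt for closer inspection rather than a list of confirmed errors.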

Deliverables

We will use Vizier, an open-source notebook system that is similar to Jupyter or Apache Zeppelin but provides additional functionality not found in these tools. We provide Vizier as a Docker image for you to use; see https://github.com/IITDBGroup/cs520/blob/master/vizier/README.md.

  • Install Docker and the Vizier Docker image and implement a basic data analysis task with Vizier (see Vizier Setup Homework).
  • An exported Vizier notebook (and any other code) committed to your group's git repository.
  • The notebook also serves as a report describing your data curation project.

Source Code and Dataset Management

The project will be completed in groups of three students. Groups will be determined in the first days of class. The groups will do both the project and paper reviews together.

For each group we create a repository and team on GitHub. Git is a distributed version control system. Good introductions to git are gitmagic and the official git documentation.

Once groups are finalized, you will receive an invitation to collaborate on a shared GitHub repository named cs520-f23-groupnumber. All your work in this class will be submitted via your shared private repository. For large files, consider using cloud storage. If the dataset is publicly available, then including a file with links in the repository is sufficient.

Presentation

The presentations will be given in a single block session on 11/30 via Zoom. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. You will find the detailed presentation schedule here once it has been set: Seminar Schedule. The presentation will be 10 minutes long and should cover the following:

  • Introduce the dataset(s), why you have chosen them, and how you have acquired them: What is the domain (e.g., Chicago parking data)? What are the characteristics (dataset size, number of attributes, data format)? …
  • Give an overview of the data quality problems you have identified in the data and the methodology and tools used to identify them.
  • Explain how you have overcome (or tried to overcome) these problems, what tools you used, and what the challenges were.

Ideas for datasets

Last updated on 1 Aug 2023