CS 520: Data Integration, Warehousing, and Provenance - 2020 Spring

Organization

You will apply the techniques learned in class to clean and integrate one or more real-world datasets. The data curation project will be done in the same groups as the paper reviews. You will have to:

Deliverables

We will use Vizier, an open-source notebook system that is similar to Jupyter or Apache Zeppelin but provides additional functionality not found in these tools. Vizier is available at https://vizierdb.info/.

Source Code and Dataset Management

The project will be completed in groups of three students. Groups will be determined in the first days of class. The groups will do both the project and paper reviews together.

Each group will get its own Git repository on BitBucket. You do not need to create a repository yourself; we will create one for every group in the course. Git is a distributed version control system. Good introductions to Git are Git Magic and the official Git documentation.

Once groups are finalized, you will receive an invitation to collaborate on a shared BitBucket repository named cs525-s18-groupnumber. All your work in this class will be submitted via your shared private repository on BitBucket. For large files, consider using Git LFS or cloud storage. If the dataset is publicly available, it is sufficient to include a file with links to it in the repository.

Presentation

The presentations will be given in a single block session on TBA in room TBA. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be announced once talks are assigned. Each presentation will be 10 minutes long and should cover the following:

Ideas for datasets