Organization
You will apply the techniques learned in class to clean and integrate one or more real-world datasets. The data curation project will be done in the same groups as the paper review. You will have to:
- Acquire or extract one or more real-world datasets for a domain of your choice.
- Gain an understanding of the data and identify data quality issues.
- Research tools that are suited for the data cleaning, integration, and extraction tasks that you need to apply.
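As a starting point for the second step, a first pass over a dataset often just counts obvious problems. The following is a minimal sketch using only the Python standard library, not a required method. The sample data, the column names, and the `profile` helper are hypothetical and stand in for your own dataset and checks.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real dataset.
# In practice, read your own CSV file instead.
SAMPLE_CSV = """id,street,fine
1,N State St,50
2,n state st,50
3,W Madison St,
3,W Madison St,
4,,75
"""


def profile(rows):
    """Return simple data quality indicators for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []

    # Number of empty or whitespace-only values per column.
    missing = {
        c: sum(1 for r in rows if not (r[c] or "").strip()) for c in columns
    }

    # Number of groups of values per column that differ only by
    # letter case or surrounding whitespace (e.g. "N State St" vs "n state st").
    case_variants = {}
    for c in columns:
        groups = {}
        for r in rows:
            value = r[c] or ""
            if value.strip():
                groups.setdefault(value.strip().lower(), set()).add(value)
        case_variants[c] = sum(1 for g in groups.values() if len(g) > 1)

    # Number of rows that exactly repeat an earlier row.
    row_counts = Counter(tuple(r[c] for c in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in row_counts.values() if n > 1)

    return {
        "missing": missing,
        "case_variants": case_variants,
        "duplicate_rows": duplicate_rows,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for name, value in report.items():
    print(name, value)
```

Checks like these only surface candidates for closer inspection. Whether a missing value or a case variant is actually an error depends on the domain of your dataset.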
Deliverables
- Jupyter notebook (and other code) committed to your group's git repository.
- The notebook also serves as a report describing your data curation project.
Source Code and Dataset Management
The project will be completed in groups of three students. Groups will be determined in the first days of class. The groups will do both the project and paper reviews together.
Each group will get its own git repository on Bitbucket. You do not need to create a repository yourself; we will create a repository for every group in the course. Git is a distributed version control system. Good introductions to git are Git Magic and the official git documentation.
Once groups are finalized, you will receive an invitation to collaborate on a shared Bitbucket repository named cs525-s18-groupnumber. All your work in this class will be submitted via your shared private repository on Bitbucket. For large files, consider using Git LFS or cloud storage. If the dataset is publicly available, then including a file with links in the repository is sufficient.
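If you go the links-file route, a small script can make the datasets easy to re-fetch for every group member. The following is a minimal sketch, not a required layout: the one-URL-per-line file format, the `read_links` and `download_all` helpers, the `data` directory, and the example.org URLs are all assumptions made for illustration.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical contents of a links file committed to the repository.
# One URL per line; blank lines and lines starting with "#" are ignored.
EXAMPLE_LINKS = """# datasets used in the project (placeholder URLs)
https://example.org/data/parking.csv

https://example.org/data/streets.csv
"""


def read_links(text):
    """Return the URLs listed in the text of a links file."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def download_all(urls, target_dir="data"):
    """Download each URL into target_dir, which is created if needed."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in urls:
        # Use the last path component of the URL as the local file name.
        name = Path(urlparse(url).path).name or "dataset"
        urllib.request.urlretrieve(url, str(target / name))


urls = read_links(EXAMPLE_LINKS)
print(urls)
# To fetch the files, call download_all(urls) with real URLs.
```

If you download into a local directory like this, consider listing that directory in .gitignore so that the large files are not committed by accident.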
Presentation
The presentations will be given in a single block session on TBA in room TBA. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. The presentation will be 10 minutes long and should cover the following:
- Introduce the dataset(s), why you chose them, and how you acquired them: What is the domain (e.g., Chicago parking data)? What are the characteristics (dataset size, number of attributes, data format)? ...
- Give an overview of the data quality problems you have identified in the data and the methodology and tools you used to identify them.
- Explain how you overcame (or tried to overcome) these problems, which tools you used, and what challenges you encountered.
Ideas for datasets
- Open government initiatives, e.g., City of Chicago Data Portal
- Data extracted from the web or from web services, e.g., Twitter