Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, regardless of whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
@inproceedings{BS20,
  author = {Brachmann, Michael and Spoth, William and Kennedy, Oliver and Glavic, Boris and M\"{u}ller, Heiko and Castelo, Sonia and Bautista, Carlos and Freire, Juliana},
  title = {Your notebook is not crumby enough, REPLace it},
  booktitle = {Proceedings of the 10th Conference on Innovative Data Systems},
  year = {2020},
  projects = {Vizier},
  keywords = {Notebooks; Vizier},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/BS20.pdf},
  venueshort = {CIDR}
}