Vizier
Funding
- NSF - REU Supplement for Collaborative Research: III: MEDIUM: U4U - Taming Uncertainty with Uncertainty-Annotated Databases (2021 - 2022), $16,000, PIs: Boris Glavic
- NSF - CIF21 DIBBs: EI: Vizier, Streamlined Data Curation (2017 - 2020), $2,725,699, PIs: Boris Glavic, Juliana Freire, Oliver Kennedy
- NSF - REU Supplement for CIF21 DIBBs: EI: Vizier, Streamlined Data Curation (2017 - 2020), $24,000, PIs: Boris Glavic, Juliana Freire, Oliver Kennedy
Publications
-
Overlay Spreadsheets
Oliver Kennedy, Boris Glavic and Michael Brachmann
Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA 2023, Seattle, WA, USA, 18 June 2023 (2023), pp. 4:1–4:7.
@inproceedings{KG23,
  author = {Kennedy, Oliver and Glavic, Boris and Brachmann, Michael},
  isworkshop = {true},
  keywords = {Vizier; Spreadsheets},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/KG23.pdf},
  projects = {Vizier},
  slides = {https://odin.cse.buffalo.edu/talks/2023-06-18-HILDA.html},
  doi = {10.1145/3597465.3605220},
  booktitle = {Proceedings of the Workshop on Human-In-the-Loop Data Analytics, {HILDA} 2023, Seattle, WA, USA, 18 June 2023},
  pages = {4:1--4:7},
  publisher = {{ACM}},
  title = {{Overlay Spreadsheets}},
  venueshort = {HILDA},
  year = {2023}
}
-
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds
Su Feng, Aaron Huber, Boris Glavic and Oliver Kennedy
Proceedings of the 46th International Conference on Management of Data (2021), pp. 528–540.
@inproceedings{FH21,
  author = {Feng, Su and Huber, Aaron and Glavic, Boris and Kennedy, Oliver},
  booktitle = {Proceedings of the 46th International Conference on Management of Data},
  keywords = {UA-DB; Vizier},
  pages = {528--540},
  doi = {10.1145/3448016.3452791},
  pdfurl = {https://dl.acm.org/doi/pdf/10.1145/3448016.3452791},
  projects = {Vizier; UA-DB},
  video = {https://www.youtube.com/watch?v=si2HUS7idEs&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq},
  title = {Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds},
  venueshort = {SIGMOD},
  reproducibility = {https://github.com/fengsu91/AUDB_Reproducibility},
  longversionurl = {https://arxiv.org/pdf/2102.11796},
  year = {2021}
}
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
-
Your notebook is not crumby enough, REPLace it
Michael Brachmann, William Spoth, Oliver Kennedy, Boris Glavic, Heiko Müller, Sonia Castelo, Carlos Bautista and Juliana Freire
Proceedings of the 10th Conference on Innovative Data Systems (2020).
@inproceedings{BS20,
  author = {Brachmann, Michael and Spoth, William and Kennedy, Oliver and Glavic, Boris and M\"{u}ller, Heiko and Castelo, Sonia and Bautista, Carlos and Freire, Juliana},
  title = {Your notebook is not crumby enough, REPLace it},
  booktitle = {Proceedings of the 10th Conference on Innovative Data Systems},
  year = {2020},
  projects = {Vizier},
  keywords = {Notebooks; Vizier},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/BS20.pdf},
  venueshort = {CIDR}
}
Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts to build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, no matter whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
-
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic and Oliver Kennedy
Proceedings of the 44th International Conference on Management of Data (2019), pp. 1313–1330.
@inproceedings{FH19,
  author = {Feng, Su and Huber, Aaron and Glavic, Boris and Kennedy, Oliver},
  booktitle = {Proceedings of the 44th International Conference on Management of Data},
  keywords = {UA-DB; Vizier},
  longversionurl = {https://arxiv.org/pdf/1904.00234},
  pages = {1313--1330},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/FH19.pdf},
  reproducibility = {https://github.com/IITDBGroup/UADB_Reproducibility},
  projects = {Vizier; UA-DB},
  video = {https://av.tib.eu/media/43062},
  doi = {10.1145/3299869.3319887},
  slideurl = {https://www.slideshare.net/lordPretzel/2019-sigmod-uncertainty-annotated-databases-a-lightweight-approach-for-approximating-certain-answers},
  title = {Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers},
  venueshort = {SIGMOD},
  year = {2019}
}
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
-
Data Debugging and Exploration with Vizier
Mike Brachmann, Carlos Bautista, Sonia Castelo, Su Feng, Juliana Freire, Boris Glavic, Oliver Kennedy, Heiko Müller, Rémi Rampin, William Spoth and Ying Yang
Proceedings of the 44th International Conference on Management of Data (Demonstration Track) (2019), pp. 1877–1880.
@inproceedings{BB19,
  author = {Brachmann, Mike and Bautista, Carlos and Castelo, Sonia and Feng, Su and Freire, Juliana and Glavic, Boris and Kennedy, Oliver and M{\"u}ller, Heiko and Rampin, R{\'e}mi and Spoth, William and Yang, Ying},
  booktitle = {Proceedings of the 44th International Conference on Management of Data (Demonstration Track)},
  date-modified = {2019-04-04 12:25:42 -0500},
  keywords = {Vizier},
  pages = {1877--1880},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/BB19.pdf},
  projects = {Vizier},
  video = {https://www.youtube.com/watch?v=c3ICB-17kRY&t=4s},
  doi = {10.1145/3299869.3320246},
  title = {Data Debugging and Exploration with Vizier},
  venueshort = {SIGMOD},
  year = {2019}
}
We present Vizier, a multi-modal data exploration and debugging tool. The system supports a wide range of operations by seamlessly integrating Python, SQL, and automated data curation and debugging methods. Using Spark as an execution backend, Vizier handles large datasets in multiple formats. Ease of use is attained through integration of a notebook with a spreadsheet-style interface and with visualizations that guide and support the user in the loop. In addition, native support for provenance and versioning enables collaboration and uncertainty management. In this demonstration, we will illustrate the diverse features of the system using several realistic data science tasks based on real data.
-
The Exception that Improves the Rule
Juliana Freire, Boris Glavic, Oliver Kennedy and Heiko Müller
SIGMOD Workshop on Human-In-the-Loop Data Analytics (2016).
@inproceedings{FG16,
  author = {Freire, Juliana and Glavic, Boris and Kennedy, Oliver and M\"{u}ller, Heiko},
  booktitle = {SIGMOD Workshop on Human-In-the-Loop Data Analytics},
  isworkshop = {true},
  keywords = {Vizier; Provenance; Data Cleaning},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/FG16.pdf},
  projects = {Vizier},
  title = {{The Exception that Improves the Rule}},
  venueshort = {HILDA},
  year = {2016},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/FG16.pdf}
}