Uncertainty-Annotated Databases
Uncertainty arises naturally in many application domains due to data entry errors, sensor errors and noise, uncertainty in information extraction and parsing, ambiguity from data integration, and heuristic data wrangling. Analyzing uncertain data without accounting for its uncertainty can create hard to trace errors with severe real world implications.
Incomplete and probabilistic database techniques are principled methods for dealing with uncertain data. However, the class of queries that can be answered efficiently over such databases is severely limited. In this project, we study a light-weight, practical alternative called uncertainty-annotated databases (UA-DBs). UA-DBs accommodate a wide range of uncertain data sources, are compatible with many existing incomplete and probabilistic data models, and, most importantly, query answering is efficient for a very large class of queries.
Implementations
We currently maintain two implementations of the UA-DB model. mimir-caveats written in Scala extends Apache Spark with uncertainty management capabilities and is the driver of uncertainty handling in our Vizier notebook system. Additionally, we have developed an implementation in GProm.
Collaborators
- Atri Rudra - SUNY Buffalo
- Oliver Kennedy - SUNY Buffalo
Funding
- NSF - Collaborative Research: III: MEDIUM: U4U - Taming Uncertainty with Uncertainty-Annotated Databases (2020 - 2024), $999,492, PIs: Boris Glavic Oliver Kennedy Atri Rudra
Publications
-
Efficient Management of Uncertain Data
Su Feng
Illinois Institute of Technology.@phdthesis{F23, venueshort = {PhD Thesis}, author = {Feng, Su}, keywords = {Uncertainty; UA-DB}, month = may, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/F23.pdf}, projects = {GProM; Vizier; UA-DB}, school = {Illinois Institute of Technology}, title = {Efficient Management of Uncertain Data}, year = {2023} }
-
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds
Su Feng, Aaron Huber, Boris Glavic and Oliver Kennedy
Proceedings of the 46th International Conference on Management of Data (2021), pp. 528–540.@inproceedings{FH21, author = {Feng, Su and Huber, Aaron and Glavic, Boris and Kennedy, Oliver}, booktitle = {Proceedings of the 46th International Conference on Management of Data}, keywords = {UA-DB; Vizier}, pages = {528 – 540}, doi = {10.1145/3448016.3452791}, pdfurl = {https://dl.acm.org/doi/pdf/10.1145/3448016.3452791}, projects = {Vizier; UA-DB}, video = {https://www.youtube.com/watch?v=si2HUS7idEs&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq}, title = {Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds}, venueshort = {SIGMOD}, reproducibility = {https://github.com/fengsu91/AUDB_Reproducibility}, longversionurl = {https://arxiv.org/pdf/2102.11796}, year = {2021} }
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed.We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
-
Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers
Su Feng, Aaron Huber, Boris Glavic and Oliver Kennedy
Proceedings of the 44th International Conference on Management of Data (2019), pp. 1313–1330.@inproceedings{FH19, author = {Feng, Su and Huber, Aaron and Glavic, Boris and Kennedy, Oliver}, booktitle = {Proceedings of the 44th International Conference on Management of Data}, keywords = {UA-DB; Vizier}, longversionurl = {https://arxiv.org/pdf/1904.00234}, pages = {1313-1330}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/FH19.pdf}, reproducibility = {https://github.com/IITDBGroup/UADB_Reproducibility}, projects = {Vizier; UA-DB}, video = {https://av.tib.eu/media/43062}, doi = {10.1145/3299869.3319887}, slideurl = {https://www.slideshare.net/lordPretzel/2019-sigmod-uncertainty-annotated-databases-a-lightweight-approach-for-approximating-certain-answers}, title = {Uncertainty Annotated Databases - A Lightweight Approach for Approximating Certain Answers}, venueshort = {SIGMOD}, year = {2019} }
Certain answers are a principled method for coping with uncertainty that arises in many practical data management tasks. Unfortunately, this method is expensive and may exclude useful (if uncertain) answers. Thus, users frequently resort to less principled approaches to resolve uncertainty. In this paper, we propose Uncertainty Annotated Databases (UA-DBs), which combine an under- and over-approximation of certain answers to achieve the reliability of certain answers, with the performance of a classical database system. Furthermore, in contrast to prior work on certain answers, UA-DBs achieve a higher utility by including some (explicitly marked) answers that are not certain. UA-DBs are based on incomplete K-relations, which we introduce to generalize the classical set-based notion of incomplete databases and certain answers to a much larger class of data models. Using an implementation of our approach, we demonstrate experimentally that it efficiently produces tight approximations of certain answers that are of high utility.
-
Analyzing Uncertain Tabular Data
Oliver Kennedy and Boris Glavic
Information Quality in Information Fusion and Decision Making
Éloi Bossé and G. Rogova, eds. Springer. 291–320.@incollection{KG19, author = {Kennedy, Oliver and Glavic, Boris}, booktitle = {Information Quality in Information Fusion and Decision Making}, doi = {10.1007/978-3-030-03643-0_12}, editor = {\'{E}loi Boss\'{e} and Rogova, Galina}, keywords = {Uncertainty; UA-DB}, pages = {291-320}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/KG19.pdf}, projects = {Mimir; UA-DB; Vizier}, publisher = {Springer}, title = {Analyzing Uncertain Tabular Data}, venueshort = {Information Quality in Information Fusion and Decision Making}, year = {2019} }
It is common practice to spend considerable time refining source data to address issues of data quality before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records representing the same real-world entity. However, there are many situations where there are multiple possible candidate resolutions for a data quality issue, but there is not sufficient evidence for determining which of the resolutions is the most appropriate. In this case, the only way forward is to make assumptions to restrict the space of solutions and/or to heuristically choose a resolution based on characteristics that are deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be highly non-trivial and time consuming. For several decades now, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SparQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis accessing this source data. Depending on technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities), or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.