Analyzing Uncertain Tabular Data

Abstract

It is common practice to spend considerable time refining source data to address issues of data quality before beginning any data analysis. For example, an analyst might impute missing values or detect and fuse duplicate records representing the same real-world entity. However, there are many situations where there are multiple candidate resolutions for a data quality issue, but there is not sufficient evidence to determine which of them is the most appropriate. In this case, the only way forward is to make assumptions to restrict the space of solutions and/or to heuristically choose a resolution based on characteristics that are deemed predictive of “good” resolutions. Although it is important for the analyst to understand the impact of these assumptions and heuristic choices on her results, evaluating this impact can be highly non-trivial and time-consuming. For several decades now, the fields of probabilistic, incomplete, and fuzzy databases have developed strategies for analyzing the impact of uncertainty on the outcome of analyses. This general family of uncertainty-aware databases aims to model ambiguity in the results of analyses expressed in standard languages like SQL, SPARQL, R, or Spark. An uncertainty-aware database uses descriptions of potential errors and ambiguities in source data to derive a corresponding description of potential errors or ambiguities in the result of an analysis accessing this source data. Depending on the technique, these descriptions of uncertainty may be either quantitative (bounds, probabilities) or qualitative (certain outcomes, unknown values, explanations of uncertainty). In this chapter, we explore the types of problems that techniques from uncertainty-aware databases address, survey solutions to these problems, and highlight their application to fixing data quality issues.
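
As a rough illustration of the idea of deriving uncertainty in a result from uncertainty in the source data, the sketch below enumerates every "possible world" of a tiny table with one ambiguous imputed value, runs the same analysis on each, and summarizes the outcomes as certain answers, possible answers, and bounds. The table, the `possible_worlds` and `query` helpers, and the threshold of 60 are all hypothetical and are not taken from the chapter; brute-force enumeration is used only because it is easy to read, and it grows exponentially with the number of ambiguous cells.

```python
from itertools import product

# Hypothetical source table: each cell holds a list of candidate values.
# A single-element list means the value is known with certainty.
rows = [
    {"name": ["Alice"], "salary": [50]},
    {"name": ["Bob"], "salary": [40, 70]},  # two candidate imputations
    {"name": ["Carol"], "salary": [65]},
]


def possible_worlds(rows):
    """Yield every table obtained by picking one candidate per cell."""
    per_row = []
    for row in rows:
        keys = list(row)
        combos = product(*(row[k] for k in keys))
        per_row.append([dict(zip(keys, combo)) for combo in combos])
    for world in product(*per_row):
        yield list(world)


def query(table):
    """Return the names of people earning at least 60 and the total salary."""
    names = frozenset(r["name"] for r in table if r["salary"] >= 60)
    total = sum(r["salary"] for r in table)
    return names, total


# Run the same analysis in every possible world.
results = [query(world) for world in possible_worlds(rows)]

# Qualitative description: answers returned in all worlds vs. in some world.
certain = frozenset.intersection(*(names for names, _ in results))
possible = frozenset.union(*(names for names, _ in results))

# Quantitative description: bounds on the aggregate across all worlds.
total_bounds = (min(t for _, t in results), max(t for _, t in results))

print("certain:", sorted(certain))
print("possible:", sorted(possible))
print("total salary bounds:", total_bounds)
```

Here "certain" and "possible" correspond to the qualitative descriptions mentioned in the abstract, and the bounds on the total correspond to the quantitative ones.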

bibtex

@incollection{KG19,
  author = {Kennedy, Oliver and Glavic, Boris},
  booktitle = {Information Quality in Information Fusion and Decision Making},
  doi = {10.1007/978-3-030-03643-0_12},
  editor = {Boss\'{e}, \'{E}loi and Rogova, Galina},
  keywords = {Uncertainty; UA-DB},
  pages = {291--320},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/KG19.pdf},
  projects = {Mimir; UA-DB; Vizier},
  publisher = {Springer},
  title = {Analyzing Uncertain Tabular Data},
  venueshort = {Information Quality in Information Fusion and Decision Making},
  year = {2019}
}

Reference

Analyzing Uncertain Tabular Data. Oliver Kennedy and Boris Glavic. In Information Quality in Information Fusion and Decision Making, Éloi Bossé and Galina Rogova, eds. Springer, 2019, pp. 291–320.