Big Provenance
The sheer amount of data available in the Big Data age necessitates techniques such as classification and aggregation that extract meaningful information from this data for human consumption. To understand the validity of extracted information, a human analyst needs to be able to investigate the extraction process and explore which inputs led to a particular result, i.e., analyze the result's provenance. Besides explaining how a result was derived and from which input data, provenance information is used for auditing, verifying results, resolving conflicts among data sources, establishing ownership of data, and evaluating data quality. The objective of this project is to make provenance usable in Big Data environments. In this context, we study the following research questions:
- How to port provenance tracking techniques from relational databases to Big Data platforms.
- How to leverage data summarization techniques to create compact representations of provenance that are meaningful to a human, and how to create these condensed descriptions without having to store all input data of the extraction process for an indefinite amount of time.
- How to identify which input data items have the most influence on a piece of extracted information.
- How to support efficient, interactive exploration of provenance through iterative refinement of condensed and approximate representations.
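To illustrate the kind of tracking the first question refers to, here is a minimal sketch (not code from this project) of tuple-level provenance for a group-by aggregation: each output value carries the identifiers of the input rows it was derived from, so the question "which inputs lead to this result?" can be answered directly. The row format and function name are illustrative assumptions.

```python
from collections import defaultdict

def sum_by_key_with_lineage(rows):
    """Group-by-sum that records, for every output value, the ids of
    the contributing input rows (a simple form of lineage)."""
    totals = defaultdict(int)   # key -> aggregated value
    lineage = defaultdict(set)  # key -> ids of contributing input rows
    for row_id, key, value in rows:
        totals[key] += value
        lineage[key].add(row_id)
    return {k: (totals[k], lineage[k]) for k in totals}

rows = [
    (1, "sensor_a", 10),
    (2, "sensor_b", 5),
    (3, "sensor_a", 7),
]
result = sum_by_key_with_lineage(rows)
# result["sensor_a"] == (17, {1, 3}): the total and the rows that produced it
```

Eagerly materializing lineage like this is easy but expensive at scale, which is exactly why the questions above ask for compact, approximate representations instead of full input retention.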
Collaborators
- Dieter Gawlick - Oracle
- Vasudha Krishnaswamy - Oracle
- Venkatesh Radhakrishnan
- Zhen Hua Liu - Oracle
Funding
- Oracle - Provenance for Big Data (2017 - 2018), $90,871, PI: Boris Glavic
Publications
- Big Data Provenance: Challenges and Implications for Benchmarking
Boris Glavic
2nd Workshop on Big Data Benchmarking (WBDB 2012), pp. 72–80.
Data Provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data - which we refer to as Big Provenance - is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data/management. Furthermore, we will examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used for identifying and analyzing performance bottlenecks, to compute performance metrics, and to test a system’s ability to exploit commonalities in data and processing.