Big Data Provenance: Challenges and Implications for Benchmarking
Authors
Boris Glavic
Abstract
Data Provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data - which we refer to as Big Provenance - is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data/management. Furthermore, we will examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used for identifying and analyzing performance bottlenecks, to compute performance metrics, and to test a system's ability to exploit commonalities in data and processing.
Reference
Big Data Provenance: Challenges and Implications for Benchmarking (Boris Glavic), In 2nd Workshop on Big Data Benchmarking (WBDB), 2012.
Bibtex Entry
@inproceedings{G13,
	Abstract = {Data Provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data - which we refer to as Big Provenance - is a largely unexplored field. This paper reviews existing approaches for large-scale distributed provenance and discusses potential challenges for Big Data benchmarks that aim to incorporate provenance data/management. Furthermore, we will examine how Big Data benchmarking could benefit from different types of provenance information. We argue that provenance can be used for identifying and analyzing performance bottlenecks, to compute performance metrics, and to test a system's ability to exploit commonalities in data and processing.},
	Author = {Boris Glavic},
	Booktitle = {2nd Workshop on Big Data Benchmarking},
	Keywords = {Big Data; Provenance; Big Provenance},
	Pages = {72--80},
	Slideurl = {http://www.slideshare.net/lordPretzel/wbdb-2012-wbdb},
	Title = {Big Data Provenance: Challenges and Implications for Benchmarking},
	Url = {http://cs.iit.edu/%7edbgroup/pdfpubls/G13.pdf},
	Venueshort = {WBDB},
	Year = {2012},
	Bdsk-Url-1 = {http://cs.iit.edu/%7edbgroup/pdfpubls/G13.pdf}}