Provenance for Data Mining
by Boris Glavic, Javed Siddique, Periklis Andritsos, Renée J. Miller
Abstract:
Data mining aims at extracting useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output summarizing the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful, if the user is not able to understand on which input data they are based and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use-cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.
Reference:
Provenance for Data Mining (Boris Glavic, Javed Siddique, Periklis Andritsos, Renée J. Miller), In Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2013.
Bibtex Entry:
@inproceedings{GS13,
	Abstract = {Data mining aims at extracting useful information from large datasets. Most data mining approaches reduce the input data to produce a smaller output summarizing the mining result. While the purpose of data mining (extracting information) necessitates this reduction in size, the loss of information it entails can be problematic. Specifically, the results of data mining may be more confusing than insightful, if the user is not able to understand on which input data they are based and how they were created. In this paper, we argue that the user needs access to the provenance of mining results. Provenance, while extensively studied by the database, workflow, and distributed systems communities, has not yet been considered for data mining. We analyze the differences between database, workflow, and data mining provenance, suggest new types of provenance, and identify new use-cases for provenance in data mining. To illustrate our ideas, we present a more detailed discussion of these concepts for two typical data mining algorithms: frequent itemset mining and multi-dimensional scaling.},
	Author = {Boris Glavic and Javed Siddique and Periklis Andritsos and Ren\'{e}e J. Miller},
	Booktitle = {Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance (TaPP)},
	Date-Added = {2013-05-13 14:00:58 +0000},
	Date-Modified = {2013-05-13 14:01:56 +0000},
	Keywords = {Provenance},
	Slideurl = {http://www.slideshare.net/lordPretzel/tapp-2013},
	Title = {Provenance for Data Mining},
	Url = {http://cs.iit.edu/%7edbgroup/pdfpubls/GS13.pdf},
	Venueshort = {TaPP},
	Year = {2013},
	Bdsk-Url-1 = {http://cs.iit.edu/%7edbgroup/pdfpubls/GS13.pdf}}