Provenance has been studied extensively for relational queries and shown to be important in revealing the origin and creation process of data produced by potentially complex relational transformations. Provenance for the results of data mining operators, in contrast, has not been considered. We argue that provenance offers the same benefits for mining as for relational queries, e.g., it allows us to track errors caused by incorrect input data. We consider the most common mining operator, frequent itemset mining, and introduce two types of provenance (why- and i-provenance) for this operator. We argue that the concept of why-provenance for relational queries can be adapted for frequent itemsets, but that it poses new computational challenges due to the nature of itemset mining and the size of why-provenance. We address these challenges in two ways. First, we propose combining why-provenance computation with SQL querying to permit users to select smaller and more intuitive representations of the provenance. Second, we propose new compression techniques for the why-provenance. Next, we introduce a new provenance type called i-provenance (itemset provenance) that succinctly represents the interdependencies between items and transactions that explain how a frequent itemset was derived, intuitively giving insight into the structure of the data that provides the evidence for the itemset. We present techniques for efficient storage and use of both types of provenance information and experimentally evaluate the scalability of our approach. We argue through a set of examples that why- and i-provenance can add significant value to mining results: they can be used to analyze the context of the transactions that caused an itemset to be frequent and to understand how combinations of itemsets contribute to a result.
@techreport{SG13,
  author = {Siddique, Javed and Glavic, Boris and Miller, Ren\'{e}e J.},
  date-added = {2013-05-13 14:03:56 +0000},
  date-modified = {2013-05-13 14:18:04 +0000},
  institution = {University of Toronto},
  keywords = {Provenance; Data Mining},
  pdfurl = {http://dblab.cs.toronto.edu/project/provenance4mining/docs/fimprov_main.pdf},
  title = {Provenance Management for Frequent Itemsets},
  venueshort = {Techreport},
  year = {2013},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SG13.pdf}
}