Provenance Unification through Graphs with Summarization (PUGS)
Provenance for relational queries records how results of a query depend on the query’s inputs. This type of information can be used to explain why (and how) a result is derived by a query over a given database. Recently, approaches have been developed that use provenance-like techniques to explain why a tuple (or a set of tuples described declaratively by a pattern) is missing from the query result. However, the two problems of computing provenance and explaining missing answers have been treated mostly in isolation.
Unifying Why and Why-not Provenance
In collaboration with UIUC, we have developed a unified approach for efficiently computing provenance (why) and missing answers (why-not). This approach is based on the observation that in provenance model for queries with unsafe negation, why-not questions can be translated into why questions and vice versa.
Typically only a part of the provenance, which we call explanation, is actually relevant for answering the user’s provenance question about the existence or absence of a result. We have developed an approach that tries to restrict provenance capture to what is relevant to explain the outcome of interest specified by the user.
Approximate Summarization of Provenance
While explanations are useful for reducing the size of provenance, the result may still overwhelm users with too much information and waste computational resources. In particular for why-not, where the provenance explains all failed ways of how a result could have been derived using the rules of a query, provenance graphs may be too large to compute even for small datasets. We address the computational and usability challenges of large provenance graphs by creating summaries based on structural commonalities that exist in the provenance. Importantly, our approach computes summaries based on a sample of the provenance only and, thus, avoids the computationally infeasible step of generating the full why-not provenance for a user question. We have implemented these techniques in our PUGS system.
The PUGS System
We have build a system which supports computing explanations for a user's why and why-not provenance question as a, optionally summarized, provenance graph. Our approach offloads the computation that captures relevant provenance and summarizes it to a datalog engine or relational database backend. The figure below gives an overview of the architecture of the proposed approach.
Collaborators
- Bertram Ludäscher - UIUC
- Sven Köhler - Google
Publications
-
Approximate Summaries for Why and Why-not Provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment. 13, 6 (2020) , 912–924.@article{LL20, author = {Lee, Seokki and Lud\"ascher, Bertram and Glavic, Boris}, journal = {Proceedings of the VLDB Endowment}, keywords = {PUGS; Summarization; Sampling; Missing Answers; Datalog}, longversionurl = {https://arxiv.org/pdf/2002.00084}, projects = {PUGS}, number = {6}, pages = {912 - 924}, pdfurl = {http://www.vldb.org/pvldb/vol13/p912-lee.pdf}, title = {{Approximate Summaries for Why and Why-not Provenance}}, venueshort = {PVLDB}, volume = {13}, year = {2020} }
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance and — to a lesser degree — why provenance can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and to generate comprehensive and meaningful summaries.
-
Why and Why-Not Provenance for Queries with Negation
Seokki Lee
Illinois Institute of Technology.@phdthesis{lee-20-wwnpqn, venueshort = {PhD Thesis}, author = {Lee, Seokki}, keywords = {PUGS; Summarization; Missing Answers; Datalog}, month = may, pdfurl = {https://search-proquest-com.ezproxy.gl.iit.edu/pqdtglobal/docview/2424512806/fulltextPDF/4B76E2738F99473BPQ/1?accountid=28377}, project = {PUGS}, school = {Illinois Institute of Technology}, title = {Why and Why-Not Provenance for Queries with Negation}, year = {2020} }
-
PUG: a framework and practical implementation for why and why-not provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
The VLDB Journal. 28, 1 (Aug. 2019) , 47—71.@article{LL18, author = {Lee, Seokki and Lud{\"a}scher, Bertram and Glavic, Boris}, date-added = {2018-08-29 19:09:06 -0500}, date-modified = {2018-08-29 19:09:33 -0500}, day = {23}, doi = {10.1007/s00778-018-0518-5}, issn = {0949-877X}, issue = {1}, journal = {The VLDB Journal}, keywords = {Datalog; Provenance; Missing Answers; Semirings; PUGS}, longversionurl = {https://arxiv.org/pdf/1808.05752.pdf}, month = aug, pages = {47---71}, projects = {PUGS}, title = {PUG: a framework and practical implementation for why and why-not provenance}, venueshort = {VLDBJ}, volume = {28}, year = {2019} }
Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency.
-
Provenance Summaries for Answers and Non-Answers
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment (Demonstration Track). 11, 12 (2018) , 1954–1957.@article{LGG18, author = {Lee, Seokki and Lud{\"{a}}scher, Bertram and Glavic, Boris}, journal = {Proceedings of the VLDB Endowment (Demonstration Track)}, keywords = {PUGS; Datalog; Provenance; Missing Answers}, number = {12}, pages = {1954--1957}, pdfurl = {http://www.vldb.org/pvldb/vol11/p1954-lee.pdf}, projects = {PUGS}, title = {Provenance Summaries for Answers and Non-Answers}, venueshort = {{PVLDB}}, volume = {11}, year = {2018} }
-
A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 33rd IEEE International Conference on Data Engineering (2017), pp. 485–496.@inproceedings{LS17, author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 33rd IEEE International Conference on Data Engineering}, keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS}, pages = {485-496}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS17.pdf}, projects = {GProM; PUGS}, title = {A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries}, venueshort = {ICDE}, year = {2017} }
-
Integrating Approximate Summarization with Provenance Capture
Seokki Lee, Xing Niu, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (2017).@inproceedings{SN17, author = {Lee, Seokki and Niu, Xing and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance}, isworkshop = {true}, keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf}, projects = {GProM; PUGS}, title = {Integrating Approximate Summarization with Provenance Capture}, venueshort = {TaPP}, year = {2017}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf} }
-
Implementing Unified Why- and Why-Not Provenance Through Games
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (2016).@inproceedings{LS16, author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster)}, isworkshop = {true}, keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf}, projects = {PUGS}, title = {{Implementing Unified Why- and Why-Not Provenance Through Games}}, venueshort = {TaPP}, year = {2016}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf} }
-
An Efficient Implementation of Game Provenance in DBMS
Seokki Lee, Yuchen Tang, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2015-02
Illinois Institute of Technology.@techreport{LW15a, author = {Lee, Seokki and Tang, Yuchen and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, date-modified = {2015-10-22 12:15:28 +0000}, institution = {Illinois Institute of Technology}, keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS}, number = {IIT/CS-DB-2015-02}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf}, projects = {PUGS}, title = {An Efficient Implementation of Game Provenance in DBMS}, venueshort = {Techreport}, year = {2015}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf} }
-
Towards Constraint-based Explanations for Answers and Non-Answers
Boris Glavic, Sven Köhler, Sean Riddle and Bertram Ludäscher
Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (2015).@inproceedings{GK15, author = {Glavic, Boris and K\"{o}hler, Sven and Riddle, Sean and Lud\"{a}scher, Bertram}, booktitle = {Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance}, isworkshop = {true}, keywords = {Provenance;Missing Answers;Summarization;Datalog;Game Provenance; PUGS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GK15.pdf}, projects = {PUGS}, slideurl = {http://www.slideshare.net/lordPretzel/2015-ta-ppwhynotpptx}, title = {Towards Constraint-based Explanations for Answers and Non-Answers}, venueshort = {TaPP}, year = {2015}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GK15.pdf} }