Pug

Provenance Unification through Graphs with Summarization (PUGS)

Provenance for relational queries records how results of a query depend on the query’s inputs. This type of information can be used to explain why (and how) a result is derived by a query over a given database. Recently, approaches have been developed that use provenance-like techniques to explain why a tuple (or a set of tuples described declaratively by a pattern) is missing from the query result. However, the two problems of computing provenance and explaining missing answers have been treated mostly in isolation.

Unifying Why and Why-not Provenance

In collaboration with UIUC, we have developed a unified approach for efficiently computing provenance (why) and missing answers (why-not). This approach is based on the observation that in provenance model for queries with unsafe negation, why-not questions can be translated into why questions and vice versa.

Typically only a part of the provenance, which we call explanation, is actually relevant for answering the user’s provenance question about the existence or absence of a result. We have developed an approach that tries to restrict provenance capture to what is relevant to explain the outcome of interest specified by the user.

Approximate Summarization of Provenance

While explanations are useful for reducing the size of provenance, the result may still overwhelm users with too much information and waste computational resources. In particular for why-not, where the provenance explains all failed ways of how a result could have been derived using the rules of a query, provenance graphs may be too large to compute even for small datasets. We address the computational and usability challenges of large provenance graphs by creating summaries based on structural commonalities that exist in the provenance. Importantly, our approach computes summaries based on a sample of the provenance only and, thus, avoids the computationally infeasible step of generating the full why-not provenance for a user question. We have implemented these techniques in our PUGS system.

The PUGS System

We have build a system which supports computing explanations for a user's why and why-not provenance question as a, optionally summarized, provenance graph. Our approach offloads the computation that captures relevant provenance and summarizes it to a datalog engine or relational database backend. The figure below gives an overview of the architecture of the proposed approach.

Collaborators

Bertram Ludäscher - UIUC
Sven Köhler - Google

Publications

Approximate Summaries for Why and Why-not Provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment. 13, 6 (2020) , 912–924.
```
@article{LL20,
  author = {Lee, Seokki and Lud\"ascher, Bertram and Glavic, Boris},
  journal = {Proceedings of the VLDB Endowment},
  keywords = {PUGS; Summarization; Sampling; Missing Answers; Datalog},
  longversionurl = {https://arxiv.org/pdf/2002.00084},
  projects = {PUGS},
  number = {6},
  pages = {912 - 924},
  pdfurl = {http://www.vldb.org/pvldb/vol13/p912-lee.pdf},
  title = {{Approximate Summaries for Why and Why-not Provenance}},
  venueshort = {PVLDB},
  volume = {13},
  year = {2020}
}
```
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance and — to a lesser degree — why provenance can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and to generate comprehensive and meaningful summaries.
details

Why and Why-Not Provenance for Queries with Negation
Seokki Lee
Illinois Institute of Technology.

pdf
details

@phdthesis{lee-20-wwnpqn,
  venueshort = {PhD Thesis},
  author = {Lee, Seokki},
  keywords = {PUGS; Summarization; Missing Answers; Datalog},
  month = may,
  pdfurl = {https://search-proquest-com.ezproxy.gl.iit.edu/pqdtglobal/docview/2424512806/fulltextPDF/4B76E2738F99473BPQ/1?accountid=28377},
  project = {PUGS},
  school = {Illinois Institute of Technology},
  title = {Why and Why-Not Provenance for Queries with Negation},
  year = {2020}
}

details

PUG: a framework and practical implementation for why and why-not provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
The VLDB Journal. 28, 1 (Aug. 2019) , 47—71.
```
@article{LL18,
  author = {Lee, Seokki and Lud{\"a}scher, Bertram and Glavic, Boris},
  date-added = {2018-08-29 19:09:06 -0500},
  date-modified = {2018-08-29 19:09:33 -0500},
  day = {23},
  doi = {10.1007/s00778-018-0518-5},
  issn = {0949-877X},
  issue = {1},
  journal = {The VLDB Journal},
  keywords = {Datalog; Provenance; Missing Answers; Semirings; PUGS},
  longversionurl = {https://arxiv.org/pdf/1808.05752.pdf},
  month = aug,
  pages = {47---71},
  projects = {PUGS},
  title = {PUG: a framework and practical implementation for why and why-not provenance},
  venueshort = {VLDBJ},
  volume = {28},
  year = {2019}
}
```
Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency.
details

Provenance Summaries for Answers and Non-Answers
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment (Demonstration Track). 11, 12 (2018) , 1954–1957.

pdf
details

@article{LGG18,
  author = {Lee, Seokki and Lud{\"{a}}scher, Bertram and Glavic, Boris},
  journal = {Proceedings of the VLDB Endowment (Demonstration Track)},
  keywords = {PUGS; Datalog; Provenance; Missing Answers},
  number = {12},
  pages = {1954--1957},
  pdfurl = {http://www.vldb.org/pvldb/vol11/p1954-lee.pdf},
  projects = {PUGS},
  title = {Provenance Summaries for Answers and Non-Answers},
  venueshort = {{PVLDB}},
  volume = {11},
  year = {2018}
}

details

A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 33rd IEEE International Conference on Data Engineering (2017), pp. 485–496.

pdf
details

@inproceedings{LS17,
  author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris},
  booktitle = {Proceedings of the 33rd IEEE International Conference on Data Engineering},
  keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS},
  pages = {485-496},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS17.pdf},
  projects = {GProM; PUGS},
  title = {A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries},
  venueshort = {ICDE},
  year = {2017}
}

details

Integrating Approximate Summarization with Provenance Capture
Seokki Lee, Xing Niu, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (2017).

pdf
details

@inproceedings{SN17,
  author = {Lee, Seokki and Niu, Xing and Lud\"{a}scher, Bertram and Glavic, Boris},
  booktitle = {Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance},
  isworkshop = {true},
  keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf},
  projects = {GProM; PUGS},
  title = {Integrating Approximate Summarization with Provenance Capture},
  venueshort = {TaPP},
  year = {2017},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf}
}

details

Implementing Unified Why- and Why-Not Provenance Through Games
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (2016).

pdf
details

@inproceedings{LS16,
  author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris},
  booktitle = {Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster)},
  isworkshop = {true},
  keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf},
  projects = {PUGS},
  title = {{Implementing Unified Why- and Why-Not Provenance Through Games}},
  venueshort = {TaPP},
  year = {2016},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf}
}

details

An Efficient Implementation of Game Provenance in DBMS
Seokki Lee, Yuchen Tang, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2015-02
Illinois Institute of Technology.

pdf
details

@techreport{LW15a,
  author = {Lee, Seokki and Tang, Yuchen and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris},
  date-modified = {2015-10-22 12:15:28 +0000},
  institution = {Illinois Institute of Technology},
  keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS},
  number = {IIT/CS-DB-2015-02},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf},
  projects = {PUGS},
  title = {An Efficient Implementation of Game Provenance in DBMS},
  venueshort = {Techreport},
  year = {2015},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf}
}

details

Towards Constraint-based Explanations for Answers and Non-Answers
Boris Glavic, Sven Köhler, Sean Riddle and Bertram Ludäscher
Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance (2015).

@inproceedings{GK15,
  author = {Glavic, Boris and K\"{o}hler, Sven and Riddle, Sean and Lud\"{a}scher, Bertram},
  booktitle = {Proceedings of the 7th USENIX Workshop on the Theory and Practice of Provenance},
  isworkshop = {true},
  keywords = {Provenance;Missing Answers;Summarization;Datalog;Game Provenance; PUGS},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GK15.pdf},
  projects = {PUGS},
  slideurl = {http://www.slideshare.net/lordPretzel/2015-ta-ppwhynotpptx},
  title = {Towards Constraint-based Explanations for Answers and Non-Answers},
  venueshort = {TaPP},
  year = {2015},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/GK15.pdf}
}

details