Explanations beyond Provenance
In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Past work in this area has mostly focused on data provenance, i.e., the input data that was used to derive the result of interest. However, the provenance of a query result often contains only a small fraction of the information that is relevant for explaining an answer. In this work we explore novel types of explanations that are not (just) based on data provenance.
CaJaDe
We propose a new approach for explaining interesting query results by augmenting provenance with information from other related tables in the database. Specifically, given a schema graph that encodes the semantic relationships between the tables of a database schema, we devise algorithms that enrich the provenance of a query result by joining it with data from tables that are connected in the schema graph to the tables appearing in the provenance. Furthermore, we summarize the results of such joins into rich, high-level patterns that serve as explanations.
- CaJaDe is available as open source on GitHub: https://github.com/IITDBGroup/cajade
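To give a flavor of this approach, the following minimal sketch (Python with SQLite; the tables, columns, and query are hypothetical and not CaJaDe's actual code) joins the provenance of a surprising aggregate answer with a related context table suggested by the schema graph and summarizes the augmented provenance into simple attribute-value patterns.

```python
# Minimal sketch of the CaJaDe idea: enrich the provenance of a query result
# with attributes from a "context" table reachable in the schema graph, then
# summarize the augmented provenance into attribute-value patterns.
# All table, column, and data values below are made up for illustration.
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE publication(pid INTEGER, author TEXT, year INTEGER, venue_id INTEGER);
    CREATE TABLE venue(venue_id INTEGER, name TEXT, area TEXT);
    INSERT INTO publication VALUES
        (1,'alice',2019,1),(2,'alice',2019,2),(3,'alice',2019,2),(4,'bob',2019,1);
    INSERT INTO venue VALUES (1,'SIGMOD','databases'),(2,'NeurIPS','machine learning');
""")

# Query of interest: publication counts per author and year. Suppose the
# answer ('alice', 2019, 3) is surprising; its provenance is the set of
# publication tuples that contributed to that count.
provenance = cur.execute(
    "SELECT * FROM publication WHERE author = 'alice' AND year = 2019"
).fetchall()

# A schema-graph edge publication.venue_id -> venue.venue_id tells us that
# venue is a context table we can join with to enrich the provenance.
augmented = cur.execute("""
    SELECT v.area
    FROM publication p JOIN venue v ON p.venue_id = v.venue_id
    WHERE p.author = 'alice' AND p.year = 2019
""").fetchall()

# Summarize the augmented provenance: attribute values that cover many of the
# provenance tuples become candidate pattern-based explanations.
for (area,), cnt in Counter(augmented).most_common():
    print(f"pattern venue.area = {area!r} covers {cnt}/{len(provenance)} provenance tuples")
```

The actual system explores many join paths through the schema graph, considers multi-attribute patterns, and ranks the resulting explanations; the sketch fixes a single one-hop join for readability.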
Cape
Provenance and intervention-based techniques have been used to explain surprising outcomes of aggregate queries based on the outcome's provenance. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increase in their publications in another venue in the same year. In this project we investigate how to mine patterns that describe inherent trends in the data and how to use these patterns to identify potential causes for an outcome of interest.
As an initial contribution we have developed a novel system called Cape (Counterbalancing with Aggregation Patterns for Explanations) for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We have developed efficient methods for mining such aggregate regression patterns (ARPs) and have demonstrated how to use ARPs to generate and rank explanations.
- Cape is available as open source on GitHub: https://github.com/IITDBGroup/cape
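As an illustration of the counterbalancing idea, the toy sketch below (plain Python with made-up publication counts and a deliberately simplified pattern model; this is not Cape's actual implementation) fits a per-venue pattern over an author's yearly counts, flags the surprisingly low result, and reports a related outlier in the opposite direction as a candidate explanation.

```python
# Toy sketch of counterbalancing: model each venue's yearly count by its mean,
# treat large residuals as outliers, and explain a low outlier by a related
# high outlier (same year, different venue). Data and thresholds are made up.
from statistics import mean

# (venue, year) -> publication count for one author
counts = {
    ("SIGMOD", 2016): 5, ("SIGMOD", 2017): 5, ("SIGMOD", 2018): 5, ("SIGMOD", 2019): 1,
    ("VLDB", 2016): 3, ("VLDB", 2017): 3, ("VLDB", 2018): 3, ("VLDB", 2019): 7,
}

# Simplified "aggregate regression pattern": for a fixed venue, the yearly
# count is roughly constant, so we fit the per-venue mean and look at residuals.
venues = {v for v, _ in counts}
model = {v: mean(c for (v2, _), c in counts.items() if v2 == v) for v in venues}
residual = {k: c - model[k[0]] for k, c in counts.items()}

outlier = ("SIGMOD", 2019)  # the surprisingly low result we want to explain
print(f"outlier {outlier}: residual {residual[outlier]:+.1f}")

# A counterbalance is an outlier in the opposite direction that is related to
# the outlier of interest -- here, the same year but a different venue.
for (venue, year), r in residual.items():
    if year == outlier[1] and venue != outlier[0] and r * residual[outlier] < 0:
        print(f"counterbalance candidate: ({venue}, {year}) with residual {r:+.1f}")
```

Cape mines ARPs over many combinations of group-by attributes and regression models and ranks candidate explanations with a scoring function; the sketch hard-codes one pattern shape and one notion of relatedness for brevity.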
Collaborators
- Sudeepa Roy - Duke University
- Zhengjie Miao - Duke University
Publications
- Hybrid Query and Instance Explanations and Repairs. Seokki Lee, Boris Glavic, Adriane Chapman and Bertram Ludäscher. Companion Proceedings of the ACM Web Conference 2023 (WWW 2023), pp. 1559–1562, 2023. doi: 10.1145/3543873.3587565
- Effect of Pre-processing Data on Fairness and Fairness Debugging Using Gopher. Mousam Sarkar. Master's thesis, Illinois Institute of Technology, 2022.
- Generating Interpretable Data-Based Explanations for Fairness Debugging using Gopher. Jiongli Zhu, Romila Pradhan, Boris Glavic and Babak Salimi. Proceedings of the 48th International Conference on Management of Data (SIGMOD), Demonstration Track, pp. 2433–2436, 2022. doi: 10.1145/3514221.3520170
- CaJaDE: Explaining Query Results by Augmenting Provenance with Context. Chenjie Li, Juseung Lee, Zhengjie Miao, Boris Glavic and Sudeepa Roy. Proceedings of the VLDB Endowment (Demonstration Track), 15(12), pp. 3594–3597, 2022. doi: 10.14778/3554821.3554852
- Enhancing Explanation Generation in the CaJaDE system using Interactive User Feedback. Juseung Lee. Master's thesis, Illinois Institute of Technology, 2022.
- Interpretable Data-Based Explanations for Fairness Debugging. Babak Salimi, Romila Pradhan, Jiongli Zhu and Boris Glavic. Proceedings of the 48th International Conference on Management of Data (SIGMOD), pp. 247–261, 2022. doi: 10.1145/3514221.3517886
- Putting Things into Context: Rich Explanations for Query Answers using Join Graphs. Chenjie Li, Zhengjie Miao, Qitian Zeng, Boris Glavic and Sudeepa Roy. Proceedings of the 46th International Conference on Management of Data (SIGMOD), pp. 1051–1063, 2021. doi: 10.1145/3448016.3459246
In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly relied on data provenance, i.e., input tuples contributing to the result(s) of interest. However, some information that is relevant for explaining an answer may not be contained in the provenance. We propose a new approach for explaining query results by augmenting provenance with information from other related tables in the database. Using a suite of optimization techniques, we demonstrate experimentally on real datasets and through a user study that our approach is efficient and produces meaningful results.
- CAPE: Explaining Outliers by Counterbalancing. Zhengjie Miao, Qitian Zeng, Chenjie Li, Boris Glavic, Oliver Kennedy and Sudeepa Roy. Proceedings of the VLDB Endowment (Demonstration Track), 12(12), pp. 1806–1809, 2019. doi: 10.14778/3352063.3352071
In this demonstration we showcase Cape, a system that explains surprising aggregation outcomes. In contrast to previous work, which relies exclusively on provenance, Cape applies a novel approach for explaining outliers in aggregation queries through counterbalancing (outliers in the opposite direction). The foundations of our approach are aggregate regression patterns (ARPs), based on which we define outliers, and an efficient explanation generation algorithm that utilizes these patterns. In the demonstration, the audience can run aggregation queries over real-world datasets and browse the patterns and explanations returned by Cape for outliers in the results of these queries.
- Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances. Zhengjie Miao, Qitian Zeng, Boris Glavic and Sudeepa Roy. Proceedings of the 44th International Conference on Management of Data (SIGMOD), pp. 485–502, 2019. doi: 10.1145/3299869.3300066
Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.