Seokki Lee, Ph.D. Student (Alumnus)
I received master's degrees in Computer Science and Engineering from Hanyang University in 2009 and in Engineering Management from California State University, Northridge in 2014. I joined the IIT DBGroup in Fall 2014 after several years of work experience. In Fall 2020, I joined the University of Cincinnati as an Assistant Professor. You can find my new homepage here and my research group's webpage here.
Awards
- IIT Dissertation Fellowship (2019)
- IEEE ICDE Travel Grant (2017)
Teaching
I have been a TA for the following courses:
- 2019 Spring: CS525 - Advanced Database Organization
- 2017 Fall: CS525 - Advanced Database Organization
- 2017 Spring: CS525 - Advanced Database Organization
- 2016 Spring: CS520 - Data Integration, Warehousing, and Provenance
Research Projects
My main research interests are Databases, Data Exchange and Integration, and Data Provenance. I have been involved in the following research projects:
- GProM - A database-independent middleware for computing the provenance of queries, updates, and transactions
- PUGS - PUGS is a unified framework for capturing why and why-not provenance of Datalog queries with negation and for automatic generation of concise provenance summaries.
- Vagabond - Automatic generation of explanations for data exchange errors.
Publications
- Hybrid Query and Instance Explanations and Repairs
Seokki Lee, Boris Glavic, Adriane Chapman and Bertram Ludäscher
Companion Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023 (2023), pp. 1559–1562.
@inproceedings{LG23, author = {Lee, Seokki and Glavic, Boris and Chapman, Adriane and Lud\"ascher, Bertram}, editor = {Ding, Ying and Tang, Jie and Sequeda, Juan F. and Aroyo, Lora and Castillo, Carlos and Houben, Geert-Jan}, title = {Hybrid Query and Instance Explanations and Repairs}, booktitle = {Companion Proceedings of the {ACM} Web Conference 2023, {WWW} 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023}, pages = {1559--1562}, publisher = {{ACM}}, year = {2023}, url = {https://doi.org/10.1145/3543873.3587565}, doi = {10.1145/3543873.3587565}, timestamp = {Wed, 17 May 2023 21:55:45 +0200}, biburl = {https://dblp.org/rec/conf/www/LeeGCL23.bib}, venueshort = {TaPP}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LG23.pdf}, keywords = {Hybrid Explanations; Provenance; Why-not}, bibsource = {dblp computer science bibliography, https://dblp.org} }
- Debugging Missing Answers for Spark Queries over Nested Data with Breadcrumb
Ralf Diestelkämper, Seokki Lee, Melanie Herschel and Boris Glavic
Proceedings of the VLDB Endowment (Demonstration Track). 14, 12 (2021), 2731–2734.
@article{DL21a, author = {Diestelk{\"a}mper, Ralf and Lee, Seokki and Herschel, Melanie and Glavic, Boris}, keywords = {Provenance; Missing Answers}, title = {Debugging Missing Answers for Spark Queries over Nested Data with Breadcrumb}, journal = {Proceedings of the VLDB Endowment (Demonstration Track)}, pages = {2731--2734}, volume = {14}, issue = {12}, video = {https://www.youtube.com/watch?v=Y0uWqdtWGGw}, pdfurl = {http://vldb.org/pvldb/vol14/p2731-diestelkamper.pdf}, doi = {10.14778/3476311.3476331}, year = {2021}, venueshort = {{PVLDB}} }
- To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data
Ralf Diestelkämper, Seokki Lee, Melanie Herschel and Boris Glavic
Proceedings of the 46th International Conference on Management of Data (2021), pp. 405–417.
@inproceedings{DL21, author = {Diestelk{\"a}mper, Ralf and Lee, Seokki and Herschel, Melanie and Glavic, Boris}, booktitle = {Proceedings of the 46th International Conference on Management of Data}, pages = {405--417}, projects = {}, pdfurl = {https://dl.acm.org/doi/pdf/10.1145/3448016.3457249}, title = {To not miss the forest for the trees - A holistic approach for explaining missing answers over nested data}, doi = {10.1145/3448016.3457249}, video = {https://www.youtube.com/watch?v=q_YCcP5mGIk&list=PL3xUNnH4TdbsfndCMn02BqAAgGB0z7cwq}, keywords = {Provenance; Missing Answers}, venueshort = {SIGMOD}, longversionurl = {https://arxiv.org/pdf/2103.07561}, year = {2021} }
Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful, e.g., to debug complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify.
- Approximate Summaries for Why and Why-not Provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment. 13, 6 (2020), 912–924.
@article{LL20, author = {Lee, Seokki and Lud\"ascher, Bertram and Glavic, Boris}, journal = {Proceedings of the VLDB Endowment}, keywords = {PUGS; Summarization; Sampling; Missing Answers; Datalog}, longversionurl = {https://arxiv.org/pdf/2002.00084}, projects = {PUGS}, number = {6}, pages = {912--924}, pdfurl = {http://www.vldb.org/pvldb/vol13/p912-lee.pdf}, title = {{Approximate Summaries for Why and Why-not Provenance}}, venueshort = {PVLDB}, volume = {13}, year = {2020} }
Why and why-not provenance have been studied extensively in recent years. However, why-not provenance and — to a lesser degree — why provenance can be very large, resulting in severe scalability and usability challenges. We introduce a novel approximate summarization technique for provenance to address these challenges. Our approach uses patterns to encode why and why-not provenance concisely. We develop techniques for efficiently computing provenance summaries that balance informativeness, conciseness, and completeness. To achieve scalability, we integrate sampling techniques into provenance capture and summarization. Our approach is the first to both scale to large datasets and to generate comprehensive and meaningful summaries.
- Why and Why-Not Provenance for Queries with Negation
Seokki Lee
Illinois Institute of Technology.
@phdthesis{lee-20-wwnpqn, venueshort = {PhD Thesis}, author = {Lee, Seokki}, keywords = {PUGS; Summarization; Missing Answers; Datalog}, month = may, pdfurl = {https://search-proquest-com.ezproxy.gl.iit.edu/pqdtglobal/docview/2424512806/fulltextPDF/4B76E2738F99473BPQ/1?accountid=28377}, project = {PUGS}, school = {Illinois Institute of Technology}, title = {Why and Why-Not Provenance for Queries with Negation}, year = {2020} }
- PUG: a framework and practical implementation for why and why-not provenance
Seokki Lee, Bertram Ludäscher and Boris Glavic
The VLDB Journal. 28, 1 (Aug. 2019), 47–71.
@article{LL18, author = {Lee, Seokki and Lud{\"a}scher, Bertram and Glavic, Boris}, date-added = {2018-08-29 19:09:06 -0500}, date-modified = {2018-08-29 19:09:33 -0500}, day = {23}, doi = {10.1007/s00778-018-0518-5}, issn = {0949-877X}, issue = {1}, journal = {The VLDB Journal}, keywords = {Datalog; Provenance; Missing Answers; Semirings; PUGS}, longversionurl = {https://arxiv.org/pdf/1808.05752.pdf}, month = aug, pages = {47--71}, projects = {PUGS}, title = {PUG: a framework and practical implementation for why and why-not provenance}, venueshort = {VLDBJ}, volume = {28}, year = {2019} }
Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency.
- Query-based Why-not Explanations for Nested Data
Ralf Diestelkämper, Boris Glavic, Melanie Herschel and Seokki Lee
Proceedings of the 11th USENIX Workshop on the Theory and Practice of Provenance (2019).
@inproceedings{DG19a, author = {Diestelk\"amper, Ralf and Glavic, Boris and Herschel, Melanie and Lee, Seokki}, booktitle = {Proceedings of the 11th USENIX Workshop on the Theory and Practice of Provenance}, isworkshop = {true}, keywords = {Provenance; Missing Answers}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/DG19.pdf}, title = {Query-based Why-not Explanations for Nested Data}, venueshort = {TaPP}, year = {2019} }
We present the first query-based approach for explaining missing answers to queries over nested relational data, a common data format used by big data systems such as Apache Spark. Our main contributions are a novel way to define query-based why-not provenance based on repairs to queries, and an implementation and preliminary experiments for answering such queries in Spark.
- GProM - A Swiss Army Knife for Your Provenance Needs
Bahareh Arab, Su Feng, Boris Glavic, Seokki Lee, Xing Niu and Qitian Zeng
IEEE Data Engineering Bulletin. 41, 1 (2018), 51–62.
@article{AF18, author = {Arab, Bahareh and Feng, Su and Glavic, Boris and Lee, Seokki and Niu, Xing and Zeng, Qitian}, bibsource = {dblp computer science bibliography, https://dblp.org}, biburl = {https://dblp.org/rec/bib/journals/debu/ArabFGLNZ17}, journal = {{IEEE} Data Engineering Bulletin}, keywords = {GProM; Provenance; Annotations}, number = {1}, pages = {51--62}, pdfurl = {http://sites.computer.org/debull/A18mar/p51.pdf}, projects = {GProM; Reenactment}, timestamp = {Fri, 02 Mar 2018 18:50:49 +0100}, title = {{GProM} - {A} Swiss Army Knife for Your Provenance Needs}, venueshort = {Data Eng. Bull.}, volume = {41}, year = {2018}, bdsk-url-1 = {http://sites.computer.org/debull/A18mar/p51.pdf} }
- Provenance Summaries for Answers and Non-Answers
Seokki Lee, Bertram Ludäscher and Boris Glavic
Proceedings of the VLDB Endowment (Demonstration Track). 11, 12 (2018), 1954–1957.
@article{LGG18, author = {Lee, Seokki and Lud{\"{a}}scher, Bertram and Glavic, Boris}, journal = {Proceedings of the VLDB Endowment (Demonstration Track)}, keywords = {PUGS; Datalog; Provenance; Missing Answers}, number = {12}, pages = {1954--1957}, pdfurl = {http://www.vldb.org/pvldb/vol11/p1954-lee.pdf}, projects = {PUGS}, title = {Provenance Summaries for Answers and Non-Answers}, venueshort = {{PVLDB}}, volume = {11}, year = {2018} }
- Integrating Approximate Summarization with Provenance Capture
Seokki Lee, Xing Niu, Bertram Ludäscher and Boris Glavic
Proceedings of the 9th USENIX Workshop on the Theory and Practice of Provenance (2017).
@inproceedings{SN17, author = {Lee, Seokki and Niu, Xing and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 9th USENIX Workshop on the Theory and Practice of Provenance}, isworkshop = {true}, keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf}, projects = {GProM; PUGS}, title = {Integrating Approximate Summarization with Provenance Capture}, venueshort = {TaPP}, year = {2017}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SN17.pdf} }
- Debugging Transactions and Tracking their Provenance with Reenactment
Xing Niu, Boris Glavic, Seokki Lee, Bahareh Arab, Dieter Gawlick, Zhen Hua Liu, Vasudha Krishnaswamy, Su Feng and Xun Zou
Proceedings of the VLDB Endowment (Demonstration Track). 10, 12 (2017), 1857–1860.
@article{NG17, author = {Niu, Xing and Glavic, Boris and Lee, Seokki and Arab, Bahareh and Gawlick, Dieter and Liu, Zhen Hua and Krishnaswamy, Vasudha and Feng, Su and Zou, Xun}, journal = {Proceedings of the VLDB Endowment (Demonstration Track)}, keywords = {Provenance; GProM; Reenactment; Debugging; Concurrency Control}, number = {12}, pages = {1857--1860}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/XG17.pdf}, projects = {GProM; Reenactment}, title = {Debugging Transactions and Tracking their Provenance with Reenactment}, venueshort = {PVLDB}, volume = {10}, year = {2017}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/XG17.pdf} }
- Adaptive Schema Databases
William Spoth, Bahareh Arab, Eric S. Chan, Dieter Gawlick, Adel Ghoneimy, Boris Glavic, Beda Hammerschmidt, Oliver Kennedy, Seokki Lee, Zhen Hua Liu, Xing Niu and Ying Yang
Proceedings of the 8th Biennial Conference on Innovative Data Systems (2017).
@inproceedings{SA17, author = {Spoth, William and Arab, Bahareh and Chan, Eric S. and Gawlick, Dieter and Ghoneimy, Adel and Glavic, Boris and Hammerschmidt, Beda and Kennedy, Oliver and Lee, Seokki and Liu, Zhen Hua and Niu, Xing and Yang, Ying}, booktitle = {Proceedings of the 8th Biennial Conference on Innovative Data Systems}, keywords = {Schema Evolution; Data Integration}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SA17.pdf}, projects = {Vizier}, title = {{Adaptive Schema Databases}}, venueshort = {CIDR}, year = {2017}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/SA17.pdf} }
- A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 33rd IEEE International Conference on Data Engineering (2017), pp. 485–496.
@inproceedings{LS17, author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 33rd IEEE International Conference on Data Engineering}, keywords = {Provenance; Datalog; GProM; Missing Answers; Game Provenance; PUGS}, pages = {485--496}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS17.pdf}, projects = {GProM; PUGS}, title = {A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries}, venueshort = {ICDE}, year = {2017} }
- Efficiently Computing Provenance Graphs for Queries with Negation
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2016-03
Illinois Institute of Technology.
@techreport{LS16a, author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, date-modified = {2016-10-20 12:15:28 +0000}, institution = {Illinois Institute of Technology}, keywords = {Provenance; Datalog; GProM; Missing Answers}, number = {IIT/CS-DB-2016-03}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16a.pdf}, projects = {GProM; PUGS}, title = {Efficiently Computing Provenance Graphs for Queries with Negation}, venueshort = {Techreport}, year = {2016}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16a.pdf} }
- Implementing Unified Why- and Why-Not Provenance Through Games
Seokki Lee, Sven Köhler, Bertram Ludäscher and Boris Glavic
Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster) (2016).
@inproceedings{LS16, author = {Lee, Seokki and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, booktitle = {Proceedings of the 8th USENIX Workshop on the Theory and Practice of Provenance (Poster)}, isworkshop = {true}, keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf}, projects = {PUGS}, title = {{Implementing Unified Why- and Why-Not Provenance Through Games}}, venueshort = {TaPP}, year = {2016}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LS16.pdf} }
- An Efficient Implementation of Game Provenance in DBMS
Seokki Lee, Yuchen Tang, Sven Köhler, Bertram Ludäscher and Boris Glavic
Technical Report #IIT/CS-DB-2015-02
Illinois Institute of Technology.
@techreport{LW15a, author = {Lee, Seokki and Tang, Yuchen and K\"{o}hler, Sven and Lud\"{a}scher, Bertram and Glavic, Boris}, date-modified = {2015-10-22 12:15:28 +0000}, institution = {Illinois Institute of Technology}, keywords = {Provenance; Game Provenance; Datalog; GProM; Missing Answers; PUGS}, number = {IIT/CS-DB-2015-02}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf}, projects = {PUGS}, title = {An Efficient Implementation of Game Provenance in DBMS}, venueshort = {Techreport}, year = {2015}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15a.pdf} }
- Automatic Generation and Ranking of Explanations for Mapping Errors
Seokki Lee, Zhen Wang, Boris Glavic and Renée J. Miller
Technical Report #IIT/CS-DB-2015-01
Illinois Institute of Technology.
@techreport{LW15, author = {Lee, Seokki and Wang, Zhen and Glavic, Boris and Miller, Ren\'{e}e J.}, date-modified = {2015-08-08 08:34:28 +0000}, institution = {Illinois Institute of Technology}, keywords = {Provenance; Vagabond; Data Exchange}, number = {IIT/CS-DB-2015-01}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15.pdf}, projects = {Vagabond}, title = {Automatic Generation and Ranking of Explanations for Mapping Errors}, venueshort = {Techreport}, year = {2015}, bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/LW15.pdf} }