iBench
The study of data integration is as old as the field of data management. Given the maturity of the area, it is surprising that rigorous empirical evaluations of research ideas are so scarce. We argue that a stronger focus on empirical work would benefit the integration community as a whole, and we identify one major roadblock: the lack of comprehensive benchmarks, scenario generators, and publicly available implementations of quality measures. This makes it difficult to compare integration solutions, assess their generality, and understand their performance across application scenarios. Based on this observation we discuss the requirements for such benchmarks. We argue that the major abstractions used in reasoning about integration problems have undergone a substantial convergence in the last decade and that this convergence is enabling the application of robust empirical methods to integration problems.
In the iBench project we are developing an open source metadata generator (available on GitHub) for creating arbitrarily large and complex mappings, schemas, and schema constraints. iBench can be combined with a data generator to efficiently produce realistic data integration scenarios of varying size and complexity. It can be used to create benchmarks for many integration tasks, including (virtual) data integration, data exchange, schema evolution, mapping operators such as composition and inversion, and schema matching.
Our first prototype implementation is based on STBenchmark, the first benchmark for schema mapping systems. Given a configuration file, the benchmark generates a complete mapping scenario (schemas, data, and mappings) by combining randomized instances of mapping primitives (e.g., vertical partitioning or de-normalization) into a complex scenario. As a first step, we have addressed several shortcomings of this benchmark (e.g., no sharing of schema elements between mapping primitives and no support for logical mapping languages such as st-tgds or SO tgds). Noteworthy new features are:
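For intuition, a mapping primitive like vertical partitioning can be written as a source-to-target tgd roughly like the following (a schematic illustration of the general technique, not iBench's exact output; the relation and attribute names are made up for this example):

```latex
% Vertical partitioning: a source relation S(a, b, c) is split into
% two target relations that share the partitioning attribute a.
\forall a, b, c \; \big( S(a, b, c) \rightarrow T_1(a, b) \land T_2(a, c) \big)
```

iBench instantiates many randomized copies of such primitives and composes them into one large scenario.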
- Support for generating st-tgds and SO tgds.
- Support for arbitrary Skolem functions (SO tgds) and various Skolemization modes (ways to generate Skolem arguments).
- Simulation of some cases of mapping composition (in terms of generated Skolem arguments).
- Sharing of source and target schema elements between multiple instances of mapping primitives.
- Generation of functional dependencies (including primary keys).
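To make the distinction between the two supported mapping languages concrete, here is a sketch (illustrative formulas with invented relation names, not tied to a particular iBench primitive). An st-tgd uses existential quantification to invent values, while an SO tgd names the invented value explicitly via a Skolem function whose argument list iBench can vary:

```latex
% st-tgd: the existential variable i stands for an invented value.
\forall e, d \; \big( \mathit{Emp}(e, d) \rightarrow \exists i \; \mathit{Dept}(d, i) \big)

% Corresponding SO tgd: the invented value is produced by a Skolem
% function f; choosing f's arguments is what the Skolemization
% modes listed above control.
\exists f \; \forall e, d \; \big( \mathit{Emp}(e, d) \rightarrow \mathit{Dept}(d, f(d)) \big)
```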
Links
- Official iBench webpage - University of Toronto
- Github repository
- Public repository for configuration files and integration scenarios
Collaborators
- Patricia C. Arocena - TD
- Radu Ciucanu - Université d'Orléans
- Renée J. Miller - Northeastern University
Publications
- Benchmarking Data Curation Systems
  Patricia C. Arocena, Boris Glavic, Giansalvatore Mecca, Renée J. Miller, Paolo Papotti and Donatello Santoro
  IEEE Data Engineering Bulletin, 39(2), 2016, pp. 47–62.
- The iBench Integration Metadata Generator
  Patricia C. Arocena, Boris Glavic, Radu Ciucanu and Renée J. Miller
  Technical report, University of Toronto, 2015.
- Gain Control over your Integration Evaluations
  Patricia C. Arocena, Radu Ciucanu, Boris Glavic and Renée J. Miller
  Proceedings of the VLDB Endowment (Demonstration Track), 8(12), 2015, pp. 1960–1971.
- The iBench Integration Metadata Generator
  Patricia C. Arocena, Boris Glavic, Radu Ciucanu and Renée J. Miller
  Proceedings of the VLDB Endowment, 9(3), 2015, pp. 108–119.
- iBench First Cut
  Patricia C. Arocena, Mariana D'Angelo, Boris Glavic and Renée J. Miller
  Technical report, University of Toronto, 2013.
- Value Invention for Data Exchange
  Patricia C. Arocena, Boris Glavic and Renée J. Miller
  Proceedings of the 39th International Conference on Management of Data (SIGMOD 2013), pp. 157–168.
The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the unskolemization algorithm of Nash, Bernstein, and Melnik by determining when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
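The linearization idea can be sketched as follows (a schematic example under assumed relation names, not a formula taken from the paper). When the argument sets of the Skolem functions are linearly ordered by inclusion, the functions can be replaced by nested existential quantifiers, yielding an FO mapping:

```latex
% SO tgd with two Skolem functions whose argument sets are nested:
% args(f) = {a} is a subset of args(g) = {a, b}.
\exists f, g \; \forall a, b \; \big( S(a, b) \rightarrow T(a, f(a), g(a, b)) \big)

% Because the argument sets form a chain under inclusion, the mapping
% can be de-Skolemized into a first-order sentence: x depends only on
% a (like f), while y may depend on both a and b (like g).
\forall a \; \exists x \; \forall b \; \exists y \; \big( S(a, b) \rightarrow T(a, x, y) \big)
```

If the argument sets were incomparable (e.g., {a} and {b}), no such quantifier ordering would exist, which is the situation the linearization test rules out.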