CS 520: Data Integration, Warehousing, and Provenance - 2015 Spring

Important Dates

Schedule

The course schedule and linked slides will be updated over time.

For convenience, here are combined versions of all slides and all handouts (6 slides per page).

Date   Topic                               Materials
01/12  0. Overview                         slides or handout (6 slides per page)
01/14  1. Introduction                     slides or handout (6 slides per page)
01/19
01/26  2. Data Preparation and Cleaning    slides or handout (6 slides per page)
01/28
02/02
02/04  3. Schema Matching and Mapping      slides or handout (6 slides per page)
02/09
02/11
02/16
02/18  4. Virtual Data Integration
02/23
02/25
03/02
03/04  5. Data Exchange                    slides or handout (6 slides per page)
03/09
03/11
03/23  6. Data Warehousing
03/25
03/30
04/01
04/06
04/08  7. Big Data Analytics
04/13
04/15
04/20  8. Data Provenance
04/22
04/27
04/29
TBA    Final Exam                          Info

Presentation Schedule

Group  Date      Paper
1      03/02 #1  Borkar, V., Deshmukh, K., and Sarawagi, S. Automatic segmentation of text into structured records. In ACM SIGMOD Record, volume 30, 175-186, 2001.
20     03/04 #1  P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746-755, 2007.
21     03/04 #2  Kukich, K. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377-439, 1992.
26     03/04 #3  Mohamed Yakout, Laure Berti-Equille, and Ahmed K Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. SIGMOD, 553-564, 2013.
3      03/09 #1  Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering. 2007;19(1):1–16.
16     03/09 #2  Monge, A. and Elkan, C. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997.
12     03/11 #1  Doan, A., Domingos, P., and Levy, A. Y. Learning source description for data integration. In WebDB, 81-86, 2000.
17     03/11 #2  Bergamaschi, S., Castano, S., and Vincini, M. Semantic integration of semistructured and structured data sources. ACM SIGMOD Record, 28(1):54-59, 1999.
24     03/23 #1  Goldstein, J. and Larson, P. Å. Optimizing queries using materialized views: a practical, scalable solution. ACM SIGMOD Record, 30(2):331-342, 2001.
4      03/23 #2  Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. Data Exchange: Semantics and Query Answering. Theoretical Computer Science, 336(1):89-124, 2005.
6      03/25 #1  Quass, D. and Widom, J. On-line warehouse view maintenance. In ACM SIGMOD Record, volume 26, 393-404, ACM, 1997.
13     03/25 #2  Muralikrishna, M. Improved Unnesting Algorithms for Join Aggregate SQL Queries. In VLDB '92: Proceedings of the 18th International Conference on Very Large Data Bases, 91-102, 1992.
18     03/30 #1  Chaudhuri, S. and Narasayya, V. Self-tuning database systems: a decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, 3-14, VLDB Endowment, 2007.
23     03/30 #2  Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.
7      04/01 #1  Vansummeren, S. and Cheney, J. Recording Provenance for SQL Queries and Updates. IEEE Data Engineering Bulletin, 30(4):29-37, 2007.
8      04/01 #2  Bhagwat, D., Chiticariu, L., Tan, W.-C., and Vijayvargiya, G. An Annotation Management System for Relational Databases. In VLDB '04: Proceedings of the 30th International Conference on Very Large Data Bases, 900-911, 2004.
9      04/06 #1  Gehani, A. and Tariq, D. SPADE: support for provenance auditing in distributed environments. SRI International, 2011.
2      04/06 #2  Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6, OSDI, 2004.
5      04/08 #1  Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., and others. Spanner: Google's globally-distributed database. OSDI, page 1, 2012.
10     04/08 #2  Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., and Zhuang, L. Nectar: automatic management of data and computation in datacenters. In OSDI, 75-88, 2010.
11     04/13 #1  Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., and Schad, J. Hadoop++: making a yellow elephant run like a cheetah. PVLDB, 3(1):518-529, 2010.
14     04/13 #2  Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 135-146, ACM, 2010.
15     04/15 #1  Lim, H., Herodotou, H., and Babu, S. Stubby: a transformation-based optimizer for MapReduce workflows. Proceedings of the VLDB Endowment, 5(11):1196-1207, 2012.
19     04/15 #2  Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330-339, 2010.
22     04/20 #1  Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. Hive - a petabyte scale data warehouse using Hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 996-1005, IEEE, 2010.
25     04/20 #2  Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinlaender, A., and Markl, V. Applying Stratosphere for big data analytics. In BTW, 507-510, 2013.