Important Dates
- 01/30 Deadline: Form groups
- 01/30 10:00am until 02/01 1:00pm: Select literature review papers
- 04/25 Deadline: Literature review reports due
- 05/04 8:00am-10:00am: Final Exam, SB104
Schedule
The course schedule and linked slides will be updated over time.
For convenience, here are combined versions of all slides and all handouts (6 slides per page).
Date | Topic | Materials |
01/12 | 0. Overview | slides or handout (6 slides per page) |
01/14 | 1. Introduction | slides or handout (6 slides per page) |
01/19 | ||
01/26 | 2. Data Preparation and Cleaning | slides or handout (6 slides per page) |
01/28 | ||
02/02 | ||
02/04 | 3. Schema Matching and Mapping | slides or handout (6 slides per page) |
02/09 | ||
02/11 | ||
02/16 | ||
02/18 | 4. Virtual Data Integration | |
02/23 | ||
02/25 | ||
03/02 | ||
03/04 | 5. Data Exchange | slides or handout (6 slides per page) |
03/09 | ||
03/11 | ||
03/23 | 6. Data Warehousing | |
03/25 | ||
03/30 | ||
04/01 | ||
04/06 | ||
04/08 | 7. Big Data Analytics | |
04/13 | ||
04/15 | ||
04/20 | 8. Data Provenance | |
04/22 | ||
04/27 | ||
04/29 | ||
TBA | Final Exam | Info |
Presentation Schedule
Group | Date | Paper |
1 | 03/02 #1 | Borkar, V., Deshmukh, K., and Sarawagi, S. Automatic segmentation of text into structured records. In ACM SIGMOD Record, volume 30, 175-186, 2001. |
20 | 03/04 #1 | P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746-755, 2007. |
21 | 03/04 #2 | Kukich, K. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377-439, 1992. |
26 | 03/04 #3 | Mohamed Yakout, Laure Berti-Equille, and Ahmed K Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. SIGMOD, 553-564, 2013. |
3 | 03/09 #1 | Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering. 2007;19(1):1–16. |
16 | 03/09 #2 | Monge, A. and Elkan, C. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997. |
12 | 03/11 #1 | Doan, A., Domingos, P., and Levy, A. Y. Learning source descriptions for data integration. In WebDB, 81-86, 2000. |
17 | 03/11 #2 | Bergamaschi, S., Castano, S., and Vincini, M. Semantic integration of semistructured and structured data sources. ACM SIGMOD Record, 28(1):54-59, 1999. |
24 | 03/23 #1 | Goldstein, J. and Larson, P. Å. Optimizing queries using materialized views: a practical, scalable solution. ACM SIGMOD Record, 30(2):331-342, 2001. |
4 | 03/23 #2 | Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. Data Exchange: Semantics and Query Answering. Theoretical Computer Science, 336(1):89-124, 2005. |
6 | 03/25 #1 | Quass, D. and Widom, J. On-line warehouse view maintenance. In ACM SIGMOD Record, volume 26, 393-404, ACM, 1997. |
13 | 03/25 #2 | Muralikrishna, M. Improved Unnesting Algorithms for Join Aggregate SQL Queries. In VLDB '92: Proceedings of the 18th International Conference on Very Large Data Bases, 91-102, 1992. |
18 | 03/30 #1 | Chaudhuri, S. and Narasayya, V. Self-tuning database systems: a decade of progress. In Proceedings of the 33rd International Conference on Very Large Data Bases, 3-14, VLDB Endowment, 2007. |
23 | 03/30 #2 | Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29-53, 1997. |
7 | 04/01 #1 | Vansummeren, S. and Cheney, J. Recording Provenance for SQL Queries and Updates. IEEE Data Engineering Bulletin, 30(4):29-37, 2007. |
8 | 04/01 #2 | Bhagwat, D., Chiticariu, L., Tan, W.-C., and Vijayvargiya, G. An Annotation Management System for Relational Databases. In VLDB '04: Proceedings of the 30th International Conference on Very Large Data Bases, 900-911, 2004. |
9 | 04/06 #1 | Gehani, A. and Tariq, D. SPADE: support for provenance auditing in distributed environments. SRI International, 2011. |
2 | 04/06 #2 | Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation - Volume 6, OSDI, 2004. |
5 | 04/08 #1 | Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., and others. Spanner: Google's globally-distributed database. OSDI, page 1, 2012. |
10 | 04/08 #2 | Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., and Zhuang, L. Nectar: automatic management of data and computation in datacenters. In OSDI, 75-88, 2010. |
11 | 04/13 #1 | Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., and Schad, J. Hadoop++: making a yellow elephant run like a cheetah. PVLDB, 3(1):518-529, 2010. |
14 | 04/13 #2 | Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 135-146, ACM, 2010. |
15 | 04/15 #1 | Lim, H., Herodotou, H., and Babu, S. Stubby: a transformation-based optimizer for MapReduce workflows. Proceedings of the VLDB Endowment, 5(11):1196-1207, 2012. |
19 | 04/15 #2 | Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330-339, 2010. |
22 | 04/20 #1 | Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. Hive - a petabyte scale data warehouse using Hadoop. In Data Engineering (ICDE), 2010 IEEE 26th International Conference on, 996-1005, IEEE, 2010. |
25 | 04/20 #2 | Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinlaender, A., and Markl, V. Applying Stratosphere for big data analytics. In BTW, 507-510, 2013. |