CS 520 - schedule

Important Dates

01/30 Deadline: Form groups
01/30 10:00am until 02/01 1:00pm Select literature review papers

04/25 Deadline: Literature review reports due

05/04 8:00am-10:00am Final Exam, SB104

Schedule

The course schedule and linked slides will be updated over time.

For convenience, here are combined versions of all slides and all handouts (6 slides per page).

01/12 0. Overview slides or handout (6 slides per page)

01/14 1. Introduction slides or handout (6 slides per page)

01/19

01/26 2. Data Preparation and Cleaning slides or handout (6 slides per page)

01/28

02/02

02/04 3. Schema Matching and Mapping slides or handout (6 slides per page)

02/09

02/11

02/16

02/18 4. Virtual Data Integration

02/23

02/25

03/02

03/04 5. Data Exchange slides or handout (6 slides per page)

03/09

03/11

03/23 6. Data Warehousing

03/25

03/30

04/01

04/06

04/08 7. Big Data Analytics

04/13

04/15

04/20 8. Data Provenance

04/22

04/27

04/29

TBA Final Exam Info

Presentation Schedule

Group Date Paper

1 03/02 #1 Borkar, V., Deshmukh, K., and Sarawagi, S.. Automatic segmentation of text into structured records. In ACM SIGMOD record, volume 30, 175-186, 2001.

20 03/04 #1 P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746-755, 2007.

21 03/04 #2 Kukich, K.. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377-439, 1992.

26 03/04 #3 Mohamed Yakout, Laure Berti-Equille, and Ahmed K Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. SIGMOD, 553-564, 2013.

3 03/09 #1 Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering. 2007;19(1):1–16.

16 03/09 #2 Monge, A. and Elkan, C. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997.

12 03/11 #1 Doan, A., Domingos, P., and Levy, A. Y. Learning source description for data integration. In Webdb, 81-86, 2000.

17 03/11 #2 Bergamaschi, S., Castano, S., and Vincini, M. Semantic integration of semistructured and structured data sources. ACM Sigmod Record, 28(1):54-59, 1999.

24 03/23 #1 Goldstein, J. and Larson, P. Å.. Optimizing queries using materialized views: a practical, scalable solution. ACM SIGMOD Record, 30(2):331-342, 2001.

4 03/23 #2 Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L.. Data Exchange: Semantics and Query Answering. Theoretical Computer Science, 336(1):89-124, 2005.

6 03/25 #1 Quass, D. and Widom, J.. On-line warehouse view maintenance. In Acm sigmod record, volume 26, 393-404, ACM, 1997.

13 03/25 #2 Muralikrishna, M.. Improved Unnesting Algorithms for Join Aggregate SQL Queries. In VLDB '92: proceedings of the 18th international conference on very large data bases, 91-102, 1992.

18 03/30 #1 Chaudhuri, S. and Narasayya, V.. Self-tuning database systems: a decade of progress. In Proceedings of the 33rd international conference on very large data bases, 3-14, VLDB Endowment, 2007.

23 03/30 #2 Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H.. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.

7 04/01 #1 Vansummeren, S. and Cheney, J.. Recording Provenance for SQL Queries and Updates. IEEE Data Engineering Bulletin, 30(4):29--37, 2007.

8 04/01 #2 Bhagwat, D., Chiticariu, L., Tan, W.-C., and Vijayvargiya, G.. An Annotation Management System for Relational Databases. In Vldb '04: proceedings of the 30th international conference on very large data bases, 900--911, 2004.

9 04/06 #1 Gehani, A. and Tariq, D.. Spade: support for provenance auditing in distributed environments. SRI International, 2011.

2 04/06 #2 Dean, J. and Ghemawat, S.. Mapreduce: simplified data processing on large clusters. In Proceedings of the 6th conference on symposium on operating systems design and implementation - volume 6, OSDI, 2004.

5 04/08 #1 Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., and others. Spanner: google's globally-distributed database. OSDI, page 1, 2012.

10 04/08 #2 Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., and Zhuang, L.. Nectar: automatic management of data and computation in datacenters.. In OSDI, 75--88, 2010.

11 04/13 #1 Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., and Schad, J.. Hadoop++: making a yellow elephant run like a cheetah. PVLDB, 3(1):518-529, 2010.

14 04/13 #2 Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G.. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 acm sigmod international conference on management of data, 135--146, ACM, 2010.

15 04/15 #1 Lim, H., Herodotou, H., and Babu, S.. Stubby: a transformation-based optimizer for mapreduce workflows. Proceedings of the VLDB Endowment, 5(11):1196--1207, 2012.

19 04/15 #2 Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T.. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330--339, 2010.

22 04/20 #1 Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R.. Hive-a petabyte scale data warehouse using hadoop. In Data engineering (icde), 2010 ieee 26th international conference on, 996--1005, IEEE, 2010.

25 04/20 #2 Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinlaender, A., and Markl, V.. Applying stratosphere for big data analytics.. In Btw, 507--510, 2013.

CS 520: Data Integration, Warehousing, and Provenance - 2015 Spring
