CS 520 - reviews

Organization

Students have to form groups of TBA. Each group will read a research paper, write a report, and give a 20-minute presentation about the paper.

Presentation

The presentations will be given in a single block session on April 23 in room SB 111. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be announced once talks are assigned. Please read the links below on how to give a good presentation.

Schedule:

09am-12pm: first paper session
12pm-01pm: lunch break
01pm-06pm: second paper session

Detailed schedule:

Time	Group	Title
		Data cleaning and preprocessing
09:00	1	Automating the approximate record-matching process
09:20	8	Text classification from labeled and unlabeled documents using EM
09:40	11	NADEEF: a commodity data cleaning system
10:00	9	Combining quantitative and logical data cleaning
10:20	18	Declarative data cleaning: language, model, and algorithms
10:40	19	Descriptive and prescriptive data cleaning
11:00	23	Declarative support for sensor data cleaning
		Integration, Matching, and Mappings
11:20	14	Efficiently executing information-gathering plans
11:40	7	Clio: Schema Mapping Creation and Data Exchange
12:00		Lunch break
01:00	5	Integrating conflicting data: the role of source dependence
01:20	4	Generic Schema Matching with Cupid
		Data Warehousing
01:40	10	Lenses: an on-demand approach to ETL
02:00	16	Algorithms for materialized view design in data warehousing environment
02:20	3	On-line warehouse view maintenance
		Big Data
02:40	6	AsterixDB: a scalable, open source BDMS
03:00	12	All roads lead to Rome: optimistic recovery for distributed iterative data processing
03:20	15	On the design and scalability of distributed shared-data databases
03:40	17	Spinning fast iterative data flows
04:00	21	A practical scalable distributed B-tree
04:20	20	Hyracks: a flexible and extensible foundation for data-intensive computing
04:40	2	Spark SQL: relational data processing in Spark
		Provenance
05:00	22	Looking inside the black-box: capturing data provenance using dynamic instrumentation
05:20	13	Trio: A System for Managing Data, Uncertainty, and Lineage

Report

You will have to write a report that summarizes the content of the paper, explains its main ideas in a way understandable by the other students in the course (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but avoid lengthy and verbose writing as well as short and incomprehensible reports. Again, read some of the links below to get some ideas about how to write a good paper or report.

Late policies:

1-3 days late: -10% points
4-7 days late: -20% points
> 7 days late: 0 points
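As a rough illustration of the late policy above, the sketch below computes the points kept for a given delay. It rests on assumptions not stated on this page: that the deduction is taken as a percentage of the points the report earned, and that lateness is counted in whole days. The function name `apply_late_penalty` is hypothetical.

```python
def apply_late_penalty(score: float, days_late: int) -> float:
    """Return the points kept after the late deduction.

    Assumption: the percentage is deducted from the earned score,
    and days_late is a whole number of days (0 means on time).
    """
    if days_late < 0:
        raise ValueError("days_late must not be negative")
    if days_late == 0:
        return float(score)          # on time: no deduction
    if days_late <= 3:
        return score * 90 / 100      # 1-3 days late: -10%
    if days_late <= 7:
        return score * 80 / 100      # 4-7 days late: -20%
    return 0.0                       # more than 7 days late: 0 points


if __name__ == "__main__":
    for days in (0, 2, 5, 8):
        print(days, apply_late_penalty(100, days))
```

Under these assumptions, a report worth 100 points that is handed in 5 days late would keep 80 points.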

Time schedule

We expect the deliverables according to the following deadlines:

02/22 - Read the paper and determine the structure of the report and meet with Prof./TA to discuss structure
03/28 - Initial draft of the reports due and first review meeting with Prof./TA
04/15 - First draft of slides due and slide review meeting with Prof./TA
04/18 - Final reports due

Help for writing the report, preparing slides, and giving a talk

How to give a presentation and prepare slides:

Page giving information on how to give a talk and prepare slides.
http://www.eecs.berkeley.edu/~messer/Bad_talk.html - Bullet points on how to give a (bad) good talk.
Other slides on how to give a good talk

How to write a scientific article:

Page on how to write a CS article. Also comments on some general writing rules.
Simon Peyton Jones slides and video on how to write a great research paper

Literature Review Papers

The papers for the literature review part of the course are listed below. You have until 01/30 to form groups. You will be able to vote on papers from 01/30 10:00am until 02/01 1:00pm. We will send you a link to a form.

Data cleaning and preparation

Jeffery, S. R., Alonso, G., Franklin, M. J., Hong, W., and Widom, J. Declarative support for sensor data cleaning. In Pervasive Computing, 83-100, Springer, 2006.
Chalamalla, A., Ilyas, I. F., Ouzzani, M., and Papotti, P. Descriptive and prescriptive data cleaning. In SIGMOD, 2014.
Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., and Tang, N. NADEEF: a commodity data cleaning system. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 541-552, ACM, 2013.
Prokoshyna, N., Szlichta, J., Chiang, F., Miller, R. J., and Srivastava, D. Combining quantitative and logical data cleaning. 2015.
Wang, J. and Tang, N. Towards dependable data repairing with fixing rules. SIGMOD, 2014.

Entity resolution

Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C., and others. Declarative data cleaning: language, model, and algorithms. VLDB, 2001.
Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3):103-134, 2000.
Cohen, W. W. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems (TOIS), 18(3):288-321, 2000.
Verykios, V. S., Elmagarmid, A. K., and Houstis, E. N. Automating the approximate record-matching process. Information Sciences, 126(1):83-98, 2000.

Data fusion

Yan, L. L. and Ozsu, M. T. Conflict tolerant queries in Aurora. In Cooperative Information Systems (CoopIS), 279-290, IEEE, 1999.
Greco, S., Pontieri, L., and Zumpano, E. Integrating and managing conflicting data. In Perspectives of System Informatics, 349-362, Springer, 2001.
Liu, X., Dong, X. L., Ooi, B. C., and Srivastava, D. Online data fusion. Proceedings of the VLDB Endowment, 4(11), 2011.
Dong, X. L., Berti-Equille, L., and Srivastava, D. Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment, 2(1):550-561, 2009.

Schema matching and mapping

Larson, J. A., Navathe, S. B., and Elmasri, R. A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4):449-463, 1989.
Doan, A., Domingos, P., and Halevy, A. Y. Reconciling schemas of disparate data sources: a machine-learning approach. In ACM SIGMOD Record, volume 30, 509-520, ACM, 2001.
Embley, D. W., Jackman, D., and Xu, L. Multifaceted exploitation of metadata for attribute match discovery in information integration. In Workshop on Information Integration on the Web, 110-117, 2001.
Melnik, S., Garcia-Molina, H., and Rahm, E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Proceedings of the 18th International Conference on Data Engineering, 117-128, IEEE, 2002.
Madhavan, J., Bernstein, P. A., and Rahm, E. Generic Schema Matching with Cupid. The VLDB Journal, pages 49-58, 2001.
Fuxman, A., Hernandez, M. A., Ho, H., Miller, R. J., Papotti, P., and Popa, L. Nested mappings: schema mapping reloaded. In Proceedings of the 32nd International Conference on Very Large Data Bases, 67-78, VLDB Endowment, 2006.

Query answering with views and virtual data integration

Levy, A., Rajaraman, A., and Ordille, J. Querying heterogeneous information sources using source descriptions. 1996.
Pottinger, R. and Halevy, A. MiniCon: a scalable algorithm for answering queries using views. The VLDB Journal, 10(2):182-198, 2001.
Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., Tan, M., and others. Semantic data caching and replacement. In VLDB, volume 96, 330-341, 1996.
Friedman, M. and Weld, D. S. Efficiently executing information-gathering plans. In Proc. of the Int. Joint Conf. on AI, 1997.
Wiederhold, G. Mediators in the architecture of future information systems. Computer, 25(3):38-49, 1992.
Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M., Naggar, P., and Vernacotola, F. IBIS: semantic data integration at work. In Advanced Information Systems Engineering, 79-94, Springer, 2003.

Data exchange

Karvounarakis, G., Green, T. J., Ives, Z. G., and Tannen, V. Collaborative data sharing via update exchange and provenance. ACM Transactions on Database Systems (TODS), 38(3):19, 2013.
Fagin, R., Kolaitis, P. G., and Popa, L. Data Exchange: Getting to the Core. ACM Transactions on Database Systems (TODS), 30(1):174-210, 2005.
Fuxman, A., Kolaitis, P. G., Miller, R. J., and Tan, W. C. Peer data exchange. ACM Transactions on Database Systems (TODS), 31(4):1454-1498, 2006.
Mecca, G., Papotti, P., Raunich, S., and Buoncristiano, M. Concise and expressive mappings with +Spicy. Proceedings of the VLDB Endowment, 2(2):1582-1585, 2009.
Miller, R. J., Haas, L. M., Hernández, M. A., Fagin, R., Popa, L., and Velegrakis, Y. Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications, page 236, 2009.

Data warehousing

Yang, J., Karlapalem, K., and Li, Q. Algorithms for materialized view design in data warehousing environment. In VLDB, volume 97, 136-145, 1997.
Chan, C.-Y. and Ioannidis, Y. E. Bitmap index design and evaluation. ACM SIGMOD Record, 27(2):355-366, 1998.
Wu, K., Otoo, E. J., and Shoshani, A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1):1-38, 2006.
Klonatos, Y., Koch, C., Rompf, T., and Chafi, H. Building efficient query engines in a high-level language. Proceedings of the VLDB Endowment, 7(10):853-864, 2014.
Kaufmann, M., Manjili, A. A., Vagenas, P., Fischer, P. M., Kossmann, D., Faerber, F., and May, N. Timeline index: a unified data structure for processing queries on temporal data in SAP HANA. In Proceedings of the 2013 International Conference on Management of Data, 1173-1184, ACM, 2013.
Yang, Y., Meneghetti, N., Fehling, R., Liu, Z. H., and Kennedy, O. Lenses: an on-demand approach to ETL. Proceedings of the VLDB Endowment, 8(12):1578-1589, 2015.
Papadomanolakis, S. and Ailamaki, A. AutoPart: Automating schema design for large scientific databases using data partitioning. 2004.
Quass, D. and Widom, J. On-line warehouse view maintenance. In ACM SIGMOD Record, volume 26, 393-404, ACM, 1997.
Agrawal, D., El Abbadi, A., Singh, A., and Yurek, T. Efficient view maintenance at data warehouses. In ACM SIGMOD Record, volume 26, 417-427, ACM, 1997.

Big Data analytics

Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V., Bu, Y., Carey, M., Cetindil, I., Cheelangi, M., Faraaz, K., and others. AsterixDB: a scalable, open source BDMS. Proceedings of the VLDB Endowment, 7(14):1905-1916, 2014.
Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. Hyracks: a flexible and extensible foundation for data-intensive computing. In 2011 IEEE 27th International Conference on Data Engineering (ICDE), 1151-1162, IEEE, 2011.
Aguilera, M. K., Golab, W., and Shah, M. A. A practical scalable distributed B-tree. Proc. VLDB Endow., 1(1):598-609, August 2008.
Schelter, S., Ewen, S., Tzoumas, K., and Markl, V. All roads lead to Rome: optimistic recovery for distributed iterative data processing. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, 1919-1928, ACM, 2013.
Ewen, S., Tzoumas, K., Kaufmann, M., and Markl, V. Spinning fast iterative data flows. Proceedings of the VLDB Endowment, 5(11):1268-1279, 2012.
Binnig, C., Hildenbrand, S., Färber, F., Kossmann, D., Lee, J., and May, N. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally. The VLDB Journal, 23(6):987-1011, 2014.
Loesing, S., Pilman, M., Etter, T., and Kossmann, D. On the design and scalability of distributed shared-data databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 663-676, ACM, 2015.
Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. Spark SQL: relational data processing in Spark. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 1383-1394, ACM, 2015.

Provenance

Kohler, S., Ludaescher, B., and Zinn, D. First-order provenance games. In In Search of Elegance in the Theory and Practice of Computation, pages 382-399. Springer, 2013.
Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. The Complexity of Causality and Responsibility for Query Answers and non-Answers. Proceedings of the VLDB Endowment, 4(1):34-45, 2010.
Xiao, D. and Eltabakh, M. Y. InsightNotes: summary-based annotation management in relational databases. 2014.
Gehani, A. and Tariq, D. Provenance integration. In 6th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2014.
Widom, J. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, pages 113-148, 2008.
Geerts, F., Kementsietsidis, A., and Milano, D. Mondrian: annotating and querying databases through colors and blocks. In Proceedings of the 22nd International Conference on Data Engineering (ICDE), 82-82, IEEE, 2006.
Bidoit, N., Herschel, M., Tzompanaki, K., and others. Query-based why-not provenance with NedExplain. In Extending Database Technology (EDBT), 2014.
Stamatogiannakis, M., Groth, P., and Bos, H. Looking inside the black-box: capturing data provenance using dynamic instrumentation. In TaPP, 2014.
Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Millstein, T., and Condie, T. Titian: data provenance support in Spark. PVLDB, 9(3), 2016.
Anand, M. K., Bowers, S., McPhillips, T., and Ludaescher, B. Efficient Provenance Storage over Nested Data Collections. In EDBT: Proceedings of the 12th International Conference on Extending Database Technology, 958-969, 2009.
Chirigati, F. S., Shasha, D., and Freire, J. ReproZip: using provenance to support computational reproducibility. In TaPP, 2013.

CS 520: Data Integration, Warehousing, and Provenance - 2016 Spring