Organization
Students form groups of two. Each group reads a research paper, writes a report, and gives a 15-minute presentation about the paper.
Presentation
The 15-minute presentation will be given in class. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be announced once talks are assigned. Please read the links below on how to give a good presentation.
Report
You will have to write a report that summarizes the content of the paper, explains its main ideas in a way understandable by the other students in the course (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but avoid lengthy and verbose writing as well as reports that are too short to be comprehensible. Again, read some of the links below for ideas on how to write a good paper or report.
All reports are due by the end of the semester on 04/25.
Late policies:
- 1-3 days late: -10% points
- 4-7 days late: -20% points
- > 7 days late: 0 points
Help for writing the report, preparing slides, and giving a talk
How to give a presentation and prepare slides:
- Page giving information on how to give a talk and prepare slides.
- http://www.eecs.berkeley.edu/~messer/Bad_talk.html - Bullet points on how to give a (bad) good talk.
- Other slides on how to give a good talk
- Page on how to write a CS article. Also comments on some general writing rules.
- Simon Peyton Jones slides and video on how to write a great research paper
Literature Review Papers
The papers for the literature review part of the course are listed below. You have until 01/30 to form groups. You will be able to vote on papers from 01/30 10:00am until 02/01 1:00pm. We will send you a link to a form.
Data cleaning and preparation
- Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1999.
- P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746-755, 2007.
- Mohamed Yakout, Laure Berti-Equille, and Ahmed K Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. SIGMOD, 553-564, 2013.
- Shawn R Jeffery, Gustavo Alonso, Michael J Franklin, Wei Hong, and Jennifer Widom. Declarative support for sensor data cleaning. In Pervasive computing, 83-100, Springer, 2006.
- Borkar, V., Deshmukh, K., and Sarawagi, S. Automatic segmentation of text into structured records. In ACM SIGMOD record, volume 30, 175-186, 2001.
- Kukich, K. Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4):377-439, 1992.
Entity resolution
- Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering. 2007;19(1):1–16.
- Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C., and others. Declarative data cleaning: language, model, and algorithms. VLDB, 2001.
- Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3):103-134, 2000.
- Monge, A. and Elkan, C. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD, 1997.
- Cohen, W. W. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems (TOIS), 18(3):288-321, 2000.
- Verykios, V. S., Elmagarmid, A. K., and Houstis, E. N. Automating the approximate record-matching process. Information sciences, 126(1):83-98, 2000.
Data fusion
- Yan, L. L. and Ozsu, M. T. Conflict tolerant queries in aurora. In Cooperative information systems, 1999. coopis, 279-290, IEEE, 1999.
- Greco, S., Pontieri, L., and Zumpano, E. Integrating and managing conflicting data. In Perspectives of system informatics, 349-362, Springer, 2001.
- Liu, X., Dong, X. L., Ooi, B. C., and Srivastava, D. Online data fusion. Proceedings of the VLDB Endowment, 4(11), 2011.
- Dong, X. L., Berti-Equille, L., and Srivastava, D. Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment, 2(1):550-561, 2009.
Schema matching and mapping
- Doan, A., Domingos, P., and Levy, A. Y. Learning source description for data integration. In Webdb, 81-86, 2000.
- Bergamaschi, S., Castano, S., and Vincini, M. Semantic integration of semistructured and structured data sources. ACM Sigmod Record, 28(1):54-59, 1999.
- Larson, J. A., Navathe, S. B., and Elmasri, R. A theory of attribute equivalence in databases with application to schema integration. Software Engineering, IEEE Transactions on, 15(4):449-463, 1989.
- Doan, A., Domingos, P., and Halevy, A. Y. Reconciling schemas of disparate data sources: a machine-learning approach. In ACM SIGMOD record, volume 30, 509-520, ACM, 2001.
- Embley, D. W., Jackman, D., and Xu, L. Multifaceted exploitation of metadata for attribute match discovery in information integration. In Workshop on information integration on the web, 110-117, 2001.
- Melnik, S., Garcia-Molina, H., and Rahm, E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Data engineering, 2002. proceedings. 18th international conference on, 117-128, IEEE, 2002.
- Madhavan, J., Bernstein, P. A., and Rahm, E. Generic Schema Matching with Cupid. The VLDB Journal, page 49-58, 2001.
- Fuxman, A., Hernandez, M. A., Ho, H., Miller, R. J., Papotti, P., and Popa, L. Nested mappings: schema mapping reloaded. In Proceedings of the 32nd international conference on very large data bases, 67-78, VLDB Endowment, 2006.
Query answering with views and virtual data integration
- Levy, A., Rajaraman, A., and Ordille, J. Querying heterogeneous information sources using source descriptions. 1996.
- Pottinger, R. and Halevy, A. Minicon: a scalable algorithm for answering queries using views. The VLDB Journal, 10(2):182-198, 2001.
- Goldstein, J. and Larson, P. Å. Optimizing queries using materialized views: a practical, scalable solution. ACM SIGMOD Record, 30(2):331-342, 2001.
- Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., Tan, M., and others. Semantic data caching and replacement. In VLDB, volume 96, 330-341, 1996.
- Friedman, M. and Weld, D. S. Efficiently executing information-gathering plans. In Proc. of the int. joint conf. of AI, 1997.
- Wiederhold, G. Mediators in the architecture of future information systems. Computer, 25(3):38-49, 1992.
- Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M., Naggar, P., and Vernacotola, F. Ibis: semantic data integration at work. In Advanced information systems engineering, 79-94, Springer, 2003.
Data exchange
- Karvounarakis, G., Green, T. J., Ives, Z. G., and Tannen, V. Collaborative data sharing via update exchange and provenance. ACM Transactions on Database Systems (TODS), 38(3):19, 2013.
- Fagin, R., Kolaitis, P. G., and Popa, L. Data Exchange: Getting to the Core. ACM Transactions on Database Systems (TODS), 30(1):174-210, 2005.
- Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. Data Exchange: Semantics and Query Answering. Theoretical Computer Science, 336(1):89-124, 2005.
- Fuxman, A., Kolaitis, P. G., Miller, R. J., and Tan, W. C. Peer data exchange. ACM Transactions on Database Systems (TODS), 31(4):1454-1498, 2006.
- Mecca, G., Papotti, P., Raunich, S., and Buoncristiano, M. Concise and expressive mappings with +Spicy. Proceedings of the VLDB Endowment, 2(2):1582-1585, 2009.
- Miller, R. J., Haas, L. M., Hernández, M. A., Fagin, R., Popa, L., and Velegrakis, Y. Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications, page 236, 2009.
Data warehousing
- Yang, J., Karlapalem, K., and Li, Q. Algorithms for materialized view design in data warehousing environment. In VLDB, volume 97, 136-145, 1997.
- Quass, D. and Widom, J. On-line warehouse view maintenance. In ACM SIGMOD record, volume 26, 393-404, ACM, 1997.
- Gray, J., Chaudhuri, S., Bosworth, A., Layman, A., Reichart, D., Venkatrao, M., Pellow, F., and Pirahesh, H. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1(1):29-53, 1997.
- Chaudhuri, S. and Narasayya, V. Self-tuning database systems: a decade of progress. In Proceedings of the 33rd international conference on very large data bases, 3-14, VLDB Endowment, 2007.
- Chan, C.-Y. and Ioannidis, Y. E. Bitmap index design and evaluation. ACM SIGMOD Record, 27(2):355-366, 1998.
- Wu, K., Otoo, E. J., and Shoshani, A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1):1-38, 2006.
- Muralikrishna, M. Improved Unnesting Algorithms for Join Aggregate SQL Queries. In VLDB '92: proceedings of the 18th international conference on very large data bases, 91-102, 1992.
Big Data analytics
- Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., Liu, H., and Murthy, R. Hive - a petabyte scale data warehouse using Hadoop. In Data engineering (ICDE), 2010 IEEE 26th international conference on, 996-1005, IEEE, 2010.
- Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty, V., and Schad, J. Hadoop++: making a yellow elephant run like a cheetah. PVLDB, 3(1):518-529, 2010.
- Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on symposium on operating systems design and implementation - volume 6, OSDI, 2004.
- Lim, H., Herodotou, H., and Babu, S. Stubby: a transformation-based optimizer for MapReduce workflows. Proceedings of the VLDB Endowment, 5(11):1196-1207, 2012.
- Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., and Vassilakis, T. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330-339, 2010.
- Corbett, J. C., Dean, J., Epstein, M., Fikes, A., Frost, C., Furman, J., Ghemawat, S., Gubarev, A., Heiser, C., Hochschild, P., and others. Spanner: Google's globally-distributed database. OSDI, page 1, 2012.
- Leich, M., Adamek, J., Schubotz, M., Heise, A., Rheinlaender, A., and Markl, V. Applying Stratosphere for big data analytics. In BTW, 507-510, 2013.
- Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD international conference on management of data, 135-146, ACM, 2010.
- Gunda, P. K., Ravindranath, L., Thekkath, C. A., Yu, Y., and Zhuang, L. Nectar: automatic management of data and computation in datacenters. In OSDI, 75-88, 2010.
Provenance
- Kohler, S., Ludascher, B., and Zinn, D. First-order provenance games. In In search of elegance in the theory and practice of computation, page 382-399. Springer, 2013.
- Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. The Complexity of Causality and Responsibility for Query Answers and non-Answers. Proceedings of the VLDB Endowment, 4(1):34-45, 2010.
- Bhagwat, D., Chiticariu, L., Tan, W.-C., and Vijayvargiya, G. An Annotation Management System for Relational Databases. In VLDB '04: proceedings of the 30th international conference on very large data bases, 900-911, 2004.
- Xiao, D. and Eltabakh, M. Y. Insightnotes: summary-based annotation management in relational databases. 2014.
- Gehani, A. and Tariq, D. Spade: support for provenance auditing in distributed environments. SRI International, 2011.
- Gehani, A. and Tariq, D. Provenance integration. In 6th USENIX workshop on the theory and practice of provenance (TaPP), 2014.
- Widom, J. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, page 113-148, 2008.
- Geerts, F., Kementsietsidis, A., and Milano, D. Mondrian: annotating and querying databases through colors and blocks. In Data engineering, 2006. ICDE. proceedings of the 22nd international conference on, 82-82, IEEE, 2006.
- Bidoit, N., Herschel, M., Tzompanaki, K., and others. Query-based why-not provenance with NedExplain. In Extending database technology (EDBT), 2014.
- Stamatogiannakis, M., Groth, P., and Bos, H. Looking inside the black-box: capturing data provenance using dynamic instrumentation. In TaPP, 2014.
- Vansummeren, S. and Cheney, J. Recording Provenance for SQL Queries and Updates. IEEE Data Engineering Bulletin, 30(4):29-37, 2007.