Organization
Students have to form groups of TBA, and each group will read a research paper, write a report, and give a 20-minute presentation about the paper.
Presentation
The presentations will be given in a single block session on April 23 in room SB 111. You have to email your slides (e.g., PowerPoint or LaTeX) to the TA or instructor the night before your talk. The schedule for talks will be given once talks are assigned. Please read the links below on how to give a good presentation.
Schedule:
- 09am-12pm: first paper session
- 12pm-01pm: lunch break
- 01pm-06pm: second paper session
Detailed schedule:
Time | Group | Title |
---|---|---|
Data cleaning and preprocessing | ||
09:00 | 1 | Automating the approximate record-matching process |
09:20 | 8 | Text classification from labeled and unlabeled documents using EM |
09:40 | 11 | NADEEF: a commodity data cleaning system |
10:00 | 9 | Combining quantitative and logical data cleaning |
10:20 | 18 | Declarative data cleaning: language, model, and algorithms |
10:40 | 19 | Descriptive and prescriptive data cleaning |
11:00 | 23 | Declarative support for sensor data cleaning |
Integration, Matching, and Mappings | ||
11:20 | 14 | Efficiently executing information-gathering plans |
11:40 | 7 | Clio: Schema Mapping Creation and Data Exchange |
12:00 | Lunch break | |
01:00 | 5 | Integrating conflicting data: the role of source dependence |
01:20 | 4 | Generic Schema Matching with Cupid |
Data Warehousing | ||
01:40 | 10 | Lenses: an on-demand approach to ETL |
02:00 | 16 | Algorithms for materialized view design in data warehousing environment |
02:20 | 3 | On-line warehouse view maintenance |
Big Data | ||
02:40 | 6 | AsterixDB: a scalable, open source BDMS |
03:00 | 12 | All roads lead to Rome: optimistic recovery for distributed iterative data processing |
03:20 | 15 | On the design and scalability of distributed shared-data databases |
03:40 | 17 | Spinning fast iterative data flows |
04:00 | 21 | A practical scalable distributed B-tree |
04:20 | 20 | Hyracks: a flexible and extensible foundation for data-intensive computing |
04:40 | 2 | Spark SQL: relational data processing in Spark |
Provenance | ||
05:00 | 22 | Looking inside the black-box: capturing data provenance using dynamic instrumentation |
05:20 | 13 | Trio: A System for Managing Data, Uncertainty, and Lineage |
Report
You will have to write a report that summarizes the content of the paper, explains its main ideas in a way understandable by the other students in the course (they should not have to read another 20 papers to understand what you are writing about), and gives an objective critique of the presented methods or systems. There is no page limit, but try to avoid lengthy and verbose writing as well as short and incomprehensible reports. Again, read some of the links below to get some ideas about how to write a good paper or report.
Late policies:
- 1-3 days late: -10% points
- 4-7 days late: -20% points
- > 7 days late: 0 points
Time schedule
We expect the deliverables according to the following deadlines:
- 02/22 - Read the paper, determine the structure of the report, and meet with Prof./TA to discuss the structure
- 03/28 - Initial draft of the reports due and first review meeting with Prof./TA
- 04/15 - First draft of slides due and slide review meeting with Prof./TA
- 04/18 - Final reports due
Help for writing the report, preparing slides, and giving a talk
How to give a presentation and prepare slides:
- Page giving information on how to give a talk and prepare slides.
- http://www.eecs.berkeley.edu/~messer/Bad_talk.html - Bullet points on how to give a (bad) good talk.
- Other slides on how to give a good talk
- Page on how to write a CS article. Also comments on some general writing rules.
- Simon Peyton Jones's slides and video on how to write a great research paper
Literature Review Papers
The papers for the literature review part of the course are listed below. You will have until 01/30 to build groups. You will be able to vote on papers from 01/30 10:00am until 02/01 1:00pm. We will send you a link to a form.
Data cleaning and preparation
- Jeffery, S. R., Alonso, G., Franklin, M. J., Hong, W., and Widom, J. Declarative support for sensor data cleaning. In Pervasive computing, 83-100, Springer, 2006.
- Chalamalla, A., Ilyas, I. F., Ouzzani, M., and Papotti, P. Descriptive and prescriptive data cleaning. In SIGMOD, 2014.
- Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I. F., Ouzzani, M., and Tang, N. NADEEF: a commodity data cleaning system. In Proceedings of the 2013 ACM SIGMOD international conference on management of data, 541-552, ACM, 2013.
- Prokoshyna, N., Szlichta, J., Chiang, F., Miller, R. J., and Srivastava, D. Combining quantitative and logical data cleaning. 2015.
- Wang, J. and Tang, N. Towards dependable data repairing with fixing rules. SIGMOD, 2014.
Entity resolution
- Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C., and others. Declarative data cleaning: language, model, and algorithms. VLDB, 2001.
- Nigam, K., McCallum, A. K., Thrun, S., and Mitchell, T. Text classification from labeled and unlabeled documents using EM. Machine learning, 39(2-3):103-134, 2000.
- Cohen, W. W. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems (TOIS), 18(3):288-321, 2000.
- Verykios, V. S., Elmagarmid, A. K., and Houstis, E. N. Automating the approximate record-matching process. Information sciences, 126(1):83-98, 2000.
Data fusion
- Yan, L. L. and Ozsu, M. T. Conflict tolerant queries in AURORA. In Cooperative information systems, 1999. CoopIS, 279-290, IEEE, 1999.
- Greco, S., Pontieri, L., and Zumpano, E. Integrating and managing conflicting data. In Perspectives of system informatics, 349-362, Springer, 2001.
- Liu, X., Dong, X. L., Ooi, B. C., and Srivastava, D. Online data fusion. Proceedings of the VLDB Endowment, 4(11), 2011.
- Dong, X. L., Berti-Equille, L., and Srivastava, D. Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment, 2(1):550-561, 2009.
Schema matching and mapping
- Larson, J. A., Navathe, S. B., and Elmasri, R. A theory of attribute equivalence in databases with application to schema integration. IEEE Transactions on Software Engineering, 15(4):449-463, 1989.
- Doan, A., Domingos, P., and Halevy, A. Y. Reconciling schemas of disparate data sources: a machine-learning approach. In ACM SIGMOD Record, volume 30, 509-520, ACM, 2001.
- Embley, D. W., Jackman, D., and Xu, L. Multifaceted exploitation of metadata for attribute match discovery in information integration. In Workshop on information integration on the web, 110-117, 2001.
- Melnik, S., Garcia-Molina, H., and Rahm, E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Proceedings of the 18th international conference on data engineering, 117-128, IEEE, 2002.
- Madhavan, J., Bernstein, P. A., and Rahm, E. Generic Schema Matching with Cupid. The VLDB Journal, pages 49-58, 2001.
- Fuxman, A., Hernandez, M. A., Ho, H., Miller, R. J., Papotti, P., and Popa, L. Nested mappings: schema mapping reloaded. In Proceedings of the 32nd international conference on very large data bases, 67-78, VLDB Endowment, 2006.
Query answering with views and virtual data integration
- Levy, A., Rajaraman, A., and Ordille, J. Querying heterogeneous information sources using source descriptions. 1996.
- Pottinger, R. and Halevy, A. MiniCon: a scalable algorithm for answering queries using views. The VLDB Journal, 10(2):182-198, 2001.
- Dar, S., Franklin, M. J., Jonsson, B. T., Srivastava, D., Tan, M., and others. Semantic data caching and replacement. In VLDB, volume 96, 330-341, 1996.
- Friedman, M. and Weld, D. S. Efficiently executing information-gathering plans. In Proc. of the int. joint conf. of AI, 1997.
- Wiederhold, G. Mediators in the architecture of future information systems. Computer, 25(3):38-49, 1992.
- Calì, A., Calvanese, D., De Giacomo, G., Lenzerini, M., Naggar, P., and Vernacotola, F. IBIS: semantic data integration at work. In Advanced information systems engineering, 79-94, Springer, 2003.
Data exchange
- Karvounarakis, G., Green, T. J., Ives, Z. G., and Tannen, V. Collaborative data sharing via update exchange and provenance. ACM Transactions on Database Systems (TODS), 38(3):19, 2013.
- Fagin, R., Kolaitis, P. G., and Popa, L. Data Exchange: Getting to the Core. ACM Transactions on Database Systems (TODS), 30(1):174-210, 2005.
- Fuxman, A., Kolaitis, P. G., Miller, R. J., and Tan, W. C. Peer data exchange. ACM Transactions on Database Systems (TODS), 31(4):1454-1498, 2006.
- Mecca, G., Papotti, P., Raunich, S., and Buoncristiano, M. Concise and expressive mappings with +Spicy. Proceedings of the VLDB Endowment, 2(2):1582-1585, 2009.
- Miller, R. J., Haas, L. M., Hernández, M. A., Fagin, R., Popa, L., and Velegrakis, Y. Clio: Schema Mapping Creation and Data Exchange. Conceptual Modeling: Foundations and Applications, page 236, 2009.
Data warehousing
- Yang, J., Karlapalem, K., and Li, Q. Algorithms for materialized view design in data warehousing environment. In VLDB, volume 97, 136-145, 1997.
- Chan, C.-Y. and Ioannidis, Y. E. Bitmap index design and evaluation. ACM SIGMOD Record, 27(2):355-366, 1998.
- Wu, K., Otoo, E. J., and Shoshani, A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1):1-38, 2006.
- Klonatos, Y., Koch, C., Rompf, T., and Chafi, H. Building efficient query engines in a high-level language. Proceedings of the VLDB Endowment, 7(10):853-864, 2014.
- Kaufmann, M., Manjili, A. A., Vagenas, P., Fischer, P. M., Kossmann, D., Faerber, F., and May, N. Timeline index: a unified data structure for processing queries on temporal data in SAP HANA. In Proceedings of the 2013 international conference on management of data, 1173-1184, ACM, 2013.
- Yang, Y., Meneghetti, N., Fehling, R., Liu, Z. H., and Kennedy, O. Lenses: an on-demand approach to ETL. Proceedings of the VLDB Endowment, 8(12):1578-1589, 2015.
- Papadomanolakis, S. and Ailamaki, A. AutoPart: Automating schema design for large scientific databases using data partitioning. 2004.
- Quass, D. and Widom, J. On-line warehouse view maintenance. In ACM SIGMOD Record, volume 26, 393-404, ACM, 1997.
- Agrawal, D., El Abbadi, A., Singh, A., and Yurek, T. Efficient view maintenance at data warehouses. In ACM SIGMOD Record, volume 26, 417-427, ACM, 1997.
Big Data analytics
- Alsubaiee, S., Altowim, Y., Altwaijry, H., Behm, A., Borkar, V., Bu, Y., Carey, M., Cetindil, I., Cheelangi, M., Faraaz, K., and others. AsterixDB: a scalable, open source BDMS. Proceedings of the VLDB Endowment, 7(14):1905-1916, 2014.
- Borkar, V., Carey, M., Grover, R., Onose, N., and Vernica, R. Hyracks: a flexible and extensible foundation for data-intensive computing. In 2011 IEEE 27th international conference on data engineering (ICDE), 1151-1162, IEEE, 2011.
- Aguilera, M. K., Golab, W., and Shah, M. A. A practical scalable distributed B-tree. Proceedings of the VLDB Endowment, 1(1):598-609, August 2008.
- Schelter, S., Ewen, S., Tzoumas, K., and Markl, V. All roads lead to Rome: optimistic recovery for distributed iterative data processing. In Proceedings of the 22nd ACM international conference on information and knowledge management, 1919-1928, ACM, 2013.
- Ewen, S., Tzoumas, K., Kaufmann, M., and Markl, V. Spinning fast iterative data flows. Proceedings of the VLDB Endowment, 5(11):1268-1279, 2012.
- Binnig, C., Hildenbrand, S., Färber, F., Kossmann, D., Lee, J., and May, N. Distributed snapshot isolation: global transactions pay globally, local transactions pay locally. The VLDB Journal, 23(6):987-1011, 2014.
- Loesing, S., Pilman, M., Etter, T., and Kossmann, D. On the design and scalability of distributed shared-data databases. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, 663-676, ACM, 2015.
- Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. Spark SQL: relational data processing in Spark. In Proceedings of the ACM international conference on management of data (SIGMOD), 1383-1394, ACM, 2015.
Provenance
- Kohler, S., Ludaescher, B., and Zinn, D. First-order provenance games. In In search of elegance in the theory and practice of computation, pages 382-399, Springer, 2013.
- Meliou, A., Gatterbauer, W., Moore, K. F., and Suciu, D. The Complexity of Causality and Responsibility for Query Answers and non-Answers. Proceedings of the VLDB Endowment, 4(1):34-45, 2010.
- Xiao, D. and Eltabakh, M. Y. InsightNotes: summary-based annotation management in relational databases. 2014.
- Gehani, A. and Tariq, D. Provenance integration. In 6th USENIX workshop on the theory and practice of provenance (TaPP), 2014.
- Widom, J. Trio: A System for Managing Data, Uncertainty, and Lineage. Managing and Mining Uncertain Data, pages 113-148, 2008.
- Geerts, F., Kementsietsidis, A., and Milano, D. Mondrian: annotating and querying databases through colors and blocks. In Proceedings of the 22nd international conference on data engineering (ICDE), 82, IEEE, 2006.
- Bidoit, N., Herschel, M., Tzompanaki, K., and others. Query-based why-not provenance with NedExplain. In Extending database technology (EDBT), 2014.
- Stamatogiannakis, M., Groth, P., and Bos, H. Looking inside the black-box: capturing data provenance using dynamic instrumentation. In TaPP, 2014.
- Interlandi, M., Shah, K., Tetali, S. D., Gulzar, M. A., Yoo, S., Kim, M., Millstein, T., and Condie, T. Titian: data provenance support in Spark. PVLDB, 9(3), 2016.
- Anand, M. K., Bowers, S., McPhillips, T., and Ludaescher, B. Efficient Provenance Storage over Nested Data Collections. In EDBT: proceedings of the 12th international conference on extending database technology, 958-969, 2009.
- Chirigati, F. S., Shasha, D., and Freire, J. ReproZip: using provenance to support computational reproducibility. In TaPP, 2013.