HRDBMS
HRDBMS is a novel distributed relational database that combines the best of traditional (distributed) relational databases with ideas from modern distributed dataflow engines such as Hadoop and Spark. This allows HRDBMS to benefit from years of research on query optimization while retaining the scalability of Big Data platforms. The system was built from the ground up to avoid many of the bottlenecks of SQL on Hadoop and Spark, as well as the scalability issues of most traditional relational DBMSs. The ultimate goal is a system that combines the per-node performance of relational databases with the scalability of Big Data platforms. Some of the unique and not so unique features of HRDBMS are:
- A cost-based query optimizer
- Fully parallel and distributed execution engine
- Support for index structures
- Automatic caching through a rather traditional buffer manager
- Support for efficient disk-based query execution using proven traditional query execution algorithms
- Support for transactions
- A non-blocking shuffle implementation
- Support for horizontal partitioning and locality-aware query processing
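To make the last feature concrete, here is a minimal sketch of hash-based horizontal partitioning and a join that exploits locality. It is an illustration only and not HRDBMS code: the function names (`node_for`, `partition`, `local_join`), the CRC32 hash, and the sample tables are all assumptions chosen for the example. The idea is that when two tables are partitioned on the join key with the same hash function, matching rows land on the same node and can be joined without moving data.

```python
import zlib
from collections import defaultdict

def node_for(key, num_nodes):
    """Map a partitioning-key value to a node id using a stable hash (hypothetical helper)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes

def partition(rows, key, num_nodes):
    """Split rows (dicts) into per-node fragments by hashing row[key]."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[node_for(row[key], num_nodes)].append(row)
    return fragments

def local_join(left_frags, right_frags, key, num_nodes):
    """Join co-partitioned fragments node by node; no row crosses a node boundary."""
    result = []
    for node in range(num_nodes):
        # Build a hash table on this node's left fragment.
        index = defaultdict(list)
        for left_row in left_frags.get(node, []):
            index[left_row[key]].append(left_row)
        # Probe it with this node's right fragment.
        for right_row in right_frags.get(node, []):
            for left_row in index.get(right_row[key], []):
                result.append({**left_row, **right_row})
    return result

# Made-up sample data: both tables are partitioned on cust_id across 3 nodes.
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(5)]
orders = [{"order_id": 100 + i, "cust_id": i % 5} for i in range(12)]
joined = local_join(
    partition(customers, "cust_id", 3),
    partition(orders, "cust_id", 3),
    "cust_id",
    3,
)
```

In a real system each fragment would live on a different machine and the per-node loops would run in parallel. When the tables are not partitioned on the join key, rows must first be redistributed (shuffled), which is the step a non-blocking shuffle aims to speed up.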
Collaborators
- Ioan Raicu - Illinois Institute of Technology
Publications
-
A High-Performance Distributed Relational Database System for Scalable OLAP Processing
Jason Arnold, Boris Glavic and Ioan Raicu
Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium (2019), pp. 738–748.
@inproceedings{AG19,
  author = {Arnold, Jason and Glavic, Boris and Raicu, Ioan},
  booktitle = {Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium},
  keywords = {HRDBMS},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/AG19.pdf},
  projects = {HRDBMS},
  pages = {738-748},
  doi = {10.1109/IPDPS.2019.00083},
  title = {{A High-Performance Distributed Relational Database System for Scalable OLAP Processing}},
  venueshort = {IPDPS},
  year = {2019}
}
The scalability of systems such as Hive and Spark SQL that are built on top of big data platforms has enabled OLAP processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. We also support serializable transactions for compatibility even though the system has not been optimized for this. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS’s scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.
-
Improving Data-Shuffle Performance In Data-Parallel Distributed Systems
Shweelan Samson
Illinois Institute of Technology.
@mastersthesis{S18,
  author = {Samson, Shweelan},
  date-added = {2018-08-20 18:55:49 +0000},
  date-modified = {2018-08-20 18:55:49 +0000},
  keywords = {Big Data; HRDBMS; Distributed Databases},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/S18.pdf},
  projects = {HRDBMS},
  school = {Illinois Institute of Technology},
  title = {{Improving Data-Shuffle Performance In Data-Parallel Distributed Systems}},
  venueshort = {Master Thesis},
  year = {2018},
  bdsk-url-1 = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/S18.pdf}
}
-
HRDBMS: A NewSQL Database for Analytics
Jason Arnold, Boris Glavic and Ioan Raicu
Proceedings of the IEEE International Conference on Cluster Computing (Poster) (2015).
@inproceedings{AG15b,
  author = {Arnold, Jason and Glavic, Boris and Raicu, Ioan},
  booktitle = {Proceedings of the IEEE International Conference on Cluster Computing (Poster)},
  keywords = {Big Data; HRDBMS; Distributed Databases},
  pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/AG15b.pdf},
  projects = {HRDBMS},
  title = {HRDBMS: A NewSQL Database for Analytics},
  venueshort = {Cluster},
  year = {2015}
}