The scalability of systems such as Hive and Spark SQL that are built on top of big data platforms has enabled OLAP processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. We also support serializable transactions for compatibility, even though the system has not been optimized for transactional workloads. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS’s scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.
@inproceedings{AG19, author = {Arnold, Jason and Glavic, Boris and Raicu, Ioan}, booktitle = {Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium}, keywords = {HRDBMS}, pdfurl = {http://cs.iit.edu/%7edbgroup/assets/pdfpubls/AG19.pdf}, projects = {HRDBMS}, pages = {738--748}, doi = {10.1109/IPDPS.2019.00083}, title = {{A High-Performance Distributed Relational Database System for Scalable OLAP Processing}}, venueshort = {IPDPS}, year = {2019} }