Scaling database systems to high-performance computers
Processing massive datasets quickly requires warehouse-scale computers. These computers consist of thousands of compute nodes, offer petabytes of main memory and are interconnected with RDMA-capable networks. However, they have very limited I/O bandwidth to shared, cold storage compared to clusters of commodity servers. Parallel database systems have been designed for small shared-nothing clusters and cannot fully utilize a high-performance computer. In addition, many massive datasets, especially in science, are not naturally represented as tables to be queried using SQL. Despite decades of data management research, many users still write format-specific, imperative code to sift through data.
In this talk, we will first present ArrayBridge, a bi-directional array view mechanism for the HDF5 array file format. ArrayBridge allows scientists to use SciDB, TensorFlow and HDF5-based code in the same file-centric analysis pipeline without converting between file formats. Under the hood, ArrayBridge manages I/O to leverage the massive concurrency of warehouse-scale parallel file systems without modifying the HDF5 API or breaking backwards compatibility with legacy applications. Once the data has been loaded into memory, the bottleneck in many array-centric queries becomes the speed of data repartitioning between different nodes. We will present an RDMA-aware data shuffling abstraction that directly converses with the network adapter in InfiniBand verbs and can repartition data up to 4X faster than MPI. We conclude by highlighting research challenges that must be overcome for database systems to scale to warehouse-scale computers.
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high-performance database systems, and his current research goal is to build a database system for high-end scientific computing facilities. He received his Ph.D. from the University of Wisconsin–Madison. Part of his Ph.D. dissertation was commercialized in Microsoft SQL Server 2014 as the Hekaton in-memory transaction processing engine.