Databases on Heterogeneous Architectures
We are exploring the use of hardware acceleration and the impact of modern hardware on data processing at all scales, from relational engines to large cloud deployments. Over time, we have built a number of novel systems that have pioneered several ideas now widely used in industry. Code is available for all of these projects, as we often build systems on top of previous efforts, which serve as infrastructure for the next, more ambitious projects.
Compression and Encryption in SAP HANA
Maschi, Fabio
Chiosa, Monica
With the advent of cloud computing, where computational resources are expensive and data movement needs to be secured and minimized, database management systems need to reconsider their architecture to accommodate such requirements. In this paper, we present our analysis, design, and evaluation of an FPGA-based hardware accelerator for offloading compression and encryption in SAP HANA, SAP's Software-as-a-Service (SaaS) in-memory database. First, we identify expensive data-transformation operations in the I/O path. Then, we present the design details of a system consisting of compression followed by different types of encryption to accommodate different security levels, and identify which combinations maximize performance. We also analyze the performance benefits of offloading decryption to the FPGA followed by decompression on the CPU. The experimental evaluation using SAP HANA traces shows that analytical engines can benefit from FPGA hardware offloading. The results identify a number of important trade-offs (e.g., the system can serve use cases ranging from low-latency secured transactions to high-performance workloads, or offer lower storage cost by also compressing payloads for less critical use cases), and provide valuable information to researchers and practitioners exploring the nascent space of hardware accelerators for database engines.
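The order of the two stages in the I/O path matters: compression must run before encryption, because ciphertext is high-entropy and no longer compresses. The sketch below illustrates this with Python's zlib and a toy XOR keystream cipher standing in for the real encryption (the function names and the cipher are illustrative only, not the SAP HANA or FPGA implementation).

```python
import random
import zlib

def toy_stream_cipher(data: bytes, key: int) -> bytes:
    # Toy XOR keystream "cipher" -- a placeholder for real encryption
    # (e.g. AES); XOR with the same keystream also decrypts.
    rng = random.Random(key)
    return bytes(b ^ rng.randrange(256) for b in data)

def write_path(page: bytes, key: int) -> bytes:
    # Compress first, then encrypt: reversing the order would make the
    # compression stage useless, since ciphertext does not compress.
    return toy_stream_cipher(zlib.compress(page), key)

def read_path(blob: bytes, key: int) -> bytes:
    # Decryption (offloadable to the FPGA) followed by decompression (CPU).
    return zlib.decompress(toy_stream_cipher(blob, key))

page = b"columnar payload " * 512
blob = write_path(page, key=42)
assert read_path(blob, key=42) == page     # round trip is lossless
assert len(blob) < len(page)               # compressed before encryption
```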
doppioDB
doppioDB is a hardware-accelerated database: it extends MonetDB with hardware user-defined functions (HUDFs) to provide seamless integration of hardware operators into the database engine. doppioDB targets hybrid multicore architectures such as the Intel Xeon+FPGA platform or IBM's POWER8 with CAPI, in which the accelerator (FPGA) has direct access to main memory. On traditional accelerators, data movement to and from the accelerator is explicit and often involves reformatting the data to fit the accelerator's execution model. In doppioDB, by contrast, the accelerator is integrated as a specialized co-processor that operates on the same data as the database engine. doppioDB makes use of Centaur, which bridges the gap between MonetDB on the CPU side and the hardware operators implemented on the FPGA. Centaur provides a software API, used by the HUDFs, to create and monitor jobs on the FPGA.
See it on GitHub.
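The create-and-monitor job pattern that an HUDF follows can be pictured, in heavily simplified software form, as below. The class and method names are illustrative, not Centaur's actual API; a worker thread stands in for the FPGA, which in the real system reads the shared data in place.

```python
import queue
import threading
import time

class FpgaJobQueue:
    # Illustrative stand-in for a Centaur-style job manager: an HUDF
    # enqueues a job for a hardware operator and polls for completion,
    # while the "FPGA" (here a worker thread) processes shared data.
    def __init__(self):
        self._jobs = queue.Queue()
        self._done = {}
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, job_id, operator, data):
        self._jobs.put((job_id, operator, data))

    def _worker(self):
        while True:
            job_id, operator, data = self._jobs.get()
            self._done[job_id] = operator(data)   # operate on shared data

    def poll(self, job_id):
        # None while the job is still running, the result once finished.
        return self._done.get(job_id)

q = FpgaJobQueue()
q.submit(1, sum, [1, 2, 3])        # e.g. a hardware aggregation operator
while q.poll(1) is None:           # HUDF monitors the job
    time.sleep(0.001)
assert q.poll(1) == 6
```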
ERBium
Maschi, Fabio
Business Rule Management Systems (BRMSs) are widely used in industry for a variety of tasks. Their main advantage is to codify, in a succinct and queryable manner, vast amounts of constantly evolving logic. In BRMSs, rules are typically captured as facts (tuples) over a collection of criteria, and checking them involves querying the collection of rules to find the best match. In this paper, we focus on a real-world use case from the airline industry: determining the minimum connection time (MCT) between flights. The MCT module is part of the flight search engine, and captures the ever-changing constraints at each airport that determine the time to allocate between an arriving and a departing flight for a connection to be feasible. We explore how to use hardware acceleration to (i) improve the performance of the MCT module (lower latency, higher throughput); and (ii) reduce the amount of computing resources needed. A key aspect of the solution is the transformation of a collection of rules into a non-deterministic finite-state automaton (NFA) efficiently implemented on the FPGA. Experiments performed on-premises and in the cloud show several orders of magnitude improvement over the existing solution, and the potential to reduce by 40% the number of machines needed for the flight search engine.
See it on GitHub.
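The rule-to-automaton idea can be sketched in a few lines of Python: each criterion position becomes one automaton step, so a single pass over the query tuple checks all rules simultaneously. The rule format, the wildcard convention, and the tie-breaking by specificity are simplified illustrations, not ERBium's actual encoding, and the rule data is made up.

```python
def match(rules, query):
    # rules: list of (criteria_tuple, mct_minutes); '*' is a wildcard.
    # Simulates the NFA: the state is the set of rules still alive
    # after consuming a prefix of the query's criteria.
    alive = list(range(len(rules)))             # start state: all rules
    for pos, value in enumerate(query):
        alive = [i for i in alive
                 if rules[i][0][pos] in ('*', value)]   # one NFA step
    # Best match = most specific surviving rule (fewest wildcards).
    best = min(alive, key=lambda i: rules[i][0].count('*'), default=None)
    return rules[best][1] if best is not None else None

# (airport, arrival terminal, departure terminal) -> minutes (made up)
rules = [
    (('*',   '*', '*'), 60),   # default minimum connection time
    (('ZRH', '*', '*'), 40),
    (('ZRH', '1', '2'), 90),   # cross-terminal connection at ZRH
]
assert match(rules, ('ZRH', '1', '2')) == 90   # most specific rule wins
assert match(rules, ('ZRH', '1', '1')) == 40
assert match(rules, ('JFK', '4', '4')) == 60   # falls back to default
```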
Hardware-centric Transaction Processing for Distributed Memory
Shi, Runbin, Dr.
Transaction processing is a key part of database management systems as well as a fundamental memory semantic of general computation. Over the past 40 years, researchers have proposed many concurrency control schemes to ensure the atomicity of transactions, but software implementations introduce significant overhead: less than 10% of the memory bandwidth ends up being used for useful data processing, even though memory bandwidth and capacity are among the most expensive resources in the cloud computing era. In this project, we build a hardware transaction management layer for a distributed memory system composed of many heterogeneous memory (DDR/HBM) channels controlled by FPGA logic. The purpose of this layer is to maximize the memory bandwidth available for useful data processing. We specialize hardware for two fundamental concurrency control schemes, two-phase locking (2PL) and timestamp ordering (TSO), with the following novelties. First, we developed a fine-grained hardware context switch that effectively hides the latency of data dependencies and concurrency control. Second, we developed a distributed on-chip lock manager that achieves both high throughput and low latency. Third, we developed serialization/deserialization modules based on our RDMA stacks to fully utilize the 100G network. Together, these let us extensively utilize memory and network bandwidth and deliver ultrahigh-throughput memory management hardware for transaction processing.
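As a refresher on one of the two schemes, the per-record timestamp-ordering (TSO) rule can be stated in a few lines: a transaction with timestamp ts may read a record only if ts is not older than the record's write timestamp, and may write only if ts is not older than either its read or write timestamp; otherwise the transaction aborts. The Python below is a software sketch of this textbook rule only, not the project's hardware design.

```python
class TsoRecord:
    # One record under basic timestamp ordering: tracks the largest
    # timestamps that have read and written it so far.
    def __init__(self, value):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0

    def read(self, ts):
        if ts < self.write_ts:             # record already overwritten
            raise RuntimeError("abort: read arrives too late")
        self.read_ts = max(self.read_ts, ts)
        return self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:
            raise RuntimeError("abort: write arrives too late")
        self.write_ts = ts
        self.value = value

rec = TsoRecord(0)
rec.write(ts=1, value=10)
assert rec.read(ts=2) == 10
try:
    rec.write(ts=1, value=99)   # older than read_ts=2 -> must abort
    assert False, "expected an abort"
except RuntimeError:
    pass
```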
Operators
Kara, Kaan; Sidler, David; István, Zsolt; Owaida, Muhsen
Partitioning
Data partitioning is often used in databases to improve data access patterns of query execution engines. One prominent example of this is the radix join algorithm, which partitions large input tables to small cache-fitting parts, so that classic hash join can be performed with cache-fitting hash tables. The initial partitioning phase improves overall join performance significantly, especially if the input tables are large. FPGAs can perform partitioning efficiently, because (1) distributed on-chip memory (block RAMs) can be used to do specialized caching to improve random-access behavior, (2) partitioning can be implemented as a deep dataflow, enabling pipeline parallelism to improve throughput, (3) spatial parallelism on the FPGA can be exploited to create a vector-like instruction that is specialized for partitioning.
Stochastic Gradient Descent (SGD)
Using machine learning algorithms directly on relational data residing in a relational database has many advantages: the ability to perform machine learning tasks on relational data, combined with the robust and declarative way of interacting with that data within a database, is very attractive. With the SGD operator, our goal is to provide the capability to train linear models directly on relational data residing in doppioDB, and to accelerate this training using an FPGA.