Near-Data Processing

Data movement is one of the most expensive operations in modern, large-scale data processing in terms of energy, latency, and resource consumption. We are exploring architectures and configurations that move processing closer to the data, and we are developing new data processing algorithms for near-data processing in a variety of contexts and hardware configurations.

Caribou

István, Zsolt; Sidler, David

In this project we look beyond logical specialization of storage nodes for data processing applications and explore how physical specialization can offer benefits in throughput, latency, and beyond. Thanks to hardware pipelining, complex application-specific processing can be pushed down to the storage without impacting performance. The resulting system, Caribou, provides a key-value store interface common to many data processing applications and has a modular architecture that allows plugging in different application-specific processing units (e.g., complex filtering predicates for SQL queries).

Caribou's key-value store API and its ability to push down application-specific processing make it suitable for prototyping different kinds of smart storage. Since the processing logic is plugged into the key-value store using simple streaming interfaces, the design of compute units is greatly simplified and data management is already taken care of. This lowers the "entry barrier" to exploring new ideas.
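The idea of plugging processing units into a key-value store through a streaming interface can be illustrated in software. The following is a minimal sketch only: `ToyStore`, `register_unit`, `scan`, and `contains` are hypothetical names invented for this example and are not Caribou's actual API, which is implemented in hardware.

```python
from typing import Callable, Dict, Iterator, Optional

# Assumption: a processing unit consumes a stream of values and yields a stream.
ProcessingUnit = Callable[[Iterator[bytes]], Iterator[bytes]]


class ToyStore:
    """Hypothetical in-memory stand-in for a smart storage node."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}
        self._units: Dict[str, ProcessingUnit] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def register_unit(self, name: str, unit: ProcessingUnit) -> None:
        # The store owns data management; a unit only sees a value stream.
        self._units[name] = unit

    def scan(self, unit_name: Optional[str] = None) -> Iterator[bytes]:
        # Stream all values, optionally through a pushed-down processing unit.
        stream: Iterator[bytes] = iter(self._data.values())
        if unit_name is not None:
            stream = self._units[unit_name](stream)
        return stream


def contains(needle: bytes) -> ProcessingUnit:
    """Build a filtering unit that keeps values containing `needle`."""
    def unit(values: Iterator[bytes]) -> Iterator[bytes]:
        return (v for v in values if needle in v)
    return unit


store = ToyStore()
store.put(b"k1", b"city=Zurich")
store.put(b"k2", b"city=Basel")
store.put(b"k3", b"home=Zurich")
store.register_unit("has_zurich", contains(b"Zurich"))
matches = list(store.scan("has_zurich"))
```

In this sketch the filter runs inside the store, so only matching values reach the client. A new unit needs nothing beyond the stream-in, stream-out signature.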

See it on GitHub

Consensus in a Box

István, Zsolt; Sidler, David

Caribou relies on our previous work on FPGA-based distributed key-value stores. We showed that by carefully tuning the design of the hash table to the underlying FPGA, it is possible to achieve an order of magnitude higher performance than the state of the art on x86 processors. An FPGA-based key-value store connected directly to the network can not only replace several regular servers in terms of performance, but also dramatically reduce round-trip latency and increase energy efficiency. We show that it is possible to provide fault tolerance at scale in hardware, and that consensus (ZooKeeper's Atomic Broadcast, ZAB) can be removed from the critical path of performance by moving it to hardware. We implemented an all-hardware equivalent to ZooKeeper that uses both TCP and an application-specific network protocol. The design can be used to push more value into the network, e.g., by extending the functionality of middleboxes or adding inexpensive consensus to in-network processing nodes.
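To make the role of consensus concrete, the sketch below shows the general pattern behind leader-based atomic broadcast: a proposal commits once a majority of nodes has acknowledged it, and commits happen in sequence order. It is a deliberately simplified software illustration, not the hardware design described above and not a full ZAB implementation (leader election, epochs, and recovery are omitted). `ToyLeader` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ToyLeader:
    """Hypothetical leader for the normal-case broadcast phase only."""
    cluster_size: int  # total number of nodes, including the leader (node 0)
    next_seq: int = 0
    pending: Dict[int, str] = field(default_factory=dict)
    acks: Dict[int, Set[int]] = field(default_factory=dict)
    committed: List[Tuple[int, str]] = field(default_factory=list)

    def propose(self, value: str) -> int:
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = value
        self.acks[seq] = {0}  # the leader acknowledges its own proposal
        return seq

    def on_ack(self, seq: int, node_id: int) -> None:
        if seq in self.acks:  # ignore acks for unknown or committed proposals
            self.acks[seq].add(node_id)
        self._commit_in_order()

    def _commit_in_order(self) -> None:
        # Commit the oldest pending proposal only once a strict majority has
        # acknowledged it; later proposals wait so that order is preserved.
        while self.pending:
            seq = min(self.pending)
            if len(self.acks[seq]) * 2 <= self.cluster_size:
                break
            self.committed.append((seq, self.pending.pop(seq)))
            del self.acks[seq]


leader = ToyLeader(cluster_size=3)
first = leader.propose("put k1")
second = leader.propose("put k2")
leader.on_ack(second, node_id=1)  # second has a majority, but first does not yet
held_back = list(leader.committed)
leader.on_ack(first, node_id=2)   # first commits, which unblocks second
```

Every write waits for this round of acknowledgements, which is why the cost of the acknowledgement path matters so much for end-to-end performance.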

Farview

Korolija, Dario

Cloud deployments disaggregate storage from computing, providing more flexibility to the storage and compute layers. With Farview, we go a step further by extending the same principles to primary memory. Disaggregated memory uses network-attached DRAM as a way to decouple memory from the CPU. This is especially interesting for database applications: such a design makes a larger memory capacity available as a central pool to a collection of smaller processing nodes. Farview is implemented as an FPGA-based smart NIC that makes DRAM available as a disaggregated, network-attached memory module capable of processing data at line rate as it streams to and from disaggregated memory. It also supports query offloading using selection, projection, aggregation, regular expression matching, and encryption operators.
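Query offloading over data streams can be pictured as a chain of operators that each consume and produce a stream of rows. The following is a small software sketch of three of the operator kinds named above (selection, projection, aggregation). The function names, the row layout, and the example query are assumptions made for illustration and do not reflect Farview's actual interface.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]  # assumption: rows are small dictionaries of integer columns


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection: keep only rows that satisfy the predicate."""
    return (r for r in rows if predicate(r))


def project(rows: Iterable[Row], columns: List[str]) -> Iterator[Row]:
    """Projection: keep only the requested columns of each row."""
    return ({c: r[c] for c in columns} for r in rows)


def aggregate_sum(rows: Iterable[Row], column: str) -> int:
    """Aggregation: reduce the stream to a single sum over one column."""
    return sum(r[column] for r in rows)


def offloaded_query(memory_rows: Iterable[Row]) -> int:
    # Roughly: SELECT SUM(price) FROM t WHERE qty > 10
    rows = select(memory_rows, lambda r: r["qty"] > 10)
    rows = project(rows, ["price"])
    return aggregate_sum(rows, "price")


table = [
    {"id": 1, "qty": 5, "price": 100},
    {"id": 2, "qty": 20, "price": 30},
    {"id": 3, "qty": 15, "price": 12},
]
total = offloaded_query(table)
```

Because each operator is a generator, rows flow through the chain one at a time and only the final result leaves the pipeline, which mirrors the motivation for processing data next to the memory rather than shipping whole tables to the compute nodes.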