Paper "ACCL+: an FPGA-Based Collective Engine for Distributed Applications" accepted at OSDI'24

23.03.2024

The following paper has been accepted at the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI'24), to be held in Santa Clara (CA), USA, on 10–12 July 2024.

This work has been done in collaboration with AMD Research.

Title
ACCL+: an FPGA-Based Collective Engine for Distributed Applications

Authors
Zhenhao He (ETH Zürich), Dario Korolija (ETH Zürich), Yu Zhu (ETH Zürich), Benjamin Ramhorst (ETH Zürich), Tristan Laan (University of Amsterdam), Lucian Petrica (Research Labs AMD Xilinx), Michaela Blott (Research Labs AMD Xilinx), Gustavo Alonso (ETH Zürich)

Abstract
FPGAs are increasingly prevalent in cloud deployments, serving as Smart NICs or network-attached accelerators. Despite their potential, developing distributed FPGA-accelerated applications remains cumbersome due to the lack of appropriate infrastructure and communication abstractions.

To facilitate the development of distributed applications with FPGAs, in this paper we propose ACCL+, an open-source versatile FPGA-based collective communication library. Portable across different platforms and supporting UDP, TCP, as well as RDMA, ACCL+ empowers FPGA applications to initiate direct FPGA-to-FPGA collective communication. Additionally, it can serve as a collective offload engine for CPU applications, freeing the CPU from networking tasks. It is user-extensible, allowing new collectives to be implemented and deployed without having to re-synthesize the FPGA circuit.

We evaluated ACCL+ on an FPGA cluster with 100 Gb/s networking, comparing its performance against software MPI over RDMA. The results demonstrate ACCL+'s significant advantages for FPGA-based distributed applications and highly competitive performance for CPU applications.

We showcase ACCL+'s dual role with two use cases: seamlessly integrating as a collective offload engine to distribute CPU-based vector-matrix multiplication, and serving as a crucial and efficient component in designing fully FPGA-based distributed deep-learning recommendation inference.