Hardware Acceleration for Machine Learning
We are exploring systems for machine learning with a focus on improving performance and energy efficiency on emerging hardware platforms. The configurations we explore range from single-accelerator designs to clusters of homogeneous accelerators to distributed, heterogeneous hardware in the cloud. Our topics of interest include algorithm design, data formats and representation, hardware-software co-design of algorithms and systems, systems architecture, and the overall system design of ML platforms for heterogeneous hardware.
Efficient K-Means Operator for Hybrid Databases
He, Zhenhao; Sidler, David, Dr.; István, Zsolt, Prof.; Wang, Zeke, Prof.
In this project, we explore the challenges and opportunities in designing efficient machine learning operators, particularly K-means, in the context of a hybrid CPU-FPGA database. One challenge of hardware acceleration in a database is runtime parameterization of the accelerator, which avoids frequent reprogramming. The other is the concurrent use of the accelerator in a way that balances memory bandwidth and computation. We therefore propose a runtime-parameterizable design that uses low-precision input in conjunction with the standard K-means algorithm to improve memory bandwidth utilization on hardware accelerators.
See it on GitHub
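The core idea above can be illustrated with a short sketch: quantize the input to a few bits with a uniform scheme, then run standard Lloyd iterations on the quantized data. This is a minimal illustration, not the project's actual accelerator design; the quantization scheme and all function names here are assumptions.

```python
import numpy as np

def quantize(x, bits, lo, hi):
    """Uniformly quantize values to `bits` bits over [lo, hi] (illustrative scheme)."""
    levels = (1 << bits) - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return lo + q / levels * (hi - lo)

def kmeans_low_precision(x, k, bits=4, iters=20, seed=0):
    """Standard Lloyd iterations, but on low-precision input data.

    Reducing input precision shrinks the volume of data streamed from
    memory, which is the motivation for low-precision K-means on FPGAs.
    """
    rng = np.random.default_rng(seed)
    xq = quantize(x, bits, x.min(), x.max())
    centers = xq[rng.choice(len(xq), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center for each (quantized) point.
        d = np.linalg.norm(xq[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: recompute each center as the mean of its points.
        for j in range(k):
            pts = xq[assign == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, assign
```

With a few bits per value, the assignment step reads a fraction of the data a full-precision run would, trading a small loss in cluster quality for bandwidth.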
Efficient processing of ML models in databases
He, Zhenhao; Sidler, David, Dr.; István, Zsolt, Prof.; Wang, Zeke, Prof.
Our ZipML project explores efficient in-database ML training and inference on modern hardware. On the one hand, we build a hybrid CPU-FPGA database that supports on-the-fly machine learning training. On the other hand, we explore low-precision DNN training and inference on FPGAs: we present FPGA-based trainers supporting both single-precision floating-point and low-precision integer (8-, 4-, 2-, and 1-bit) arithmetic, and study the trade-offs that affect the end-to-end performance of dense linear model training.
See it on GitHub: ColumnML, MLWeaving, ZipML-XeonFPGA, ZipML-PYNQ
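The trade-off studied above can be sketched in a few lines: quantize the training data to a given bit width, then run plain SGD for a dense linear model on the quantized inputs. This is a software illustration of the idea under assumed settings (uniform quantization, squared loss), not the FPGA implementation.

```python
import numpy as np

def quantize(x, bits):
    """Uniform quantization to `bits` bits over the data range (illustrative)."""
    lo, hi = x.min(), x.max()
    levels = (1 << bits) - 1
    return lo + np.round((x - lo) / (hi - lo) * levels) / levels * (hi - lo)

def sgd_linear(x, y, bits=4, lr=0.05, epochs=50):
    """Train a dense linear model with SGD on low-precision input data.

    The model weights stay full precision; only the streamed training
    data is quantized, which is where the memory-bandwidth saving comes from.
    """
    xq = quantize(x, bits)
    w = np.zeros(x.shape[1])
    for _ in range(epochs):
        for i in range(len(xq)):
            grad = (xq[i] @ w - y[i]) * xq[i]  # squared-loss gradient
            w -= lr * grad
    return w
```

Even at 4 bits, the quantization error acts like bounded noise on the gradient, so the model still converges close to the full-precision solution on well-conditioned data.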
Inference of Decision Tree Ensemble on CPU-FPGA Platforms
Owaida, Muhsen, Dr.
We explore the design of a flexible and scalable FPGA architecture. In addition, we combine CPU and FPGA processing to scale to large tree ensembles with millions of nodes. In this work, we developed an inference system for decision tree ensembles on Intel's Xeon+FPGA platform. The system targets XGBoost, one of the most successful boosted-tree algorithms in machine learning. In future steps, we want to explore using low precision for representing either the data or the tree nodes' threshold values, which would enable processing even larger ensembles on the FPGA at higher performance.
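To make the inference workload concrete, here is a minimal sketch of decision tree ensemble inference with the tree stored as flat per-node arrays, a layout that maps naturally onto hardware pipelines. The arrays, the leaf marker, and the function names are illustrative assumptions, not the system's actual data layout.

```python
import numpy as np

# One small tree stored as flat arrays, one entry per node:
# split feature index (-1 marks a leaf), split threshold,
# left/right child indices, and leaf output values.
feature   = np.array([0,   1,   -1,  -1,  -1])
threshold = np.array([0.5, 0.3, 0.0, 0.0, 0.0])
left      = np.array([1,   3,   0,   0,   0])
right     = np.array([2,   4,   0,   0,   0])
value     = np.array([0.0, 0.0, 1.0, -1.0, 2.0])

def predict_tree(x):
    """Traverse one tree: at each node, compare the split feature to its threshold."""
    node = 0
    while feature[node] != -1:
        node = left[node] if x[feature[node]] <= threshold[node] else right[node]
    return value[node]

def predict_ensemble(trees, x):
    """A boosted-tree ensemble prediction is the sum of the per-tree outputs."""
    return sum(t(x) for t in trees)
```

Each traversal is a short chain of compare-and-branch steps on these arrays, which is why many trees can be evaluated in parallel on an FPGA, and why narrower threshold values directly shrink the on-chip storage per node.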
Recommendation Systems on Modern Hardware
Jiang, Wenqi; Zhu, Yu; He, Zhenhao; Zhang, Ce, Prof. Dr.
In this work, we explore hardware acceleration of DNN-based personalized recommendation systems on FPGAs and GPUs. Unlike regular DNN inference workloads, which are typically bound by computation, recommendation inference is primarily bound by memory due to the many random accesses needed to look up the embedding tables. To this end, we design MicroRec, a high-performance FPGA inference engine for recommendation systems that tackles the memory bottleneck with both computer-architecture and data-structure solutions. Once the memory bottleneck is removed, the DNN computation becomes the primary bottleneck due to the limited compute power of FPGAs.
Thus, we further design and implement a high-performance and heterogeneous recommendation inference cluster named FleetRec that takes advantage of the strengths of both FPGAs and GPUs. We are now developing a user-friendly framework for recommendation systems that can automatically deploy models on FPGA clusters.
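The workload shape described above, many random embedding-table gathers followed by a comparatively small MLP, can be sketched as follows. The table sizes, layer shapes, and function names are illustrative assumptions, not the structure of any specific production model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Several embedding tables; each query gathers one row from each table.
# These gathers are random memory accesses, which is why recommendation
# inference is memory-bound rather than compute-bound.
NUM_TABLES, ROWS, DIM = 8, 10_000, 16
tables = [rng.standard_normal((ROWS, DIM), dtype=np.float32)
          for _ in range(NUM_TABLES)]

def recommend_score(sparse_ids, dense_features, w1, w2):
    """Toy recommendation inference: gather one embedding per table,
    concatenate with dense features, then apply a small two-layer MLP."""
    emb = np.concatenate([tables[t][i] for t, i in enumerate(sparse_ids)])
    h = np.maximum(np.concatenate([dense_features, emb]) @ w1, 0.0)  # ReLU layer
    return float(h @ w2)
```

In this shape, the gather stage touches scattered rows across large tables while the MLP touches only a few small dense matrices, which is the imbalance that motivates pairing FPGAs (fast lookups) with GPUs (dense compute) in a heterogeneous cluster.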