COMPASS Talks
The Computing Platforms Seminar Series (COMPASS) features talks from industry and academia on the general topic of computing platforms.
Transparent and Unified CPU-GPU Snapshots for Scalable and Elastic Serverless AI
Abstract
Serverless platforms promise high agility and resource efficiency, yet applying them to modern AI workloads introduces severe infrastructure challenges. Inference engines suffer from minutes-long cold starts to load massive models, while long-running distributed training jobs frequently encounter hardware faults that waste vast amounts of compute and energy. Historically, Checkpoint/Restore (C/R) mechanisms have solved these issues for traditional cloud workloads. However, applying C/R to massive, highly asynchronous GPU states remains difficult. Existing solutions rely either on framework-specific application modifications—breaking the serverless abstraction—or on brittle API interposition that introduces steady-state performance overheads and requires invasive container injection.
In this talk, I will present criuGPU, a novel transparent and unified C/R engine designed for modern GPU-accelerated workloads. Unlike prior systems, criuGPU does not rely on application awareness or API interception. Instead, criuGPU augments CRIU (a userspace C/R engine) with recently added driver-level C/R support across both NVIDIA (CUDA) and AMD (ROCm) GPUs. The resulting C/R engine halts device execution and transparently captures a globally consistent state of both the CPU and GPU. criuGPU overcomes the opacity of driver APIs, orchestrates asynchronous execution, and mitigates disk I/O bottlenecks when dumping massive GPU states. The talk will highlight preliminary results on how criuGPU accelerates inference engine cold starts and reduces checkpoint time during long training jobs.
Bio
Rodrigo Bruno is an Assistant Professor at Instituto Superior Técnico (IST), University of Lisbon, and a Senior Researcher at INESC-ID Lisbon. Previously, he was a Senior Researcher at Oracle Labs Zurich, a Post-doc in the Systems Group at ETH Zurich, and a CS PhD student at IST, where he received his PhD in 2018. His research focuses on the intersection between Systems and Programming Languages, where he spends significant time optimizing language runtimes for scalable, elastic, and heterogeneous cloud environments. Rodrigo publishes his work in top conferences such as OSDI, ASPLOS, PLDI, EuroSys, and ATC. In addition, Rodrigo and his team regularly contribute their research output to open-source projects such as the OpenJDK HotSpot Java Virtual Machine (JVM), GraalVM, V8, CRIU, and Kubernetes, among others.