Industry Retreat 2025
The Industry Retreat 2025 will take place between the 12th and 15th of January 2025 at the Hotel Bellevue-Terminus, Engelberg.
Location
The Systems Group’s Industry Retreat 2025 will be held at the Hotel Bellevue-Terminus, Engelberg.
Program
The program, presentations, and posters are available as PDF documents at the following link. A password, announced during the retreat, is required to access the materials.
Sunday 12th January
18:30 – 19:30 Dinner
19:30 – 20:30 Introductory session (Ana Klimovic): Welcome, introduction, logistics, agenda, and format
Monday 13th January
08:00 – 09:00 Breakfast
09:00 – 10:30 Session 1: Serverless (Chair: Ana)
Invited Talk by Rodrigo Fonseca (Microsoft): Improving Energy, Power, and Thermal Efficiency of LLM Serving in the Cloud
Tom Kuchler: Dandelion
Lazar Cvetkovic: Dirigent: lightweight serverless cluster management
10:30 – 11:00 Coffee Break
11:00 – 12:30 Session 2: More Efficient Networking and Data Management
(Chair: Mothy)
Invited Talk by Michal Friedman (ETH Zurich): Efficient Synchronization and Data Management for Disaggregated Memory Systems
Jonas Dann: In-network parsing and filtering
Maximilian Heer: Balboa: RDMA for smart NICs
Pengcheng Xu: Lauberhorn: A SmartNIC that is part of the OS
12:30 – Lunch, meetings, and free time
17:00 – 18:30 Session 3: Data Processing on Modern Hardware (Chair: Jonas Dann)
Invited Talk by Satnam Singh (Groq): Inside the Engine Room of LLMs at Groq
Yu Zhu: ML Preprocessing Pipelines
Maximilian Böther: Mixtera: LLM training data mixing platform
Marko Kabić: Maximus
19:00 – 20:00 Dinner
20:00 – Poster Session
Tuesday 14th January
08:00 – 09:00 Breakfast
09:00 – 10:45 Session 1: Trustworthy and reliable systems (Chair: Michal)
Invited Talk by Rebecca Isaacs (Amazon): Analyzing metastable failures
Sam Grütter: Formal Verification for Systems
Zikai Liu: Generating trustworthy hardware/software I2C drivers for board management controllers
Ben Fiedler: Automated Reasoning About Memory Accesses on SoCs
Jasmin Schult: Cache control for modern NUMA architectures
10:45 – 11:10 Coffee Break
11:10 – 12:30 Session 2: Hardware Acceleration (Chair: Yazhuo Zhang)
Invited Talk – TBA
Wenqi Jiang: RAGs
Bowen Wu: Relational Queries on GPUs
12:30 – Lunch, meetings, and free time
17:00 – 18:30 Session 3: Systems for ML (Chair: David Cock)
Invited Talk by Steve Reinhardt (AMD): AI: Challenges and Opportunities
Foteini Strati: SAILOR
Yongjun He: Inference + Fine Tuning
Xiaozhe Yao: DeltaZip
Hidde Lycklama: Holding Secrets Accountable: Auditing Privacy-Preserving Machine Learning
19:00 – 20:00 Dinner
20:00 – Poster Session
Wednesday 15th January
08:00 – 09:00 Breakfast
09:00 – 10:30 Session 7 (Chair: Gustavo)
Invited Talk by Rolf Neugebauer (Amazon): Trends and systems research areas in ML infrastructure
Breakout brainstorming session:
Pick a question to discuss for 30 minutes, then have a representative
present a 2-minute summary of the discussion to everyone. Questions:
• How can we design hardware and software to be more trustworthy?
• What should the future server memory hierarchy look like,
e.g., with disaggregated memory? Which applications will benefit?
• How should the cloud programming model evolve and how should we
co-design cloud system software for performance and efficiency?
• How can we improve the resource efficiency and scalability of AI
and data processing systems?
10:30 – 11:00 Coffee Break
11:00 – 12:30 Open session, brainstorming, and feedback.
12:30 – 13:30 Lunch
14:02 – Departure of the group train to Zurich
Industry Talks
Efficient Synchronization and Data Management for Disaggregated Memory Systems
Michal Friedman, ETH Zurich – Monday, 13th January, 11:00
The rapid growth of data-driven applications has pushed the boundaries of traditional memory systems. Disaggregated memory systems, which decouple memory from compute nodes, have emerged as a promising paradigm to address this challenge. These systems can operate either within a shared coherency domain, enabled by technologies like Compute Express Link (CXL), or independently. This talk explores the evolving landscape of disaggregated memory and highlights one future direction: enabling atomic operations using Remote Direct Memory Access (RDMA). This capability has the potential to significantly reduce the overhead associated with data movement and to improve synchronization primitives across distributed systems, allowing more efficient coordination and data sharing.
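To make the synchronization idea concrete: RDMA NICs expose a remote 64-bit compare-and-swap that succeeds without involving the remote CPU. The following is a minimal illustrative sketch (my own stand-in, not the system described in the talk): a Python class emulates a remote memory region, its `cas` method mimics RDMA CAS semantics (swap only if the current value matches, always return the previously observed value), and a one-CAS spinlock shows how nodes could coordinate over disaggregated memory. `RemoteMemory`, `LOCK_ADDR`, `try_acquire`, and `release` are all hypothetical names introduced here.

```python
class RemoteMemory:
    """Stand-in for a disaggregated memory region reachable via RDMA."""

    def __init__(self):
        self.words = {}  # address -> 64-bit word (unset words read as 0)

    def read(self, addr):
        return self.words.get(addr, 0)

    def cas(self, addr, expected, new):
        """Emulate RDMA compare-and-swap: atomically store `new` only if the
        current value equals `expected`; return the value observed before
        the operation (the RDMA atomic convention)."""
        old = self.words.get(addr, 0)
        if old == expected:
            self.words[addr] = new
        return old


LOCK_ADDR = 0x1000  # hypothetical address of a lock word in remote memory


def try_acquire(mem, node_id):
    """Acquire the remote lock iff it is free (0), using a single CAS."""
    return mem.cas(LOCK_ADDR, 0, node_id) == 0


def release(mem, node_id):
    """Release the lock only if this node still holds it."""
    mem.cas(LOCK_ADDR, node_id, 0)
```

Because the CAS returns the prior value, a node learns in one round trip whether it won the lock, which is exactly the property that makes RDMA atomics attractive for cross-node coordination.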
Inside the Engine Room of LLMs at Groq
Satnam Singh, Fellow, Groq – Monday, 13th January, 17:00
Groq’s Language Processing Unit (LPU) chips are reshaping the landscape of large language model (LLM) deployment at scale. By prioritizing low latency and high throughput, our hardware and software stack enables rapid and efficient inference, making it a good fit for applications where LLMs must be invoked repeatedly by agents, e.g., for solving mathematical problems. In this talk, I will describe the architecture of Groq’s LPU chips, which leverage deterministic execution and distributed SRAM to deliver high performance with low latency and high throughput. I will explain how the determinism property allows us to deploy open-weight models such as Llama3-70B, Gemma2, and Mixtral 8x7B with predictable, scalable performance.
Analyzing metastable failures
Rebecca Isaacs, Amazon AWS – Tuesday, 14th January, 9:00
Metastable failures are congestive collapses in which the system does not recover after a transient stressor, such as increased load or diminished capacity, subsides. They are rare, but potentially catastrophic if the failure cascades across interdependent microservices, and they are notoriously hard to diagnose and mitigate, sometimes causing prolonged outages affecting millions of users. Standard resiliency mechanisms, including retry with exponential backoff, load shedding, and queue bounds, are important components of defense-in-depth against metastable failures. However, it is challenging for a service operator to configure these mechanisms appropriately while balancing performance and availability requirements. Even worse, there is no way for operators to have confidence that a given set of defensive mechanisms is sufficient to prevent future metastable failures. In this talk, I will describe how we are tackling this problem at AWS with a suite of tools ranging from modeling the system as a continuous-time Markov chain, to discrete-event simulation, to emulation in the cloud.