Industry Retreat 2025
The Industry Retreat 2025 will take place between the 12th and 15th of January 2025 at the Hotel Bellevue-Terminus, Engelberg.
Location
The Systems Group’s Industry Retreat 2025 will be held at the Hotel Bellevue-Terminus, Engelberg.
Program
The program, presentations, and posters are available as PDF documents at the following link. A password, announced during the retreat, is required to access the materials.
Sunday 12th January
18:30 – 19:30 Dinner
19:30 – 20:30 Introductory session (Ana Klimovic): Welcome, introduction, logistics, agenda, and format
Monday 13th January
08:00 – 09:00 Breakfast
09:00 – 10:30 Session 1: Serverless (Chair: Ana)
Invited Talk by Rodrigo Fonseca (Microsoft): Improving Energy, Power, and Thermal Efficiency of LLM Serving in the Cloud
Tom Kuchler: Dandelion
Lazar Cvetkovic: Dirigent: lightweight serverless cluster management
10:30 – 11:00 Coffee Break
11:00 – 12:30 Session 2: More Efficient Networking and Data Management
(Chair: Mothy)
Invited Talk by Michal Friedman (ETH Zurich): Efficient Synchronization and Data Management for Disaggregated Memory Systems
Jonas Dann: In-network parsing and filtering
Maximilian Heer: Balboa: RDMA for smart NICs
Pengcheng Xu: Lauberhorn: A SmartNIC that is part of the OS
12:30 – Lunch, meetings, and free time
17:00 – 18:30 Session 3: Data Processing on Modern Hardware (Chair: Jonas Dann)
Invited Talk by Satnam Singh (Groq): Inside the Engine Room of LLMs at Groq
Yu Zhu: ML Preprocessing Pipelines
Maximilian Böther: Mixtera: LLM training data mixing platform
Marko Kabić: Maximus
19:00 – 20:00 Dinner
20:00 – Poster Session
Tuesday 14th January
08:00 – 09:00 Breakfast
09:00 – 10:45 Session 1: Trustworthy and reliable systems (Chair: Michal)
Invited Talk by Rebecca Isaacs (Amazon): Analyzing metastable failures
Sam Grütter: Formal Verification for Systems
Zikai Liu: Generating trustworthy hardware/software I2C drivers for board management controllers
Ben Fiedler: Automated Reasoning About Memory Accesses on SoCs
Jasmin Schult: Cache control for modern NUMA architectures
10:45 – 11:10 Coffee Break
11:10 – 12:30 Session 2: Hardware Acceleration (Chair: Yazhuo Zhang)
Invited Talk – TBA
Wenqi Jiang: RAGs
Bowen Wu: Relational Queries on GPUs
12:30 – Lunch, meetings, and free time
17:00 – 18:30 Session 3: Systems for ML (Chair: David Cock)
Invited Talk by Steve Reinhardt (AMD): AI: Challenges and Opportunities
Foteini Strati: SAILOR
Yongjun He: Inference + Fine Tuning
Xiaozhe Yao: DeltaZip
Hidde Lycklama: Holding Secrets Accountable: Auditing Privacy-Preserving Machine Learning
19:00 – 20:00 Dinner
20:00 – Poster Session
Wednesday 15th January
08:00 – 09:00 Breakfast
09:00 – 10:30 Session 7 (Chair: Gustavo)
Invited Talk by Rolf Neugebauer (Amazon): Trends and systems research areas in ML infrastructure
Breakout brainstorming session:
Pick a question to discuss for 30 minutes, then have a representative
present a 2-minute summary of the discussion to everyone. Questions:
• How can we design hardware and software to be more trustworthy?
• What should the future server memory hierarchy look like,
e.g., with disaggregated memory? Which applications will benefit?
• How should the cloud programming model evolve and how should we
co-design cloud system software for performance and efficiency?
• How can we improve the resource efficiency and scalability of AI
and data processing systems?
10:30 – 11:00 Coffee Break
11:00 – 12:30 Open session, brainstorming, and feedback.
12:30 – 13:30 Lunch
14:02 – Departure of the group train to Zurich
Industry Talks
Efficient Synchronization and Data Management for Disaggregated Memory Systems
Michal Friedman, ETH Zurich – Monday, 13th January, 11:00
The rapid growth of data-driven applications has pushed the boundaries of traditional memory systems. Disaggregated memory systems, which decouple memory from compute nodes, have emerged as a promising paradigm to address this challenge. These systems can operate either within a shared coherency domain, enabled by technologies like Compute Express Link (CXL), or independently. This talk explores the evolving landscape of disaggregated memory and highlights one future direction: enabling atomic operations using Remote Direct Memory Access (RDMA). This capability has the potential to significantly reduce the overhead associated with data movement and to improve synchronization primitives across distributed systems, allowing more efficient coordination and data sharing.
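To make the synchronization idea concrete: RDMA NICs expose a remote 64-bit compare-and-swap that succeeds without involving the remote CPU. The following is a minimal illustrative sketch (my own stand-in, not the system described in the talk): a Python class emulates a remote memory region, its `cas` method mimics RDMA CAS semantics (swap only if the current value matches, always return the previously observed value), and a one-CAS spinlock shows how nodes could coordinate over disaggregated memory. `RemoteMemory`, `LOCK_ADDR`, `try_acquire`, and `release` are all hypothetical names introduced here.

```python
class RemoteMemory:
    """Stand-in for a disaggregated memory region reachable via RDMA."""

    def __init__(self):
        self.words = {}  # address -> 64-bit word (unset words read as 0)

    def read(self, addr):
        return self.words.get(addr, 0)

    def cas(self, addr, expected, new):
        """Emulate RDMA compare-and-swap: atomically store `new` only if the
        current value equals `expected`; return the value observed before
        the operation (the RDMA atomic convention)."""
        old = self.words.get(addr, 0)
        if old == expected:
            self.words[addr] = new
        return old


LOCK_ADDR = 0x1000  # hypothetical address of a lock word in remote memory


def try_acquire(mem, node_id):
    """Acquire the remote lock iff it is free (0), using a single CAS."""
    return mem.cas(LOCK_ADDR, 0, node_id) == 0


def release(mem, node_id):
    """Release the lock only if this node still holds it."""
    mem.cas(LOCK_ADDR, node_id, 0)
```

Because the CAS returns the prior value, a node learns in one round trip whether it won the lock, which is exactly the property that makes RDMA atomics attractive for cross-node coordination.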
Inside the Engine Room of LLMs at Groq
Satnam Singh, Fellow, Groq – Monday, 13th January, 17:00
Groq’s Language Processing Unit (LPU) chips are reshaping the landscape of large language model (LLM) deployment at scale. By prioritizing low latency and high throughput, our hardware and software stack enables rapid and efficient inference, making it a good fit for applications where LLMs must be invoked repeatedly by agents, e.g., for solving mathematical problems. In this talk, I will describe the architecture of Groq’s LPU chips, which leverage deterministic execution and distributed SRAM to deliver high performance with low latency and high throughput. I will explain how the determinism property allows us to deploy open-weight models such as Llama3-70B, Gemma2, and Mixtral 8x7B with predictable, scalable performance.
Analyzing metastable failures
Rebecca Isaacs, Amazon AWS – Tuesday, 14th January, 9:00
Metastable failures are congestive collapses in which the system does not recover after a transient stressor, such as increased load or diminished capacity, subsides. They are rare, but potentially catastrophic if the failure cascades across interdependent microservices, and they are notoriously hard to diagnose and mitigate, sometimes causing prolonged outages affecting millions of users. Standard resiliency mechanisms, including retry with exponential backoff, load shedding, and queue bounds, are important components of defense-in-depth against metastable failures. However, it is challenging for a service operator to configure these mechanisms appropriately while balancing performance and availability requirements. Even worse, there is no way for operators to have confidence that a given set of defensive mechanisms is sufficient to prevent future metastable failures. In this talk, I will describe how we are tackling this problem at AWS with a suite of tools ranging from modeling the system as a continuous-time Markov chain, to discrete-event simulation, to emulation in the cloud.