COMPASS Talks

The Computing Platforms Seminar Series (COMPASS) features talks from industry and academia on the general topic of computing platforms.

LegoAI: Auto-Scaling Large Model Training

Abstract:

Training large AI models is highly complex and resource-intensive, often requiring extensive manual tuning and experimentation across a fragmented ecosystem of distributed training frameworks. This talk introduces LegoAI, a system that automates and unifies distributed AI training by automatically selecting optimal configurations and synthesizing scalable, production-ready implementations for any given model, dataset, and hardware setup. LegoAI decomposes state-of-the-art training strategies into modular components and recombines them, enabling both the deployment of existing algorithms and the creation of entirely new ones. Through high-fidelity simulation, it accurately predicts memory usage and runtime, allowing safe and efficient exploration of the configuration space. Evaluations show that LegoAI delivers significant speedups and accurate predictions and can synthesize novel, memory-efficient training algorithms, ultimately reducing the cost, complexity, and uncertainty of large-scale AI training.

Bio:

I am a Senior Research Engineer with the PyTorch team at Meta. Before joining PyTorch, I completed my PhD at Harvard University, advised by Prof. Stratos Idreos. My research focuses on systems for deep learning, specifically optimizing training and inference workloads by leveraging operating systems, compiler technologies, and computer architecture. Previously, I was a Visiting Researcher with Meta’s PyTorch Distributed team, where I contributed to the PyTorch Distributed stack, developing near-optimal auto-parallel solutions that improve scalability and efficiency in distributed deep learning by balancing compute, memory, and communication. I also collaborated with Meta’s PyTorch Compilers team as a Student Researcher, working on compiler-level graph optimizations to reduce communication overhead and enhance scalability. Earlier, I interned at Microsoft Research India and worked as a Research Associate at the Indian Institute of Science (IISc), where I earned my master's degree, focusing on query optimization and robust algorithms.

 


Past COMPASS talks
