Computer System Design for Machine Learning

Machine learning (ML) jobs are an increasingly important class of applications in the cloud. Across domains such as image understanding and text translation, scaling models to larger parameter counts has been shown to dramatically improve accuracy, provided sufficiently large datasets are available. While significant work has focused on optimizing hardware and software for ML computations, data management remains a common bottleneck. As organizations collect massive amounts of data, storing and ingesting data at this scale poses several challenges.
Research topics: How should we design distributed storage systems for machine learning to optimize end-to-end model training and inference? How can we avoid moving large amounts of data across the network — should we instead move computation closer to the data (near-storage computing)? How can multiple tenants safely share datasets and models with good performance guarantees?
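To make the ingestion bottleneck concrete, the sketch below (not part of the original text; fetch_batch and train_step are hypothetical stand-ins for reads from remote storage and a model update) overlaps prefetching the next batch with compute on the current one. Even with prefetching, training throughput stays bound by storage and network latency whenever fetching a batch takes longer than computing on one, which is the situation near-storage computing and better storage system design aim to avoid.

```python
# Minimal sketch, assuming fetch latency (0.05s) exceeds compute time (0.01s):
# prefetching hides some I/O, but steps remain ingest-bound.
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(i):
    """Simulate pulling one batch over the network from remote storage."""
    time.sleep(0.05)           # storage/network latency dominates in this sketch
    return [i] * 1024          # placeholder batch contents


def train_step(batch):
    """Simulate the accelerator compute for one training step."""
    time.sleep(0.01)           # compute is faster than ingest here


def train(num_batches=20):
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_batch = pool.submit(fetch_batch, 0)
        for i in range(num_batches):
            batch = next_batch.result()                       # wait for ingest
            if i + 1 < num_batches:
                next_batch = pool.submit(fetch_batch, i + 1)  # prefetch next batch
            train_step(batch)                                 # overlap compute with I/O


if __name__ == "__main__":
    start = time.time()
    train()
    # Elapsed time tracks the 0.05s fetch, not the 0.01s compute: the job is ingest-bound.
    print(f"elapsed: {time.time() - start:.2f}s")
```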
