Ana Klimovic gave an invited talk at SIGMOD about Storage & Input Data Processing Design for Machine Learning

Ana Klimovic gave an invited talk about Storage & Input Data Processing Design for Machine Learning at SIGMOD 2021.

Abstract:

Machine learning applications have sparked the development of specialized software frameworks and hardware accelerators. Yet, in today’s machine learning ecosystem, one important part of the system stack has received far less attention and specialization for ML: how we store and preprocess training data. This talk describes some key challenges for implementing high-performance ML input data processing pipelines. We analyze millions of ML jobs running in Google's fleet and find that input pipeline performance significantly impacts end-to-end training performance and resource consumption. Our study shows that ingesting and preprocessing data on-the-fly during training consumes 30% of end-to-end training time, on average. Our characterization of input data pipelines motivates several systems research directions, such as disaggregating input data processing from model training and caching commonly reoccurring input data computation subgraphs.




 
