Ana Klimovic gave an invited talk at the 48th International Conference on Very Large Data Bases (VLDB)

The talk was part of the VLDB 2022 Scalable Data Science session.

Title: Input Data Processing for Machine Learning

Abstract:

Processing input data plays a vital role in ML training. We analyze millions of ML jobs running in Google's fleet and find that the input pipeline, which is responsible for feeding data-hungry GPUs/TPUs with training examples, significantly impacts end-to-end training time and cost. Our characterization of input data pipelines motivates several system design explorations, such as disaggregating input data processing from model training and caching commonly recurring input data computation subgraphs. Hence, we present Cachew, a fully managed multi-tenant service for ML data processing that builds on tf.data. Cachew dynamically scales distributed resources for data processing to avoid stalls in training jobs. The service also selectively caches source data and/or preprocessed data to maximize training throughput and minimize cost within and across jobs. Cachew's key contributions are its autoscaling and autocaching policies, which optimize training time and cost by leveraging domain-specific metrics collected at data workers and training clients, rather than generic resource utilization metrics. We conclude with a discussion of open research questions for ML input data management and processing.
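The autocaching idea in the abstract can be illustrated with a minimal sketch in plain Python. This is not the Cachew or tf.data API; all function names here are hypothetical stand-ins, and the point is only that caching preprocessed examples lets later epochs skip redundant input-pipeline work.

```python
# Illustrative sketch only: caching preprocessed examples so repeated
# epochs avoid recomputing the same input-pipeline work.
# All names are hypothetical; this is not the Cachew/tf.data API.

def expensive_preprocess(example, counter):
    # Stand-in for decode/augment steps that would run on data workers.
    counter["calls"] += 1
    return example * 2

def run_epoch(source, cache, counter):
    # Serve each example from the cache when its preprocessed form
    # already exists; otherwise compute it once and store it.
    out = []
    for example in source:
        if example not in cache:
            cache[example] = expensive_preprocess(example, counter)
        out.append(cache[example])
    return out

source_data = [1, 2, 3, 2, 1]
cache = {}
counter = {"calls": 0}

epoch1 = run_epoch(source_data, cache, counter)
epoch2 = run_epoch(source_data, cache, counter)  # fully served from cache

print(epoch1)             # [2, 4, 6, 4, 2]
print(counter["calls"])   # preprocessing ran once per distinct example: 3
```

A real service must additionally decide *whether* caching pays off (cache reads can be slower than recomputation for cheap transformations), which is what Cachew's autocaching policy addresses using metrics from data workers and training clients.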

 
