Optimizing Compiler Infrastructure for ML Pipelines

ABSTRACT:

Existing machine learning (ML) systems focus primarily on efficient model training and scoring. However, the data science process is exploratory and deals with underspecified objectives, a range of different ML algorithms, and a wide variety of heterogeneous data sources with errors and inconsistencies. In the first part of this talk, we give an overview of Apache SystemDS, an open-source ML system for the end-to-end data science lifecycle, from data integration, cleaning, and preparation, through local and distributed model training, to debugging and serving. A key observation is that state-of-the-art data preparation and model debugging techniques are themselves largely based on ML.
This observation motivates a stack of language abstractions on top of ML systems for different data science lifecycle tasks and users, as well as optimizing compiler infrastructure for eliminating the increasing computational redundancy. We describe the overall system architecture of SystemDS and key compilation techniques, as well as new results on fine-grained lineage tracing and reuse, federated learning, and model debugging. In the second part of this talk, we discuss the extended vision of the DAPHNE project, a joint effort of several research groups to build system infrastructure for integrated data analysis pipelines spanning data management and query processing, machine learning, and high-performance computing. We discuss the MLIR-based compiler, as well as ongoing work toward extensibility for operations and data types and toward integrating computational storage and heterogeneous accelerator devices.

Matthias Boehm

BIO:

Matthias Boehm is a BMK-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. His cross-organizational research group focuses on high-level, data science-centric abstractions as well as systems and tools to execute these tasks in an efficient and scalable manner. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany, in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing. Matthias is a recipient of the 2016 VLDB Best Paper Award, a 2016 SIGMOD Research Highlight Award, and a 2016 IBM Pat Goldberg Memorial Best Paper Award.
