Towards Automated Validation and Inspection of Machine Learning Pipelines

ABSTRACT:

Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policy makers, scientists, and the media. ML applications are often very brittle with respect to their input data, which raises concerns about their reliability, accountability, and fairness. I will introduce some of the practical problems in this area and give an overview of two recent approaches to tackling them.
Deequ is a library for automating the verification of data quality at scale. It provides a declarative API, which combines common quality constraints with user-defined validation code, and thereby enables ‘unit tests’ for data. Deequ efficiently executes the resulting constraint validation workload by translating it to aggregation queries on Apache Spark, and also supports the incremental validation of data quality on growing datasets.
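To make this concrete, here is a minimal sketch of such a 'unit test' for data, written against PyDeequ, the Python interface to Deequ. The dataset, column names, and thresholds below are illustrative placeholders, not part of the abstract:

    # Declarative data quality checks with PyDeequ (illustrative example)
    from pyspark.sql import SparkSession, Row
    import pydeequ
    from pydeequ.checks import Check, CheckLevel
    from pydeequ.verification import VerificationSuite, VerificationResult

    spark = (SparkSession.builder
             .config("spark.jars.packages", pydeequ.deequ_maven_coord)
             .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
             .getOrCreate())

    # Hypothetical product review data
    df = spark.createDataFrame([
        Row(review_id="r1", rating=5, marketplace="US"),
        Row(review_id="r2", rating=3, marketplace="DE"),
        Row(review_id="r3", rating=None, marketplace="US"),
    ])

    check = (Check(spark, CheckLevel.Error, "review data quality")
             .hasSize(lambda size: size >= 3)    # dataset is not empty
             .isComplete("review_id")            # no missing ids
             .isUnique("review_id")              # ids are unique
             .isContainedIn("marketplace", ["US", "DE", "UK"]))

    # Deequ translates these constraints into aggregation queries on Spark
    result = (VerificationSuite(spark)
              .onData(df)
              .addCheck(check)
              .run())

    VerificationResult.checkResultsAsDataFrame(spark, result).show()

The resulting report lists, per constraint, whether it held on the data, which is what allows such checks to run as automated tests before a dataset is consumed downstream.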
mlinspect is a library that enables the lightweight lineage-based inspection of ML preprocessing pipelines. The key idea is to extract a directed acyclic graph representation of the dataflow from ML preprocessing pipelines in Python, and to use this representation to automatically instrument the code with predefined inspections based on a lightweight annotation propagation approach. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation.
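For illustration, a sketch of how a pipeline might be inspected with mlinspect, following the usage examples in the project's repository; the pipeline file, the sensitive column 'race', and the chosen checks are assumed examples:

    # Lineage-based inspection of an ML preprocessing pipeline with mlinspect
    from mlinspect import PipelineInspector
    from mlinspect.inspections import HistogramForColumns
    from mlinspect.checks import NoBiasIntroducedFor, NoIllegalFeatures

    inspector_result = (PipelineInspector
        .on_pipeline_from_py_file("pipeline.py")       # hypothetical pipeline script
        .add_check(NoBiasIntroducedFor(["race"]))      # watch a sensitive attribute
        .add_check(NoIllegalFeatures())
        .add_required_inspection(HistogramForColumns(["race"]))
        .execute())

    # The extracted dataflow DAG and the per-operator inspection results
    extracted_dag = inspector_result.dag
    inspection_results = inspector_result.dag_node_to_inspection_results
    check_results = inspector_result.check_to_check_results

Note that the pipeline script itself stays unchanged: mlinspect instruments the declarative estimator/transformer operations it finds in the code, which is the contrast to manual instrumentation mentioned above.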

SHORT BIO:

Sebastian Schelter is an Assistant Professor at the University of Amsterdam, conducting research at the intersection of data management and machine learning. He is affiliated with the Intelligent Data Engineering Lab and manages the AI for Retail Lab. His work addresses data-related problems that occur in real-world applications of machine learning, such as the automation of data quality validation, the inspection of machine learning pipelines via code instrumentation, and the design of machine learning applications that can efficiently forget data. Most of his research is accompanied by efficient and scalable open source implementations, many of which are applied in real-world use cases, for example in the Amazon SageMaker Model Monitor service. In the past, he was a Faculty Fellow with the Center for Data Science at New York University and a Senior Applied Scientist at Amazon Research, after obtaining his Ph.D. in the database group of TU Berlin under Volker Markl. He is active in open source as an elected member of the Apache Software Foundation, and has extensive experience in building real-world systems from his time at Amazon, Twitter, IBM Research, and Zalando.