Learning to be a Statistician: Learned Estimator for Number of Distinct Values
Abstract:
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems, such as columnstore compression and data profiling. In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples. Such efficient estimation is critical for tasks where it is prohibitive to scan the data even once. Existing sample-based NDV estimators do not have robust performance across different datasets. One of the most important challenges is that the space of closed-form estimators considered for NDV is restrictive; on the other hand, there are negative results on how sample-based estimators perform in the worst case.
We propose to formulate the NDV estimation task as a supervised learning problem, and aim to learn a (deep) model as the estimator. To this end, we need to answer several questions: i) what the features should be; ii) where we can obtain the training data; iii) how to make the model aware of negative results on estimation accuracy in theory. We introduce how to learn such a workload-agnostic estimator, which is trained on synthetic datasets, however, a bit surprisingly, works well on real-world datasets. Thus, the model/estimator can be trained outside the database, and then deployed into database systems simply as, e.g., user-define functions (UDFs), to offer efficient (within microseconds) and accurate NDV estimations for unknown tables and workloads. We believe our approach can be also generalized for estimating other metrics.
Short Bio:

Dr. Bolin Ding is a research scientist in the Data Analytics and Intelligence Lab at Alibaba Group. He completed his Ph.D. in Computer Science at University of Illinois at Urbana-Champaign, M.Phil. in Systems Engineering and Engineering Management at The Chinese University of Hong Kong, and B.S. in Math and Applied Mathematics at Renmin University of China. Prior to joining Alibaba, he worked as a researcher in Microsoft Research. His research focuses on the management and analytics of large-scale data. More recently, he works on topics related to SysML (System + ML) and EconML (Economics + ML) with interdisciplinary challenges arising from data platforms and eCommerce platforms.