Lakehouse: A New Generation of Platforms for Data Warehousing and Advanced Analytics
Abstract:
The top two challenges for data users in enterprises are data quality and staleness. While building reliable data pipelines is inherently hard, many of today's problems stem from the complex data architectures that organizations deploy. These architectures contain many systems—data lakes, message queues, and data warehouses—that data must pass through, and each transfer step adds delay and another potential source of errors. What if we could remove all these steps? In recent years, cloud storage and new open source systems have enabled a new architecture: the lakehouse, an ACID transactional layer over cloud data lake storage that can provide management features, indexing, and high SQL performance similar to a data warehouse. In addition, because they build on open storage formats and direct file access, lakehouses support AI, data science, and streaming workloads that are difficult to run on data warehouses. Thousands of organizations, including the largest Internet companies, are now using the lakehouse model to replace separate data lake, warehouse, and streaming systems. I'll discuss the key trends and research challenges in this area based on my experience at Databricks and with the open source Delta Lake project.
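To make the core idea concrete, below is a minimal sketch of a lakehouse table using PySpark with the open source delta-spark package. The application name and the local path /tmp/lakehouse/events are illustrative assumptions; in a real deployment the path would point at a cloud object store such as S3 or ADLS.

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    # Configure Spark to use Delta Lake as the transactional table layer.
    spark = (
        SparkSession.builder
        .appName("lakehouse-sketch")
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Write a table in the Delta format: ordinary Parquet data files plus a
    # transaction log that provides ACID guarantees over the data lake.
    df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
    df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/events")

    # Update the table in place; the transaction log makes the change atomic.
    events = DeltaTable.forPath(spark, "/tmp/lakehouse/events")
    events.update(condition="event_id < 10",
                  set={"event_id": "event_id + 1000"})

    # The data files remain open-format Parquet with direct file access, so
    # SQL engines, ML training jobs, and streaming readers share one copy.
    spark.read.format("delta").load("/tmp/lakehouse/events").show(5)

Because the same files serve SQL, machine learning, and streaming readers, one storage layer can stand in for the separate lake, warehouse, and streaming systems described above.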

Bio:
Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley and has worked on other widely used open source data analytics and AI software, including MLflow and Delta Lake. At Stanford, he is a co-PI of the DAWN lab, which focuses on infrastructure for machine learning. Matei's research has been recognized with the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the US government on early-career scientists and engineers.