Data Processing in the Cloud

We are exploring the architecture of data processing in the cloud and what cloud-native engines should look like on the fast-evolving computing platforms used in modern data centers and cloud systems. Our efforts range from infrastructure aspects, such as improving current serverless offerings, to the architecture of cloud-native engines, and include open-source engines for processing semi-structured data in JSON.

Projects

Boxer
Evaluating query languages and systems for high-energy physics data
Lambada
Modularis
Query Sharing
RumbleDB
Semi-Structured Data Querying on Fully-Managed Cloud Databases

Boxer

Wawrzoniak, Michal; Müller, Ingo, Dr.; Bruno, Rodrigo, Prof.; Klimovic, Ana, Prof. Dr.

Serverless is an attractive platform for various applications in the cloud due to its promise of elasticity, low cost, and fast deployment. Instead of using traditional virtual machine services and a fixed infrastructure, which incurs considerable costs to operate and run, Function-as-a-Service allows triggering fast computations on demand with the price proportional to the time the functions are running. As appealing as the idea is, recent work has shown that for data processing applications (whether OLTP, OLAP, or ML), existing serverless platforms are inadequate and additional services are needed in practice, often to address the lack of communication capabilities between functions. In this paper, we demonstrate how to enable function-to-function communication using conventional TCP/IP and show how the ability to communicate can be used to implement data processing on serverless platforms more efficiently than was possible until now. Our benchmarks show a speedup as high as 11× in TPC-H queries over systems that use cloud storage to communicate across functions, sustained function-to-function throughput of 621 Mbit/s, and a round-trip latency of less than 1 ms.

Evaluating query languages and systems for high-energy physics data

Graur, Dan-Ovidiu; Müller, Ingo, Dr.; Fourny, Ghislain, Dr.; Proffitt, Mason; Watts, Gordon

In high-energy physics (HEP), query languages in general, and SQL in particular, have found limited acceptance. This is surprising since HEP data analysis matches the SQL model well: the data is fully structured and queried mainly using standard operators. To gain insights into why this is the case, we comprehensively analyze six diverse, general-purpose data processing platforms using a HEP benchmark. The result of the evaluation is an exciting and rather complex picture of existing solutions: their query languages vary significantly in how naturally and concisely HEP query patterns can be expressed. Furthermore, most of them are also between one and two orders of magnitude slower than the domain-specific system used by particle physicists today. These observations suggest that, while database systems and their query languages are, in principle, viable tools for HEP, significant work remains to make them relevant to HEP researchers.

See it on GitHub

Lambada

Müller, Ingo, Dr.; Marroquín, Renato

Serverless computing has recently attracted much attention from research and industry due to its promise of ultimate elasticity and operational simplicity. However, there is no consensus yet on whether or not the approach is suitable for data processing. In this paper, we present Lambada, a serverless distributed data processing framework designed to explore how to perform data analytics on serverless computing. We address several critical technical questions that need to be solved to support data analytics and present examples from several domains where serverless offers a cost and performance advantage over existing solutions. In our analysis, supported by extensive experiments, we show in which scenarios serverless makes sense from an economic and performance perspective.

Modularis

Koutsoukos, Dimitrios; Müller, Ingo, Dr.; Marroquín, Renato; Klimovic, Ana, Prof. Dr.

The enormous quantity of data produced daily and advances in data analytics have led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e., composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the feasibility and advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code to maintain. Changes in the platform affect only those sub-operators that depend on the underlying hardware (in our use cases, mainly the sub-operators related to network communication). We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a broader range of join variants and group-by queries, none of which are supported in the hand-tuned join.
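The sub-operator idea can be illustrated with a minimal Python sketch. It is not Modularis code: the function names (`scan`, `hash_partition`, `local_exchange`, `build`, `probe`, `partitioned_join`) and the sample rows are hypothetical, chosen only to show how a join can be assembled from small building blocks so that only the exchange step depends on the platform.

```python
from collections import defaultdict

# Hypothetical sub-operators; names and signatures are illustrative only.

def scan(rows):
    """Produce rows one at a time from an in-memory source."""
    yield from rows

def hash_partition(rows, key, n):
    """Split rows into n partitions by the hash of the join key."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return [parts[i] for i in range(n)]

def local_exchange(partitions):
    """Platform-specific step. Here it is an in-process hand-off; an RDMA or
    serverless variant would replace only this function."""
    return partitions

def build(rows, key):
    """Build a hash table on the join key."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def probe(table, rows, key):
    """Probe the hash table and emit merged rows for each match."""
    for row in rows:
        for match in table.get(row[key], []):
            yield {**match, **row}

def partitioned_join(left, right, key, n, exchange):
    """Compose the sub-operators into a partitioned hash join."""
    left_parts = exchange(hash_partition(scan(left), key, n))
    right_parts = exchange(hash_partition(scan(right), key, n))
    out = []
    for lp, rp in zip(left_parts, right_parts):
        out.extend(probe(build(lp, key), rp, key))
    return out

orders = [{"okey": 1, "cust": "a"}, {"okey": 2, "cust": "b"}]
items = [{"okey": 1, "qty": 5}, {"okey": 1, "qty": 7}, {"okey": 3, "qty": 1}]
joined = partitioned_join(orders, items, "okey", 2, local_exchange)
```

In this sketch, porting the join to another platform would mean passing a different `exchange` function while `build`, `probe`, and `hash_partition` stay unchanged.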

Query Sharing

Marroquín, Renato; Müller, Ingo, Dr.; Makreshanski, Darko

Cloud-based data analysis is common practice because of the lower system management overhead and the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries that frequently access the same data can become expensive. The problem is compounded by the limited options for the user to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting so that multiple concurrent queries are combined into a single query. Our experiments show that the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as running a single one. As an example, we demonstrate how the shared execution of the TPC-H benchmark is up to 100x cheaper in Amazon Athena and up to 16x cheaper in BigQuery than a query-at-a-time approach, while achieving higher throughput.
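The rewriting idea can be sketched in a few lines of Python. This is not the rewriting used in the paper: it covers only the simplest case, several `COUNT(*)` queries with different filters over the same table, and it uses SQLite as a stand-in for a query-as-a-service system. The helper `share_count_queries`, the table contents, and the predicates are all made up for illustration.

```python
import sqlite3

def share_count_queries(table, predicates):
    """Combine several COUNT(*) queries over one table into a single query.

    Each predicate becomes a conditional aggregate, so the table is scanned
    once instead of once per query.
    """
    columns = ", ".join(
        f"SUM(CASE WHEN {pred} THEN 1 ELSE 0 END) AS q{i}"
        for i, pred in enumerate(predicates)
    )
    return f"SELECT {columns} FROM {table}"

# Tiny stand-in table; the rows are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineitem (quantity INTEGER, discount REAL)")
conn.executemany(
    "INSERT INTO lineitem VALUES (?, ?)",
    [(10, 0.05), (30, 0.02), (45, 0.07), (5, 0.00)],
)

predicates = ["quantity < 24", "discount >= 0.05", "quantity > 40"]

# Query-at-a-time: one scan (and one per-byte charge) per query.
separate = [
    conn.execute(f"SELECT COUNT(*) FROM lineitem WHERE {p}").fetchone()[0]
    for p in predicates
]

# Shared execution: a single scan answers all three queries.
shared_sql = share_count_queries("lineitem", predicates)
shared = list(conn.execute(shared_sql).fetchone())
```

Under per-byte pricing, the three separate queries would each be billed for reading `lineitem`, whereas the combined query reads it once and returns the same three counts.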

RumbleDB

Fourny, Ghislain, Dr.; Dao, David; Cikis, Can Berker; Irimescu, Stefan; Zhang, Ce, Prof. Dr.

With the ever-increasing amount of semi-structured data being produced in the world, novel approaches to evaluating and processing large volumes of such data are necessary. RumbleDB is a querying engine developed in-house at ETH Zürich that allows you to do just that: query your large, messy datasets with ease and productivity using the JSONiq query language. RumbleDB supports JSON-like datasets in file formats such as JSON, JSON Lines, Parquet, Avro, SVM, CSV, and ROOT, as well as text files. RumbleDB can process datasets of any size from kB to TB (and even beyond). It runs on many local or distributed file systems such as HDFS, S3, Azure blob storage, and HTTP (read-only), and of course your local drive as well. You can use any of these file systems to store your datasets, but also to store and share your queries and functions as library modules with other users, worldwide or within your institution. The core of RumbleDB lies in JSONiq's FLWOR expressions, the semantics of which naturally map to DataFrames and Spark SQL. Likewise, expression semantics are seamlessly translated to transformations on RDDs or DataFrames, depending on whether a structure is recognized or not. Transformations are not exposed as function calls, but are completely hidden behind JSONiq queries, giving the user the simplicity of an SQL-like language and the flexibility needed to query heterogeneous, tree-like data that does not fit in DataFrames.
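To give a feel for what a FLWOR expression computes, here is a rough Python analogue of its for, where, order by, and return clauses over heterogeneous JSON Lines records. It is neither JSONiq syntax nor RumbleDB's API, and it runs in plain Python rather than on Spark; the function `large_items`, the field names, and the sample records are invented for the example.

```python
import json

# Heterogeneous records: "tags" may be missing and "size" is not always a number.
json_lines = [
    '{"name": "a", "tags": ["x", "y"], "size": 3}',
    '{"name": "b", "size": "unknown"}',
    '{"name": "c", "tags": ["z"], "size": 10}',
    '{"name": "d", "size": 7}',
]

def large_items(lines, min_size):
    """Python analogue of a FLWOR expression over JSON Lines input."""
    # for: bind each parsed record in turn
    records = (json.loads(line) for line in lines)
    # where: keep records whose size is an integer of at least min_size
    kept = (
        r for r in records
        if isinstance(r.get("size"), int) and r["size"] >= min_size
    )
    # order by: largest size first
    ordered = sorted(kept, key=lambda r: r["size"], reverse=True)
    # return: construct a new object per record, tolerating missing fields
    return [
        {"name": r["name"], "tag-count": len(r.get("tags", []))}
        for r in ordered
    ]

result = large_items(json_lines, 5)
```

The records with a non-numeric or small `size` are filtered out without raising errors, which is the kind of tolerance for messy input that the paragraph above describes.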

See it on GitHub

Semi-Structured Data Querying on Fully-Managed Cloud Databases

Graur, Dan-Ovidiu Fourny, Ghislain, Dr.

Fully-managed cloud databases have the advantage of providing high degrees of scalability and performance while removing the burden of deploying and managing the database system from the practitioner. Cloud-native database systems are generally built with relational data in mind. However, they offer promising avenues for research into how semi-structured heterogeneous data can fit into this landscape. In this line of work, we are researching exciting ways to bridge the gap between cloud computing and semi-structured data.