RumbleDB for teaching
We have developed the RumbleDB engine, used in the Big Data and Big Data for Engineers lectures to teach querying large, nested, heterogeneous datasets at scale.
The growth in data experienced in the past 20 years has largely brought relational database management systems to their limits: today, much of the data is not fully structured as tables or even as cubes. Rather, a lot of data is semi-structured. Such denormalized data, which does not fit in SQL products, is more appropriately described as heterogeneous collections of trees.
Furthermore, denormalized data is not necessarily stored in fully managed databases or data stores. It is increasingly "dropped" in data lakes (such as S3, HDFS, Azure Blob Storage, etc.) for later processing.
We are committed to formally teaching our students -- in Computer Science, in Data Science but also across many ETH departments -- how to process, store and query denormalized data, whether in data stores or in data lakes, whether locally or on a cluster of machines, without sacrificing any of the fundamental principles of databases set out in the 1970s, in particular data independence.
RumbleDB was designed specifically to support our teaching. It is an engine that implements the JSONiq language (also co-designed by a member of the Systems Group) and maps its execution seamlessly to Apache Spark, smartly distributing the workload on a cluster or even on all the cores of a single laptop.
As JSONiq is a functional and declarative language, we can focus our students on data modelling, validation and formal, high-level querying while shielding them from the distraction of low-level physical execution details that they would otherwise face when using DataFrame APIs or Apache Spark from host languages (Java, Scala, Python).
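To make the contrast concrete, here is a minimal sketch of the kind of manual handling that heterogeneous, nested data tends to require in a host language. It uses plain Python rather than Spark, and the sample records, field names and the function `languages_by_city` are all hypothetical, invented for illustration only.

```python
import json

# Hypothetical sample: a heterogeneous collection of JSON objects, one per line.
RAW_LINES = [
    '{"name": "Ada", "languages": ["en", "fr"], "address": {"city": "Zurich"}}',
    '{"name": "Bob", "languages": "de"}',
    '{"name": "Eve", "address": {"city": "Basel"}}',
]


def languages_by_city(lines):
    """Group spoken languages by city, tolerating missing or differently typed fields."""
    result = {}
    for line in lines:
        record = json.loads(line)
        # The address subtree may be absent altogether.
        city = record.get("address", {}).get("city", "unknown")
        # The same field may hold a string, an array, or nothing at all.
        langs = record.get("languages", [])
        if isinstance(langs, str):
            langs = [langs]
        result.setdefault(city, []).extend(langs)
    return result


grouped = languages_by_city(RAW_LINES)
print(grouped)
```

Every special case (a missing subtree, a string where an array was expected) has to be spelled out by hand here, before any thought is given to distributing the work. A declarative query language aims to let students express the same intent without this bookkeeping.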
RumbleDB has reached a level of maturity high enough that we are now using it actively in computer-based exams for the two courses.
Teaching material
Complete recording of the Big Data for Engineers lecture (broad, non-CS audience)
Complete recording of the Big Data lecture (Computer Science and Data Science audience)
Publicly available exercise material (including practical exercises on the Azure platform, Jupyter notebooks, etc.) of the Big Data lecture
Slides of the Big Data for Engineers lecture (you are welcome to use them for your own courses as long as you give appropriate credit)
Publications
Contributors
Founders
- Ghislain Fourny
- Stefan Irimescu
- Gustavo Alonso
Current contributors
- Ingo Müller
- Dan-Ovidiu Graur
- Elwin Stephan
- Can Berker Çıkış
- Pierre Motard
Past contributors
- Mario Arduini
- Stefan Irimescu
- Renato Marroquin
- Rodrigo Bruno
- Falko Noé
- Ioana Stefan
- Andrea Rinaldi
- Stevan Mihajlovic