Democratizing Data Science

The database community often focuses on improving the performance of database systems, which is (and always will be) of critical importance. This focus continues to drive innovation, unlocking ever larger datasets, more challenging workloads, and near-instantaneous results. Yet, people outside the database community struggle to harness the treasure trove of modern database systems: It takes hours and expert knowledge to set up a system, load data, and run workloads, wasting time and delaying insights. Therefore, the database community has started widening its focus, as highlighted in a recent industry perspective:

“The important feature of a database is how quickly you can go from idea to answer, not query to result.” – January 2024, MotherDuck, Jordan Tigani

and at last year’s SIGMOD panel discussion:

“We are asking the world to bring their data into our format. We need to work on the entire data science pipeline!” – June 2025, SIGMOD Berlin, Sihem Amer-Yahia

In this article, we explore why this new frontier matters and present Dataloom, our research prototype designed to help open the black box of data systems to a broader audience.

New Metrics: Time-To-Insight and Accessibility

The task of a data scientist is to answer questions based on data. Given a dataset, this involves many steps before an answer can be formulated: understanding the data, loading it, cleaning it, writing queries, visualizing results, and usually many iterations of each step. Sharply put, a database that processes terabytes in milliseconds is useless to a user who cannot figure out how to load their data into it. In more technical terms, we can consider Amdahl’s law: If it takes an hour to understand, load, and clean data, the benefit of speeding up the runtime of a 1-second query by 10x is rather limited. In these interactive scenarios, time-to-insight (i.e., the time it takes to go from question to answer) is critical. Improving this process enables users to answer questions while they are still relevant and allows shifting employee cycles from tedious tasks to the interesting and fun parts of data science.
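
To make the Amdahl’s law argument concrete, here is a back-of-the-envelope calculation using the illustrative numbers from above (an hour of preparation, a one-second query, a 10x faster engine):

```python
# Back-of-the-envelope Amdahl's law calculation with the numbers from above:
# one hour of understanding/loading/cleaning, then a 1-second query.
prep_seconds = 3600       # manual data preparation
query_seconds = 1.0       # query runtime today
query_speedup = 10        # a 10x faster query engine

before = prep_seconds + query_seconds
after = prep_seconds + query_seconds / query_speedup

print(f"time-to-insight: {before:.1f}s -> {after:.1f}s")
print(f"end-to-end speedup: {before / after:.4f}x")  # roughly 1.0002x
```

A 10x faster query engine barely moves the needle on time-to-insight; cutting the hour of preparation down to minutes would.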

At the same time, working with data is becoming omnipresent in professional as well as personal life. Sticking with well-known metaphors: Data shares not only the value of oil but also its fluid nature, seeping into all domains and across skill levels. Nowadays, most aspects of an organization (controlling, reporting, monitoring, and research) heavily rely on data. Even individually, people track finances, health, fitness, and more. In both settings, an increasing number of people, especially non-experts, want to work with data. Improving the ease-of-use of well-studied, schema-rich, and scalable data processing systems (i.e., database systems) can thus benefit a broad spectrum of people (even veteran data scientists know how much time is spent wrangling with formats or missing values).

TL;DR: time-to-insight is an important metric for more and more people.

Case Study: Data Engineering Class of 2023

Our Master’s students at UTN have a data engineering class where they learn to build scalable data pipelines. As part of their final project, we ran TPC-H on Amazon Redshift and gave them a dump of all log files in CSV format. Their task was to figure out which workload we ran and prepare a dashboard with their analysis.

The first major hurdle for the students was loading the dataset into an analytical database. They had to figure out which CSV files belonged to which table, infer the schema and column names, write the SQL CREATE TABLE statements, and then load the data into the system. The result: Lots of frustration, copious amounts of “uninspired” boilerplate code for loading, and delayed results.
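
For a flavor of that boilerplate, here is a sketch written against DuckDB as an example target system; the file, table, and column names are made up, since in the real exercise they had to be reverse-engineered from headerless CSV dumps by hand:

```python
# Hand-written loading boilerplate of the kind the students produced
# (illustrative only: file, table, and column names here are hypothetical).
import duckdb

con = duckdb.connect("redshift_logs.db")

# Work out the schema by staring at the raw CSV, then spell it out by hand...
con.execute("""
    CREATE TABLE query_log (
        query_id    BIGINT,
        user_name   VARCHAR,
        start_time  TIMESTAMP,
        duration_ms DOUBLE,
        query_text  VARCHAR
    )
""")
# ...and copy the matching headerless CSV dump into it.
con.execute("COPY query_log FROM 'query_log_000.csv' (FORMAT csv, HEADER false)")

# Repeat for every other table and every other file in the dump.
```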

Enter: Dataloom

On the very same day, we hacked together the first version of Dataloom: A simple prototype that, in its earliest form, combined LLMs and classical algorithms (CSV parsing, type inference, etc.) to make the loading process seamless and fast. How did it do this? Given a chaotic set of files, we used an LLM to figure out which files belong to which table and how to name the columns, and to create a natural language description of each table: all tasks that used to require human intervention, as traditional algorithms are very bad at them. We then ran some classical schema detection algorithms and loaded everything into DuckDB. Thus we transformed an hour-long process that stood between students and their weekend into a quick and easy-to-use tool … that worked 80% of the time ;).
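
The sketch below illustrates that division of labor. It is not Dataloom’s actual code, and ask_llm is a hypothetical stand-in for whatever LLM API one would use; the point is simply that the LLM handles the fuzzy decisions (grouping files, naming columns, describing tables) while classical CSV sniffing and the database handle parsing, typing, and loading:

```python
# Illustrative sketch only (not Dataloom's implementation): an LLM makes the
# "fuzzy" decisions, DuckDB's CSV sniffer and loader do the rest.
import glob
import json
import duckdb

def ask_llm(prompt: str) -> str:
    """Hypothetical helper wrapping whatever LLM API is in use."""
    raise NotImplementedError

files = glob.glob("dump/*.csv")
samples = {f: open(f).readline().strip() for f in files}  # first line of each file

# Step 1 (LLM): group files into tables, propose table/column names and a description.
plan = json.loads(ask_llm(
    "Group these CSV files into tables. For each table, return JSON with "
    "'table', 'columns', 'description', and 'files'. File samples:\n"
    + json.dumps(samples)
))

# Step 2 (classical): let DuckDB sniff delimiters/types and load each table.
con = duckdb.connect("loomed.db")
for t in plan:
    file_list = ", ".join(f"'{f}'" for f in t["files"])
    con.execute(
        f"CREATE TABLE {t['table']} AS "
        f"SELECT * FROM read_csv([{file_list}], header = false, names = {t['columns']})"
    )
    print(f"Loaded {t['table']}: {t['description']}")
```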

Building Dataloom, we quickly realized that LLMs can be unpredictable at times and don’t always churn out the correct results in the intended way. However, we were not daunted but instead inspired to make Dataloom into a system that gives users easy access to data analytics. Initially, we focused on schema mapping and refinement, which we presented in a demo paper at VLDB 2024 that sparked much interest among attendees. Over the past year, Dataloom has expanded its scope from just loading data to cleaning, visualizing, and reporting results, thus encapsulating more and more data science tasks.

Our vision is to build an end-to-end agentic data platform, enabling domain experts to acquire, clean, analyze, and visualize data in a principled manner by combining the benefits of LLMs with decades of database research.

Stay tuned for more posts on our Dataloom website: “Future (agentic) data is not doomed, it will be loomed.”




