Benchmarking Semantic Query Processing Systems

Semantic database systems bring probabilistic, LLM-based operators into traditionally deterministic relational engines for query processing. This integration introduces a fundamental tradeoff between result quality and execution cost. Yet, existing benchmarks do not evaluate this tradeoff at the system level.

In this post, we present SemBench, a benchmark designed to rigorously assess semantic query processing systems end-to-end.

SemBench fills a gap we have observed in the benchmarking world: it is a benchmark specifically dedicated to multimodal semantic query processing systems, which integrate modern multimodal question answering into traditional database engines for query processing. Although there are numerous existing benchmarks in each respective domain, we found none of them suitable for this new class of systems, motivating the development of SemBench.

The SemBench project is in collaboration with researchers and practitioners from Cornell University, UTN, BIFOLD & TU Berlin, University of Michigan, MIT CSAIL, Vrije Universiteit Amsterdam, and Google: Jiale Lao, Andi Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, and Immanuel Trummer.

What are Semantic Query Processing Systems?

…and why are they needed?

Semantic query processing systems bring the magically flexible data processing capabilities of multimodal LLMs (e.g., GPT-4o) into semantically well-defined traditional database systems by adding new so-called semantic operators.

Current State

Traditional database systems excel at processing data at scale. They rely on a declarative query language, typically SQL, which has both benefits and limitations: the semantics are well-defined and allow for a number of impactful optimizations, but the set of supported operations is limited to the functionality SQL offers. Numerical data is supported as a first-class citizen, and strings usually work well too. However, when it comes to more complex data types, like images, documents, or even audio files, traditional database systems typically treat them as opaque BLOBs, leaving users with the following options:

  1. Take data out of the database, process it externally in application code, and put it back. This loses consistency guarantees, and external processing scripts are usually not optimized for efficient data handling.
  2. If supported by the database system, users can write User-Defined Functions (UDFs) in a different language (e.g., Python) with access to libraries. This works well, but UDFs are notoriously hard for database systems to optimize and are usually treated as black boxes.
  3. Avoid the database system altogether.

Options (1) and (3) are undesirable in practice; (2) is workable but suboptimal.

Extending SQL with LLMs

Modern multimodal LLMs are particularly well suited for processing such unstructured BLOB data types. In addition, they enable non-technical users to interact with and analyze this data by expressing their intent directly in natural language, without the need for advanced programming skills. Examples include “Summarize the content of this audio interview” or “Is there an elephant in the picture?”.

Because LLMs can flexibly interpret unstructured inputs with reasonable latency, it is tempting to integrate them into workflows that process large data collections, much like relational databases do for structured data. However, throwing a large table at an LLM is not directly possible, since table sizes typically exceed the LLM’s context limits; record-level processing, on the other hand, is possible, albeit slow and expensive. In fact, record-level processing is as simple as an API call plus some additional post-processing inside a UDF.
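
As a rough sketch (not any particular system’s implementation), this record-level pattern boils down to a small UDF-style function: build a prompt from the record, call the model, and parse the response. The client, model name, and record format below are illustrative assumptions.

# Minimal sketch of record-level LLM processing: one API call per record,
# plus light post-processing of the model's free-form answer.
from openai import OpenAI

client = OpenAI()          # assumes an API key in the environment
MODEL = "gpt-4o-mini"      # illustrative choice; any chat model works

def ai_filter_udf(predicate: str, record: dict) -> bool:
    """Evaluate a natural-language predicate against a single record."""
    prompt = f"{predicate}\n\nRecord: {record}\n\nAnswer strictly with 'yes' or 'no'."
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    # Post-processing: turn the model's text answer into a boolean.
    return response.choices[0].message.content.strip().lower().startswith("yes")

records = [{"title": "Optimizing Semantic Filters", "abstract": "..."},
           {"title": "A Survey of B-Trees", "abstract": "..."}]
kept = [r for r in records if ai_filter_udf("Is this paper about semantic database engines?", r)]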

This simple pattern (LLM call plus response parsing) can be formalized into reusable query primitives (Patel et al., 2024), including:

  • AI_FILTER() takes a natural-language predicate and a tuple as context, evaluates the predicate with an LLM, and extracts a boolean value from its response, allowing users to filter data. Placed on top of a (cross-)join, this even allows users to join records semantically.
  • AI_SCORE() rates a tuple based on a natural-language description and emits a numerical score, allowing a database system to sort by that number.
  • AI_CLASSIFY() assigns class labels to records based on class descriptions, making it usable in GROUP BY clauses (or for traditional labeling tasks).
  • AI_AGG() takes as many tuples as the model’s context window allows and returns an aggregate value, for example a summary across multiple scanned document pages.

These primitives allow for flexible but well-defined queries with LLM operations on rows in a database, unlocking a vast array of new analytical capabilities in database systems. All of these operators are, by nature, fuzzy and might produce different results depending on the LLM being used.
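
To illustrate what such an operator might look like under the hood, here is a minimal sketch of an AI_AGG-style aggregation (purely illustrative; the batching heuristic, character-based limit, and model name are assumptions rather than the design of any specific system): pack tuples into batches that fit the context window, aggregate each batch, and then aggregate the partial results.

# Sketch of an AI_AGG-style aggregate: greedily pack tuples into context-sized
# batches, aggregate each batch with the LLM, then combine the partial results.
from openai import OpenAI

client = OpenAI()
MAX_CHARS = 12_000   # crude character-based stand-in for a token-based context limit

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def ai_agg(instruction: str, tuples: list[str]) -> str:
    # Greedily pack tuples into batches that respect the (approximate) limit.
    batches, current = [], ""
    for t in tuples:
        if current and len(current) + len(t) > MAX_CHARS:
            batches.append(current)
            current = ""
        current += t + "\n"
    batches.append(current)
    # Aggregate each batch, then aggregate the batch-level results.
    partials = [ask(f"{instruction}\n\n{batch}") for batch in batches]
    if len(partials) == 1:
        return partials[0]
    return ask(f"{instruction}\n\nCombine these partial results:\n" + "\n".join(partials))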

While all of these operations could be implemented as UDFs, a deeper integration into a query engine allows for much more advanced optimizations, like model cascades (Liu et al., 2024), exploitation of item-similarity (Patel et al., 2025), prompt fusion, prompt compression (Liu et al., 2024), and predicate pull-up. The idea of semantic query processing engines is to realize this deep integration, making semantic functions first-class citizens in database systems. Similarly, SemBench is designed to expose the impact of these optimizations.
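
As one concrete example of such an optimization, a model cascade routes every record through a cheap model first and escalates to a stronger, more expensive model only when the cheap model is not confident. The sketch below is illustrative; the “unsure” escalation rule and the model names are assumptions, not how any of the cited systems implement cascades.

# Sketch of a model cascade for a semantic filter: the cheap model answers first,
# and only records it cannot decide on are escalated to the stronger model.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip().lower()

def cascade_filter(predicate: str, record: dict) -> bool:
    prompt = (f"{predicate}\n\nRecord: {record}\n\n"
              "Answer 'yes', 'no', or 'unsure' if you cannot tell.")
    answer = ask("gpt-4o-mini", prompt)               # cheap model first
    if answer.startswith(("yes", "no")):
        return answer.startswith("yes")               # confident cheap answer: done
    return ask("gpt-4o", prompt).startswith("yes")    # escalate only when unsure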

The following example query demonstrates the flexibility of semantic query processing systems, making use of AI_FILTER, AI_CLASSIFY, AI_SCORE, and AI_AGG:

SELECT
  -- Classify papers into different areas
  AI_CLASSIFY('Assign the paper {p.file} to one of the following areas', {
    'efficient-filters': 'Papers discussing how semantic filters can be implemented more efficiently',
    'efficient-joins': 'Papers discussing how semantic joins can be implemented more efficiently',
    'other': 'Everything else',
  }) as area,
  -- Summarize state of the art
  AI_AGG('Summarize the research field based on the papers: {p.file}') as summary
FROM LIST_FILES('./arxiv_downloads/*.pdf') as p
WHERE true
  -- Filter papers based on research field
  AND AI_FILTER('Is the paper {p.file} about semantic database engines?')
GROUP BY area
-- Order summaries by how advanced the field appears
ORDER BY AI_SCORE('How advanced is the research field based on the summary: {summary}');

Existing Implementations

Surprisingly, even though semantic query processing is still a young topic, there are already a large number of academic research systems—and even industry systems—that support it. Among the research systems are LOTUS (GitHub, paper), Palimpzest (GitHub, paper), ThalamusDB (GitHub, paper), FlockMTL (GitHub, paper), CAESURA (GitHub, paper), and BlendSQL (GitHub, paper). Additionally, major industry players like Snowflake (paper, blog post), BigQuery (blog post), and Databricks (documentation) have already integrated semantic functions into their offerings.

Why SemBench?

To evaluate semantic query engines, we need benchmarks that test both accuracy loss due to optimizations and end-to-end execution efficiency in terms of cost and latency.

A benchmark should help answer the following questions:

  1. Can semantic operators be optimized without significant accuracy degradation?
  2. How does cost scale with dataset size?
  3. How stable are results across models?
  4. What is the impact of specific optimizations?

Traditional database benchmarks like the TPC family or JOB (Leis et al., 2015) focus on large-scale relational data processing, but lack any semantic operations. Extending these benchmarks with semantic operators is also not feasible because the underlying datasets (a) usually lack unstructured data types, including textual descriptions or images, and (b) provide no ground truth to verify the results of semantic queries.

Benchmarks in the ML area, on the other hand, usually focus on assessing the correctness of the answer to a single question: classifying data points, identifying objects in an image, multimodal question answering, etc. Even though these questions are usually repeated thousands of times and averages are reported, there is no focus on techniques that efficiently answer these questions over a large dataset as a whole.

Neither of the two benchmarking worlds is really suitable to assess semantic query processing systems, so the idea of implementing a new benchmark specifically for these systems came up at a Dagstuhl Seminar in 2025. This is how SemBench (Lao et al., 2025) was born.

Dataset and Workloads of SemBench

SemBench consists of five use cases, each with 10-14 queries, resulting in a total of 55 queries comprising 78 semantic operators. Each use case comes with its own dataset and query patterns with a verifiable ground truth.

Use cases cover a range of different modalities, including tabular data, textual descriptions, images, and audio; the exact breakdown of which modalities each of the five scenarios (Movie, Wildlife, E-Comm, MMQA, and Cars) covers, across Table, Text, Image, and Audio, can be found in the corresponding table in the paper.

We also tried to cover as many different specialized semantic operators as possible under the constraint of a verifiable ground truth. Naturally, there is a strong focus on AI_FILTER and also AI_JOIN. A breakdown of the number of different semantic operators per use case can be found in the following table:


Scenario | #Queries | AI_FILTER | AI_JOIN | AI_MAP | AI_SCORE | AI_CLASSIFY
Movie | 10 | 4 | 3 | 0 | 2 | 1
Wildlife | 10 | 17 | 0 | 0 | 0 | 0
E-Comm | 14 | 12 | 9 | 3 | 1 | 2
MMQA | 11 | 5 | 3 | 4 | 0 | 0
Cars | 10 | 12 | 0 | 0 | 0 | 1
Total | 55 | 49 | 15 | 7 | 3 | 4

We implemented SemBench scenarios for LOTUS, Palimpzest, ThalamusDB, and BigQuery; but we encourage other vendors to contribute their results. To compare results and identify strengths and weaknesses of each system, we set up a leaderboard that we plan to update as new submissions arrive: http://sembench.org/. Note that all of the systems are currently under active development and the reported metrics are merely a snapshot in time.

The reference implementations together with the benchmarking queries and dataset references can be found in the SemBench repository on GitHub: https://github.com/SemBench/SemBench.

Ground Truth vs. “Gold Standard Ground Truth”

During the development of the benchmark, we faced the question of what to base the ground truth on in order to compute query accuracy. In general, there are two common approaches:

  • In the machine learning domain, ground truth is usually “absolute” ground truth: results are compared against labels that have been determined by humans. The advantage is that model performance can be evaluated in absolute terms for specific tasks.
  • Another option is to compare results to a “gold standard” model: the best available model used as a baseline for these operators. This yields relative accuracy, measuring whether optimizations stay close to baseline quality while reducing cost (e.g., ~99% relative accuracy). Its key advantage is broad applicability, since it requires no human-labeled ground truth and works for any dataset or query.

Both options are viable, but the latter has the disadvantage of being a moving target: improvements in ML models could render earlier relative accuracy scores invalid. Therefore, we opted for absolute ground truth in SemBench.
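
As a toy illustration of the difference (purely illustrative numbers, not SemBench’s scoring code): absolute accuracy compares a system’s outputs to human labels, while relative accuracy compares them to the outputs of a gold-standard model.

# Toy comparison of absolute vs. relative accuracy for a semantic filter.
def accuracy(predictions: list[bool], reference: list[bool]) -> float:
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)

human_labels = [True, False, True,  True,  False]   # human-annotated ground truth
gold_model   = [True, False, True,  False, False]   # best available model's output
system       = [True, False, False, False, False]   # system under test

print(accuracy(system, human_labels))  # absolute accuracy, as reported by SemBench
print(accuracy(system, gold_model))    # relative accuracy w.r.t. the gold-standard model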

Call for Contributions

With this post, we not only introduce SemBench but also invite contributions to the online leaderboard.

The landscape of semantic query processing systems is evolving rapidly: new systems continue to emerge, and existing ones are improved at a fast pace. To ensure that SemBench remains representative and up to date, we encourage system developers to implement the benchmark for their own systems and submit their results.

We are currently formalizing a structured contribution process. In the meantime, researchers and practitioners interested in contributing results are welcome to contact us directly.

UTN’s Research Roadmap

During the work on SemBench, we observed that semantic operators are not yet practical on large-scale datasets due to their extraordinarily high monetary costs and slow processing times. We decided to tackle this limitation, making it a core research focus of our own semantic query processing engine Spectra.


References

  1. Lao, J., et al. SemBench. CoRR, 2025.


