Blog | UTN Data Systems

Text Compression Through the Looking Glass

Cold, unstructured text has long been a storage burden, driving up costs for data that is unlikely to ever be accessed again. The rise of accessible large language models (LLMs) has intensified this challenge by dramatically increasing the volume of generated content that still needs to be retained, e.g. for compliance reasons. This post explores a new class of LLM-based compression methods that can significantly reduce the storage footprint of text-heavy data, and explains why LLMs are particularly well-suited to text compression.

12 min read · May 11, 2026

2026
String Fingerprints

Cloud data warehouses are text-heavy. As the amount of text data to scan grows, queries slow down, so query engines need fast pre-filters to accelerate them. We present string fingerprints, a lightweight secondary index structure designed to approximate LIKE predicates, at the cost of some false positives. Fingerprints can be optimized for specific workloads using mixed-integer optimization and even generalize to unseen table filters.

5 min read · March 23, 2026

2026
Benchmarking Semantic Query Processing Systems

Semantic query processing is emerging as a new layer atop relational engines, elevating LLM-backed semantic operators to first-class SQL primitives for multimodal data. We present SemBench, the first benchmark to rigorously evaluate these systems end-to-end, and outline the roadmap for our own system, Spectra, which aims to make semantic operators affordable at scale.

13 min read · February 16, 2026

2026
Democratizing Data Science

Our vision is to build an end-to-end agentic data platform, enabling domain experts to acquire, clean, analyze, and visualize data in a principled manner by combining the benefits of LLMs with decades of database research.

5 min read · January 16, 2026

2026
Launching Our Blog And Wrapping Up 2025

I'm super excited to launch our blog! We'll use this space to share what's happening in our lab, from research papers and systems to the day-to-day life of our team. To kick things off, let's look back at 2025.

4 min read · December 31, 2025

2025
