Brussels / 31 January & 1 February 2026


How to Prevent Your AI from Returning Garbage: It Starts and Ends with Data Engineering


Your AI application returns wrong answers. Not because of your LLM choice or vector database, but because of the data engineering (or lack thereof) that nobody wants to talk about.

This technical deep dive shows why embedding models, chunking strategies, and search filtering have more impact on AI accuracy than switching from one model to another. Using real production data, we'll demonstrate how naive vector search returns Star Trek reviews when users ask about Star Wars, how poor chunking strategies lose critical context (who wants their AI to answer "how do I fix a headache" with a head transplant?), and why "just use a vector" without proper data engineering guarantees hallucinations.
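The chunking failure above can be shown in a few lines. This is a toy sketch (not taken from the talk): a naive fixed-size chunker splits a document mid-answer, so a question and its remedy land in different chunks and a retriever that embeds chunks independently loses the pairing. The document text and chunk size are invented for illustration.

```python
# Illustrative toy document: two Q&A pairs packed into one string.
doc = (
    "Q: How do I fix a headache? A: Drink water and rest. "
    "Q: How do I fix a broken monitor? A: Replace the display panel."
)

def naive_chunks(text, size):
    """Split text into fixed-size character chunks, no overlap, no awareness
    of sentence or Q&A boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

for chunk in naive_chunks(doc, 40):
    print(repr(chunk))
# The headache question ends up in one chunk and most of its answer in the
# next, so an embedding of either chunk alone misrepresents the content.
```

Context-preserving strategies (overlapping windows, sentence-boundary splitting, or the double-embedding approach covered below) exist precisely to avoid this severing.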

We'll cover:

  • Embedding model selection: dimensions, token limits, and silent truncation failures
  • Chunking strategies: when to chunk, how to preserve context, and the double-embedding approach
  • Hybrid search: combining Full Text/BM25 keyword matching with vector similarity
  • Filtering architecture: pre-filter vs post-filter performance trade-offs
  • Production gotchas: triggers, performance, batch processing, and cold start problems
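The hybrid-search bullet above can be sketched in miniature. This is an illustrative toy, not the talk's implementation: a crude term-overlap score stands in for BM25, the "embeddings" are hand-written 2-D vectors, and the fusion weight `alpha` is arbitrary. It shows the shape of the idea, that exact keyword matches and vector similarity are combined into one ranking score.

```python
import math

def keyword_score(query, doc):
    """Crude stand-in for BM25: fraction of query terms present in the doc."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# (doc text, pretend embedding) pairs -- both invented for illustration.
docs = [
    ("Star Wars is a space opera about the Skywalker family", [0.9, 0.1]),
    ("Star Trek is a voyage of the starship Enterprise", [0.8, 0.3]),
]
query_text = "Star Wars review"
query_vec = [0.92, 0.12]  # pretend embedding of the query

alpha = 0.5  # weight between keyword and vector components
ranked = sorted(
    docs,
    key=lambda d: alpha * keyword_score(query_text, d[0])
                  + (1 - alpha) * cosine(query_vec, d[1]),
    reverse=True,
)
print(ranked[0][0])  # the literal term "Wars" pushes the right doc to the top
```

In production the keyword side would be a real full-text index (e.g. PostgreSQL `tsvector` ranking) and the vector side a real embedding index; the fusion step, however, looks much like this weighted sum.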

While many of the examples will use PostgreSQL, the talk itself is database-agnostic: whether you run PostgreSQL, MariaDB, ClickHouse, or something else, you will learn something. In AI Land, the hard problem is always data engineering, not database selection.

Users don't care about inference speed; they care about accuracy. This talk shows how to engineer your data pipeline so your AI doesn't lie.

Speakers

Matt Yonkovit (The Yonk)

Links