Semitora.

29 June 2026

Evaluating RAG: how do you know it answers accurately and from sources?

A RAG system can be confident and wrong at the same time. To trust it, you have to measure it, not once but on every change. RAG evaluation is the set of metrics that separate “sounds reasonable” from “is correct and comes from a source”: retrieval relevance, answer groundedness, citation correctness, answer relevance and refusal correctness. Without them, “it works on my three questions” is not proof, it’s a hunch.

RAG reduces hallucinations but does not switch them off. The question is not “does RAG hallucinate” but “how often, and do we catch it before the customer does”. The answer is evaluation.

Separate the two layers: retrieval and generation

RAG has two stages, and each fails differently. First the retriever fetches passages, then the model writes an answer. If the retriever doesn’t find the right passage, even the best model won’t save you. If it does find it and the model makes something up anyway, the problem is in generation. Measure the two layers separately, or you won’t know which one to fix.
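To make the split concrete, here is a minimal sketch of scoring the retrieval layer on its own, before any answer is generated. It assumes a small labelled set in which a human has marked which passages contain the answer. The names `GoldenExample`, `recall_at_k` and `diagnose`, and the passage ids, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    question: str
    relevant_ids: set[str]    # passages a human marked as containing the answer
    retrieved_ids: list[str]  # what the retriever returned, best first


def recall_at_k(example: GoldenExample, k: int) -> float:
    """Share of the relevant passages that appear in the top k results."""
    if not example.relevant_ids:
        return 1.0  # nothing to find, so nothing was missed
    top_k = set(example.retrieved_ids[:k])
    return len(top_k & example.relevant_ids) / len(example.relevant_ids)


def mean_recall_at_k(examples: list[GoldenExample], k: int) -> float:
    """Average recall@k over the whole labelled set."""
    return sum(recall_at_k(e, k) for e in examples) / len(examples)


def diagnose(retrieval_hit: bool, answer_grounded: bool) -> str:
    """Point at the layer to fix for a single failed question."""
    if not retrieval_hit:
        return "retrieval"   # the right passage never reached the model
    if not answer_grounded:
        return "generation"  # the passage was there, the answer ignored it
    return "ok"


# Illustrative data only, not real results.
examples = [
    GoldenExample("What is the notice period?", {"doc1#3"},
                  ["doc1#3", "doc2#1", "doc4#2"]),
    GoldenExample("Who approves refunds?", {"doc7#1", "doc7#2"},
                  ["doc2#1", "doc7#2", "doc9#5"]),
]
print(mean_recall_at_k(examples, k=3))
```

Because retrieval gets its own number, a drop in answer quality can be traced to one layer instead of being guessed at.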

What to measure — five metrics
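Two of the five metrics named in this post's summary, citation correctness and refusal correctness, can be approximated with simple counting once a human has labelled the data. The sketch below is one possible reading, not a standard definition. It assumes citation correctness means "share of citations that point at a passage labelled as supporting the answer" and refusal correctness means "share of questions where the system refused exactly when the knowledge base had no answer". Function names and labels are hypothetical.

```python
def citation_correctness(cited_ids: list[str], supporting_ids: set[str]) -> float:
    """Share of citations that point at a human-labelled supporting passage."""
    if not cited_ids:
        return 0.0  # assumption: an answer without citations earns no credit
    hits = sum(1 for cited in cited_ids if cited in supporting_ids)
    return hits / len(cited_ids)


def refusal_correctness(cases: list[tuple[bool, bool]]) -> float:
    """cases holds (answerable, refused) pairs, one per question.

    A case is correct when the system answers an answerable question
    or refuses an unanswerable one.
    """
    correct = sum(1 for answerable, refused in cases if refused != answerable)
    return correct / len(cases)


# Illustrative labels only, not real results.
print(citation_correctness(["a", "b", "x"], {"a", "b"}))
print(refusal_correctness([
    (True, False),   # answerable, answered: correct
    (False, True),   # unanswerable, refused: correct
    (False, False),  # unanswerable, answered anyway: wrong
    (True, False),   # answerable, answered: correct
]))
```

Groundedness and answer relevance usually need a human or a model-judge rather than a counting rule, so they are left out of this sketch.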

How to measure — the method
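Two steps of the method listed in the summary, calibrating the model-judge against human scores and checking every change against a baseline, can be sketched in a few lines. This is a minimal illustration under stated assumptions: Cohen's kappa is our choice of agreement measure (the post does not prescribe one), and the `pass`/`fail` labels, the metric names and the 0.02 tolerance are hypothetical.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between human and model-judge labels, corrected for chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.02) -> list[str]:
    """Names of metrics that dropped by more than `tolerance` versus baseline."""
    return [name for name, base in baseline.items()
            if current.get(name, 0.0) < base - tolerance]


# Illustrative labels and scores only, not real results.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human_labels, judge_labels))

baseline = {"groundedness": 0.90, "retrieval_recall": 0.80}
current = {"groundedness": 0.85, "retrieval_recall": 0.81}
print(regressions(baseline, current))
```

A low kappa suggests the judge's verdicts should not be trusted yet. A non-empty list from `regressions` is a reason to block the change until someone has looked at it.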

Why this is ongoing work, not a project

The model updates, documents accumulate, questions evolve. An evaluation that is run once goes stale as they change. That’s why for us RAG evals are part of ongoing care (the retainer), not a one-off sign-off: they are what decides whether quality holds over time. As we wrote about guardrails: without tests and evaluations, a safeguard is decoration. The same is true of RAG.

In short

Measure retrieval and generation separately. Five metrics: retrieval relevance, groundedness, citation correctness, answer relevance, refusal correctness. Build a golden set, calibrate the model-judge against human scores, run regression tests on every change, and keep measuring in production. Then “it works” stops being a hunch and becomes a number.

What next

How we build RAG with sources is on the RAG / knowledge bases page. Evaluations and maintaining quality over time are part of ongoing care. If you already have a RAG system and aren’t sure you can trust it, start with an audit.