Evaluating RAG — how do you know it answers accurately and from sources

A RAG system can be confident and wrong at the same time. To trust it, you have to measure it, not once but on every change. RAG evaluation is the set of metrics that separate “sounds reasonable” from “is correct and comes from a source”: retrieval relevance, groundedness, citation correctness, answer relevance and refusal correctness. Without them, “it works on my three questions” is not proof, it’s a hunch.

RAG reduces hallucinations but does not switch them off. The question is not “does RAG hallucinate” but “how often, and do we catch it before the customer does”. The answer is evaluation.

Separate the two layers: retrieval and generation

RAG has two stages, and each fails differently. First the retriever fetches passages, then the model writes an answer. If the retriever doesn’t find the right passage, the best model won’t save you. If it does find it and the model makes something up anyway, the problem is in generation. Measure the two layers separately, or you won’t know what to fix.
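One way to make the split concrete is to tag every test question with the layer where it failed. The sketch below is an illustration under stated assumptions, not a prescribed implementation: the record fields (gold_source_id, retrieved_ids, answer_correct) and the four labels are invented for this example, and answer_correct is assumed to come from a human or a calibrated judge.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question. All field names are hypothetical."""
    question: str
    gold_source_id: str       # passage that contains the expected answer
    retrieved_ids: list[str]  # passages the retriever actually returned
    answer_correct: bool      # verdict from a human or a calibrated judge


def diagnose(record: EvalRecord) -> str:
    """Attribute the outcome of one question to a layer."""
    found = record.gold_source_id in record.retrieved_ids
    if found and record.answer_correct:
        return "ok"
    if found:
        # The right passage was there and the answer is still wrong.
        return "generation_failure"
    if record.answer_correct:
        # Right answer without the source: it did not come from the passages.
        return "correct_without_source"
    return "retrieval_failure"


# Made-up records, only to show the shape of the report.
records = [
    EvalRecord("What is the notice period?", "doc-7", ["doc-7", "doc-2"], True),
    EvalRecord("Who approves refunds?", "doc-3", ["doc-3", "doc-9"], False),
    EvalRecord("What is the SLA?", "doc-5", ["doc-1", "doc-8"], False),
    EvalRecord("Which plan includes SSO?", "doc-4", ["doc-6"], True),
]

summary = Counter(diagnose(r) for r in records)
for label, count in summary.items():
    print(f"{label}: {count}")
```

A report dominated by retrieval_failure points at chunking, indexing or the retriever; one dominated by generation_failure points at the prompt or the model.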

What to measure — five metrics

Retrieval relevance. Is the passage that contains the answer among the ones fetched? Typical measures are recall@k and hit rate. Without this, the rest doesn’t matter.
Groundedness. Does every sentence of the answer follow from the fetched passages, rather than from the model’s “memory”? This is a direct measure of hallucination.
Citation correctness. Does the cited source actually contain what the model attributed to it? A citation that doesn’t support the sentence is worse than no citation — it builds false trust.
Answer relevance. Does the answer address the question, or just sit next to the topic?
Refusal correctness. Does the system say “I don’t know” when the knowledge base has no answer, instead of making one up? It’s the easiest metric to skip and the most expensive to ignore.
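Two of these metrics reduce to simple counting once each question is labeled. The sketch below shows hit rate and recall@k for retrieval, plus refusal correctness split into its two directions. The function names and the sample data are assumptions made for illustration. It also assumes every question has at least one relevant passage and that the set contains both answerable and unanswerable questions. Groundedness, citation correctness and answer relevance need a human or a model judge and are not covered here.

```python
def hit_rate_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of questions with at least one relevant passage in the top k."""
    hits = sum(1 for got, want in zip(retrieved, relevant) if want & set(got[:k]))
    return hits / len(retrieved)


def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Mean share of each question's relevant passages found in the top k."""
    scores = [len(want & set(got[:k])) / len(want) for got, want in zip(retrieved, relevant)]
    return sum(scores) / len(scores)


def refusal_correctness(should_refuse: list[bool], did_refuse: list[bool]) -> dict[str, float]:
    """Score the refuse/answer decision separately for both kinds of question."""
    unanswerable = [did for should, did in zip(should_refuse, did_refuse) if should]
    answerable = [did for should, did in zip(should_refuse, did_refuse) if not should]
    return {
        # Knowledge base has no answer: did the system say "I don't know"?
        "refused_when_unanswerable": sum(unanswerable) / len(unanswerable),
        # Knowledge base has the answer: did the system avoid a needless refusal?
        "answered_when_answerable": sum(not did for did in answerable) / len(answerable),
    }


# Made-up data: ranked passage ids per question and the ids that hold the answer.
retrieved = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f", "x"}, {"z"}]

print("hit rate@3:", hit_rate_at_k(retrieved, relevant, k=3))
print("recall@3:  ", recall_at_k(retrieved, relevant, k=3))
print(refusal_correctness(
    should_refuse=[True, True, False, False],
    did_refuse=[True, False, False, True],
))
```

Reporting refusal in two numbers keeps a system from scoring well just by refusing everything, or by never refusing at all.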

How to measure — the method

A golden set. 50–200 real questions, each with an expected answer and source. This is your regression test: you build it once, rerun it on every change, and extend it as new failure cases turn up.
Judge: human first, then model. Some metrics (groundedness, relevance) can be scored by a model (“LLM-as-judge”), but the judge is fallible too — calibrate it against a human-scored sample before you trust it.
Rerun the golden set on every change. Changing the model, the prompt or the chunking, or refreshing the knowledge base, can fix one question and break ten. Without a golden set you won’t see it until the customer does.
Monitor in production. Data and questions shift — what passed tests in March can drift by June. Keep measuring after launch: refusal rate, the distribution of retrieved sources, user reports.
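Two of the steps above lend themselves to a short sketch: checking how far a model judge agrees with human scores, and comparing two golden-set runs question by question. The example is a minimal illustration, not a definitive implementation. Cohen’s kappa is used here as one common agreement measure (the text does not prescribe a specific one), and the function names, the pass/fail representation and the sample data are all assumptions.

```python
def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement between two pass/fail raters, corrected for chance agreement."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one identical label to everything; kappa is
        # undefined here, so this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def regression_diff(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """List questions that flipped between two runs of the same golden set."""
    fixed = sorted(q for q, ok in baseline.items() if not ok and candidate.get(q, False))
    broken = sorted(q for q, ok in baseline.items() if ok and not candidate.get(q, False))
    return {"fixed": fixed, "broken": broken}


# Made-up verdicts on the same eight answers (True = grounded).
human_scores = [True, True, True, False, False, True, False, True]
judge_scores = [True, True, False, False, False, True, True, True]
print("judge vs human kappa:", round(cohen_kappa(human_scores, judge_scores), 3))

# Made-up pass/fail results per question id, before and after a prompt change.
before = {"q1": True, "q2": True, "q3": False, "q4": True}
after = {"q1": True, "q2": False, "q3": True, "q4": False}
print(regression_diff(before, after))
```

The overall pass rate can stay flat while individual questions flip in both directions, which is why the comparison lists questions instead of reporting a single score.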

Why this is ongoing work, not a project

The model updates, documents accumulate, questions evolve. An evaluation that doesn’t keep running ages along with them. That’s why for us RAG evals are part of ongoing care (the retainer), not a one-off sign-off — they are what decides whether quality holds over time. As we wrote about guardrails: without tests and evaluations, a safeguard is decoration. The same is true of RAG.

In short

Measure retrieval and generation separately. Five metrics: retrieval relevance, groundedness, citation correctness, answer relevance, refusal correctness. Build a golden set, calibrate the model-judge against human scores, rerun the golden set on every change, and keep measuring in production. Then “it works” stops being a hunch and becomes a number.

What next

How we build RAG with sources is on the RAG / knowledge bases page. Evaluations and maintaining quality over time are part of ongoing care. If you already have a RAG system and aren’t sure you can trust it, start with an audit.