Data readiness for RAG — a checklist before you deploy AI

Many enterprise AI projects fail on the data, not on the model. Data readiness for RAG is the state where your documents are complete, clean, governed by permissions and kept up to date: fit to be the knowledge base your AI answers from, with a source cited for each answer. The model is rarely the differentiator; what matters is what you feed it. Walk through this checklist, 25 questions across six areas, before you build, not when the prototype is already hallucinating.

If you’re still working out what RAG is, start with RAG on company documents. Below we assume you know what it’s for, and you’re asking the next question: is my data ready for it?

1. Sources and scope

Before you index anything, you need to know what you’re indexing and where it lives.

Which documents should actually answer questions? Policies, procedures, proposals, technical docs — list the specific sets, not “all the company’s knowledge”.
Where do those documents physically live? SharePoint, a network drive, email, the ERP, one person’s head — each source is a separate integration and a separate risk.
Is there a single source of truth per topic? If the same procedure lives in five versions in five places, RAG will cite one of them — not necessarily the current one.
Who owns each set of documents? With no owner, no one can say whether a document is current or whether it may be cited.
What are you deliberately leaving out of scope? Working notes, old versions, private folders. No boundary means the model answers from something you didn’t want cited.
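
The questions above can be turned into a small inventory check. This is a minimal sketch, not a prescribed tool: the `DocumentSet` fields, the `inventory_gaps` function and the sample entries are all hypothetical names chosen for illustration. It flags in-scope sets with no owner and topics that live in more than one place.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentSet:
    topic: str            # what the set answers questions about
    location: str         # where it physically lives
    owner: Optional[str]  # who can confirm it is current; None if unknown
    in_scope: bool        # deliberately included in the index or not


def inventory_gaps(sets: list[DocumentSet]) -> list[str]:
    """Return readable gaps for the in-scope sets only."""
    gaps = []
    locations_by_topic = defaultdict(set)
    for s in sets:
        if not s.in_scope:
            continue  # out-of-scope sets are a decision, not a gap
        if not s.owner:
            gaps.append(f"no owner: {s.topic} at {s.location}")
        locations_by_topic[s.topic].add(s.location)
    for topic, locations in sorted(locations_by_topic.items()):
        if len(locations) > 1:
            gaps.append(f"no single source of truth: {topic} in {sorted(locations)}")
    return gaps


# Hypothetical inventory, for illustration only.
inventory = [
    DocumentSet("leave policy", "SharePoint/HR", "hr-lead", True),
    DocumentSet("leave policy", "network drive", None, True),
    DocumentSet("working notes", "private folder", None, False),
]
for gap in inventory_gaps(inventory):
    print(gap)
```

An empty result doesn't prove the data is ready; it only means each in-scope set has a named owner and one location.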

2. Permissions and sensitive data

This is the area that most often derails a project after the fact — and is the hardest to fix once you’re live.

Who is allowed to see which document? A RAG system that ignores permissions will show a salesperson the HR file. The system will expose whatever the retrieval pipeline is allowed to fetch.
Do the documents contain personal data or secrets? ID numbers, patient data, contract terms — these affect risk assessment, the basis for processing, retention, access architecture and AI Act classification.
Do you have a basis for processing and retaining personal data? Adding a document to a RAG base is another processing purpose — it needs a basis and a defined retention period.
Can you filter answers per user role? If not, the only safe boundary is to narrow the knowledge base — before an incident, not after.
Where does the data stay? In production RAG every answer must carry a source, and the documents themselves shouldn’t flow into public models or their training data.
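
Filtering per user role can be sketched as a step between retrieval and the model. The example below assumes each chunk carries the roles allowed to read its source document, copied from the source system at indexing time; `Chunk`, `allowed_roles` and `filter_by_role` are illustrative names, not the API of any particular vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_roles: frozenset[str]  # assumption: copied from source permissions when indexed


def filter_by_role(candidates: list[Chunk], user_roles: set[str]) -> list[Chunk]:
    """Keep only chunks the user may see, before anything reaches the model.

    A chunk with no allowed roles is dropped, so missing metadata fails closed.
    """
    return [c for c in candidates if c.allowed_roles & user_roles]


# Hypothetical retrieval results, for illustration only.
retrieved = [
    Chunk("price-list", "Standard discount tiers ...", frozenset({"sales", "finance"})),
    Chunk("hr-file", "Performance review notes ...", frozenset({"hr"})),
    Chunk("untagged-memo", "Draft reorganisation plan ...", frozenset()),
]
visible = filter_by_role(retrieved, {"sales"})
print([c.doc_id for c in visible])  # ['price-list']
```

Many vector stores can apply an equivalent metadata filter inside the query itself, which is generally preferable to filtering afterwards. Either way, the rule is that a chunk the user may not read never enters the prompt.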

3. Document quality and structure

The model is only as good as the chunk it gets. Garbage in, garbage in the citation.

Are the documents text, or scans? A scan with no text layer has to go through OCR first — otherwise, to RAG, it’s a picture with no content.
Can the content be split into meaningful chunks? Tables, forms and 200-page PDFs with no headings chunk badly and lose context.
Are there duplicates and contradictions? Two versions of the same policy that disagree are a guarantee of inconsistent answers.
Does the content carry its context, or assume it? Acronyms, internal names and “the usual way” are obvious to the team, not to the model.
Are the documents in the languages the system must answer in? A Polish knowledge base queried in English is a common cause of empty or weak retrieval.
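
Two of these checks are cheap to automate before indexing. The sketch below assumes you already have the extracted text per file (the extraction step itself is out of scope here): files with no text are likely scans that need OCR, and files whose normalised text is identical are duplicates. The function names and sample files are hypothetical. It catches exact copies only; near-duplicates and contradictory versions still need a human or a similarity check.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def missing_text(docs: dict[str, str]) -> list[str]:
    """Files whose extraction produced no text, e.g. scans without a text layer."""
    return sorted(path for path, text in docs.items() if not text.strip())


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Groups of files whose normalised text is identical."""
    groups = defaultdict(list)
    for path, text in docs.items():
        if not text.strip():
            continue  # empty files are reported by missing_text instead
        groups[fingerprint(text)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Hypothetical extracted texts, for illustration only.
extracted = {
    "hr/leave-policy.docx": "Leave  Policy\nversion 1",
    "archive/leave-policy-copy.docx": "leave policy version 1",
    "hr/travel-policy.docx": "Travel policy",
    "scans/contract-0001.pdf": "",
}
print(missing_text(extracted))     # ['scans/contract-0001.pdf']
print(find_duplicates(extracted))  # [['archive/leave-policy-copy.docx', 'hr/leave-policy.docx']]
```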

4. Freshness and versioning

A knowledge base isn’t a one-day snapshot. Data that doesn’t refresh ages faster than you think.

How often does the content change? A weekly price list and a yearly policy need a different re-indexing rhythm.
Who updates the base after go-live, and how? With no process owner, RAG will cite the state from launch day — indefinitely.
Is the document’s date and version visible? Without it, neither a human nor the model can tell a current procedure from an archived one.
How do you retire a document from the index when it expires? Deleting a file at the source doesn’t remove it from the vector store; unless the index is updated too, RAG keeps citing stale content.
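
Retiring and refreshing documents comes down to comparing what the source holds with what the index holds. A minimal sketch, assuming you can list document IDs with a version marker (a content hash or a modified date) on both sides; `plan_sync` and the sample IDs are hypothetical, and the actual delete and re-index calls depend on your vector store.

```python
def plan_sync(source: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Compare doc_id -> version marker in the source and in the index.

    Returns which documents to delete from the index, add to it, or re-index.
    """
    source_ids, indexed_ids = source.keys(), indexed.keys()
    return {
        "delete": sorted(indexed_ids - source_ids),   # gone at the source, still indexed
        "add": sorted(source_ids - indexed_ids),      # new at the source
        "reindex": sorted(
            doc_id for doc_id in source_ids & indexed_ids
            if source[doc_id] != indexed[doc_id]      # changed since last indexing
        ),
    }


# Hypothetical state, for illustration only.
source_state = {"price-list": "v7", "leave-policy": "v3", "onboarding": "v1"}
index_state = {"price-list": "v6", "leave-policy": "v3", "old-procedure": "v2"}
print(plan_sync(source_state, index_state))
# {'delete': ['old-procedure'], 'add': ['onboarding'], 'reindex': ['price-list']}
```

How often this runs follows from how often the content changes: more often for a weekly price list than for a yearly policy.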

5. Tests and quality metrics

Without tests, “it works” is a hunch, not a fact. Building isn’t enough — you have to measure.

Do you have a list of real questions the system must answer? That’s the seed of a golden set: question, expected answer, source.
What should the system do when it doesn’t know? A correct refusal (“I don’t know, check with…”) is a quality trait, not a failure.
Who decides whether an answer is accurate and sourced? How to measure RAG quality is its own topic, covered in evaluating RAG. The point here is to have the material to build that test from.
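
A golden set can start as a plain list of cases. The sketch below is one possible shape, with hypothetical names and sample questions: each case holds a question, the expected answer and the expected source, and a case with no expected source means the correct behaviour is a refusal. The automated check here only compares sources; whether the answer text is accurate is left to the person who owns that decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_answer: Optional[str]  # reviewed by a human, not compared automatically here
    expected_source: Optional[str]  # None means the system should refuse


@dataclass(frozen=True)
class Answer:
    text: str
    source: Optional[str]  # None when the system declines to answer


def source_accuracy(cases: list[GoldenCase], answer_fn: Callable[[str], Answer]) -> float:
    """Share of cases where the cited source (or the refusal) matches the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for c in cases if answer_fn(c.question).source == c.expected_source)
    return passed / len(cases)


# Hypothetical golden set and a stand-in for the real RAG system.
golden_set = [
    GoldenCase("How do I request leave?", "Submit the form to your manager.", "leave-policy-v3"),
    GoldenCase("What will next year's bonus be?", None, None),
]


def stub_rag(question: str) -> Answer:
    if "leave" in question:
        return Answer("Submit the form to your manager.", "leave-policy-v3")
    return Answer("I don't know, check with HR.", None)


print(source_accuracy(golden_set, stub_rag))  # 1.0
```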

6. Cost and maintenance

The most expensive part of GenAI is usually not inference but data engineering — and it doesn’t end at go-live.

Who maintains the data pipeline after launch? Tidying sources is ongoing work, not a one-off import.
Are you measuring cost and quality over time? What GenAI actually costs in production, on real data, is covered in what GenAI really costs.
Are you starting with a slice or the whole thing? A PoC on one well-ordered set will tell you more than indexing everything at once.

In short

Data readiness for RAG is checked across six areas: sources and scope, permissions and sensitive data, document quality and structure, freshness and versioning, quality tests, and cost and maintenance. If you answer “I don’t know” to most of the questions, that isn’t a reason to drop AI — it’s the first phase of the project. The cheapest time to find out is before you build, not after.

What next

How we build RAG on company documents, with sources, on AWS, is described on the RAG / knowledge bases page. Tidying data (ETL) and building the knowledge base form a distinct delivery step for us, described in how we work. If you don’t know where to start, start with an audit: we’ll walk through this checklist on your data.