TL;DR - A normal LLM only knows its training data. RAG (Retrieval-Augmented Generation) fetches the relevant parts of your documents and puts them in the prompt - so the AI answers from your knowledge, with citations. It's an open-book exam.
Why it matters
RAG is what makes AI genuinely useful for organization-specific knowledge: internal help desks, "ask our docs", product Q&A. It's current, private, and citable - without retraining a model.
How it works (4 steps)
1. Prepare - split your docs into chunks; store them in a search index
(often a "vector database" that searches by meaning, not keywords).
2. Retrieve - for a question, find the few most relevant chunks.
3. Augment - paste those chunks into the prompt as context.
4. Generate - the model answers using them, and can cite which chunk it used.
Worked example
Question: "What's our refund window?"
RAG retrieves the policy chunk ("refunds within 30 days...") -> the model answers from your real policy, not a guess, and can point to the source.
Steal this - keep RAG honest
System instruction: "Answer ONLY from the provided context.
If the answer isn't there, say you don't know. Cite the source."
Common mistakes (and the fix)
- Garbage retrieval -> garbage answer. Fix: good chunking + the right index matter more than the model.
- No grounding instruction. Fix: force "answer only from context".
- Reaching for fine-tuning for a knowledge problem. Fix: RAG is cheaper and updatable for "answer from our docs".
Good to know
You don't have to code it: NotebookLM is RAG you can use today (upload docs, ask grounded questions), and Custom GPTs / Claude Projects let you attach files for the same effect. Builders use vector DBs (Pinecone, pgvector) + frameworks (LangChain, LlamaIndex).