RAG: Retrieval-Augmented Generation

TL;DR - A normal LLM only knows its training data. RAG (Retrieval-Augmented Generation) fetches the relevant parts of your documents and puts them in the prompt - so the AI answers from your knowledge, with citations. It's an open-book exam.

Why it matters

RAG is what makes AI genuinely useful for organization-specific knowledge: internal help desks, "ask our docs", product Q&A. It's current, private, and citable - without retraining a model.

How it works (4 steps)

1. Prepare  - split your docs into chunks; store them in a search index
              (often a "vector database" that searches by meaning, not keywords).
2. Retrieve - for a question, find the few most relevant chunks.
3. Augment  - paste those chunks into the prompt as context.
4. Generate - the model answers using them, and can cite which chunk it used.

Worked example

Question: "What's our refund window?"

RAG retrieves the policy chunk ("refunds within 30 days...") -> the model answers from your real policy, not a guess, and can point to the source.

Steal this - keep RAG honest

System instruction: "Answer ONLY from the provided context.
If the answer isn't there, say you don't know. Cite the source."

Common mistakes (and the fix)

Garbage retrieval -> garbage answer. Fix: good chunking + the right index matter more than the model.
No grounding instruction. Fix: force "answer only from context".
Reaching for fine-tuning for a knowledge problem. Fix: RAG is cheaper and updatable for "answer from our docs".

Good to know

You don't have to code it: NotebookLM is RAG you can use today (upload docs, ask grounded questions), and Custom GPTs / Claude Projects let you attach files for the same effect. Builders use vector DBs (Pinecone, pgvector) + frameworks (LangChain, LlamaIndex).