RAG — Retrieval Augmented Generation

Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Before answering, look stuff up. The technique that lets an LLM use your knowledge (your docs, your codebase, your wiki) without retraining the model.


In plain English

LLMs are trained on a fixed set of text. They know what was in their training data. They don’t know your private documents, your company wiki, last week’s news, or anything that’s happened since the cutoff.

Retrieval Augmented Generation (RAG) is the workaround. Instead of asking the model to answer from its training data alone, you:

  1. Find relevant chunks of your own data (by searching it — usually by meaning, not keywords)
  2. Stuff those chunks into the prompt as context
  3. Ask the model to answer using that context

The model now grounds its answer in your documents. The architecture is simple but the impact is huge: it’s the difference between an LLM that gives generic answers and one that knows your business.

RAG is how:

  • ChatGPT can answer questions about uploaded PDFs
  • Customer support bots answer using the company’s actual docs
  • Claude Code understands your specific codebase (the file-reading tools are a form of retrieval)
  • “AI search” products (Perplexity, You.com) cite real sources
  • Internal company “ask the wiki” assistants work

Why it matters

The LLM market has settled into two main patterns for grounding answers: fine-tuning (retrain the model on your data — expensive, brittle, slow to update) and RAG (retrieve from a database at query time — cheap, fresh, easy to update). RAG won. For most applications where the LLM needs to know your specific information, RAG is the right answer.

Understanding RAG lets you:

  • Recognize when “the AI is hallucinating” is really “you forgot to give it context”
  • Design AI products that answer accurately from your data
  • Evaluate when RAG vs fine-tuning vs longer context is the right choice
  • Read AI architecture diagrams without being lost

The three steps

Step 1: Ingest your data into a searchable index

Before any queries can happen, you have to prepare your data:

Your docs (PDFs, web pages, markdown files, database records, etc.)
       ↓
Split into "chunks" (typically a few hundred tokens each)
       ↓
For each chunk: compute an embedding (a vector)
       ↓
Store {chunk_text, vector, source metadata} in a vector database

Chunking matters. Too small → chunks lack context. Too large → you waste tokens including irrelevant material. Typical chunk sizes are 200–800 tokens, often with overlap (each chunk shares some text with the next, so a concept spanning chunk boundaries isn’t lost).
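To make the overlap idea concrete, here is a minimal sketch of a fixed-size chunker. It is an illustration under assumptions, not a recommended implementation: it counts whitespace-separated words as a rough stand-in for tokens, and the name chunkText and its options are simply chosen to match the helper used in the example later in this article.

```typescript
// Hypothetical helper: split text into fixed-size chunks that overlap.
// Sizes are measured in words, which only approximates token counts.
function chunkText(
  text: string,
  opts: { size: number; overlap: number }
): string[] {
  const { size, overlap } = opts;
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = size - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

const demoChunks = chunkText("a b c d e f g h i j", { size: 4, overlap: 2 });
console.log(demoChunks); // ["a b c d", "c d e f", "e f g h", "g h i j"]
```

Each chunk repeats the last two words of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.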

Embeddings turn each chunk into a vector of ~768 to ~3072 numbers. Two chunks about similar topics have vectors that are mathematically close. See Embeddings.
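What "mathematically close" means depends on the distance measure. Cosine similarity is one common choice (an assumption here; the article does not name a specific metric). A small sketch with toy 2-number vectors instead of real embeddings:

```typescript
// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```

Real embeddings work the same way, just with hundreds or thousands of dimensions instead of two.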

Vector databases specialize in storing vectors and answering “find me the closest vectors to this one” fast. Popular options: Pinecone, Weaviate, Chroma, Qdrant, Milvus, and Postgres + pgvector (the most common solo-dev choice — your existing Supabase database can do RAG with just an extension).

Step 2: At query time, retrieve relevant chunks

User question: "What's our policy on remote work?"
       ↓
Compute embedding of the question
       ↓
Search the vector DB for the k closest chunks (typically k=3 to k=10)
       ↓
You now have the chunks most semantically similar to the question

The key win over keyword search: semantic similarity captures meaning, not just word overlap. A question about “working from home” can match a document section about “remote arrangements” even though the words don’t overlap.
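Under the hood, "find the k closest chunks" can be sketched as a brute-force scan: score every stored chunk against the question vector and keep the best k. Real vector databases use approximate indexes to avoid scanning everything, so treat this as an illustration only. The StoredChunk shape and the function names are assumptions made for the example.

```typescript
// Hypothetical in-memory shape for an indexed chunk.
interface StoredChunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query and return the k best, best first.
function topK(
  queryEmbedding: number[],
  chunks: StoredChunk[],
  k: number
): StoredChunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosine(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

This is fine for a few thousand chunks; beyond that, the scan cost is what a vector database's index is there to remove.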

Step 3: Stuff the chunks into the prompt and generate

System: You are a helpful HR assistant. Answer using only the provided context.
        If the context doesn't contain the answer, say "I don't have that information."

Context:
[chunk 1: policy paragraph about remote work eligibility]
[chunk 2: section on remote work requests and approval]
[chunk 3: appendix about international remote workers]

User question: What's our policy on remote work?

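The model now has the relevant context in its context window and can ground its answer in the chunks, citing them if you ask it to. Retrieval itself is typically fast (often well under a second); generating the answer usually dominates the total latency.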
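Assembling that prompt is plain string building. A minimal sketch, assuming each retrieved chunk carries a source label (the RetrievedChunk shape and buildPrompt name are made up for this example):

```typescript
// Hypothetical shape for a chunk returned by the retrieval step.
interface RetrievedChunk {
  source: string;
  text: string;
}

// Build the system and user messages in the layout shown above.
function buildPrompt(
  question: string,
  chunks: RetrievedChunk[]
): { system: string; user: string } {
  const system =
    "You are a helpful HR assistant. Answer using only the provided context. " +
    'If the context doesn\'t contain the answer, say "I don\'t have that information."';
  // Number each chunk and tag it with its source so the model can cite it.
  const context = chunks
    .map((c, i) => `[chunk ${i + 1} | ${c.source}]\n${c.text}`)
    .join("\n\n");
  const user = `Context:\n${context}\n\nUser question: ${question}`;
  return { system, user };
}
```

Labelling each chunk with a number and source is a design choice that makes it easy to ask the model for citations later.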


A concrete example: building a tiny RAG system

For a small company wiki, the whole pipeline:

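// === SETUP ===

import OpenAI from "openai"; // any embeddings provider works; OpenAI is used here
import Anthropic from "@anthropic-ai/sdk";
import { createClient } from "@supabase/supabase-js";

// Assumes SUPABASE_URL and SUPABASE_SERVICE_ROLE_KEY are set in the environment.
// The OpenAI and Anthropic clients read OPENAI_API_KEY and ANTHROPIC_API_KEY by default.
const sb = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_ROLE_KEY!);
const openai = new OpenAI();
const anthropic = new Anthropic();

// Turn a piece of text into an embedding vector.
// The doc_chunks.embedding column must match this model's vector size.
async function embed(text: string): Promise<number[]> {
  const res = await openai.embeddings.create({ model: "text-embedding-3-small", input: text });
  return res.data[0].embedding;
}

// Naive character-based chunker; swap in something smarter for production.
function chunkText(text: string, opts: { size: number; overlap: number }): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += opts.size - opts.overlap) {
    chunks.push(text.slice(i, i + opts.size));
  }
  return chunks;
}

// === INGESTION (run once when docs change) ===

async function ingestDocument(doc: { id: string; text: string }) {
  const chunks = chunkText(doc.text, { size: 500, overlap: 100 });
  for (const [idx, chunk] of chunks.entries()) {
    const vector = await embed(chunk);
    const { error } = await sb.from("doc_chunks").insert({
      doc_id: doc.id,
      chunk_idx: idx,
      text: chunk,
      embedding: vector,
    });
    if (error) throw error;
  }
}

// === QUERY (every user question) ===

async function answerQuestion(question: string): Promise<string> {
  const qVector = await embed(question);

  // pgvector similarity search via Supabase RPC.
  // match_chunks is a SQL function you define in your database (not shown).
  const { data: chunks, error } = await sb.rpc("match_chunks", {
    query_embedding: qVector,
    match_count: 5,
  });
  if (error) throw error;

  const context = (chunks ?? [])
    .map((c: { doc_id: string; text: string }) => `[${c.doc_id}]: ${c.text}`)
    .join("\n\n");

  const answer = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content: `Use the following context to answer.\n\nContext:\n${context}\n\nQuestion: ${question}`,
    }],
  });

  // content is a list of blocks; only text blocks carry a .text field
  const first = answer.content[0];
  return first.type === "text" ? first.text : "";
}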

That’s a working RAG skeleton in well under 100 lines (plus a match_chunks SQL function on the database side). The hard parts in production are everywhere this code glosses over: better chunking, hybrid keyword+vector search, re-ranking, citation tracking, monitoring quality.


Common RAG architectures (basic to advanced)

🟩 Basic vector RAG (above)

Embed → store → embed-query → top-k → prompt. Works surprisingly well for many use cases. The right starting point.

🟩🟩 Hybrid (vector + keyword)

Combine vector search with traditional keyword search (BM25). Catches cases where exact word match matters (proper nouns, code identifiers, specific product names) that pure vector search would miss.
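The two result lists have to be merged somehow. One widely used option is Reciprocal Rank Fusion (RRF), which only looks at each item's rank in each list, so you never have to compare BM25 scores with cosine scores directly. The sketch below assumes RRF; the article does not prescribe a specific fusion method, and the function name is made up. The constant k = 60 is a commonly used default.

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per item,
// with rank starting at 1. Items ranked well in several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const previous = scores.get(id) ?? 0;
      scores.set(id, previous + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map((entry) => entry[0]);
}

const vectorRanking = ["a", "b", "c"]; // chunk ids, best first
const keywordRanking = ["b", "d", "a"];
console.log(reciprocalRankFusion([vectorRanking, keywordRanking]));
// ["b", "a", "d", "c"]
```

Chunk "b" wins because it ranks near the top of both lists, even though it is first in only one of them.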

🟩🟩🟩 With re-ranking

First retrieve a broad set (e.g. top 50), then use a smaller, faster model to re-score them for relevance to the question, keeping the top 5. Higher quality, modest extra cost.
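The shape of that second stage is simple: score each candidate against the question, sort, and keep the best few. In this sketch the scorer is pluggable. The wordOverlapScore function is a toy stand-in that just counts shared words; it is not a real re-ranking model, and in practice the scorer would usually be an asynchronous call to one. All names are assumptions for the example.

```typescript
// A scorer rates how relevant a passage is to a question (higher is better).
type RelevanceScorer = (question: string, passage: string) => number;

// Toy scorer: counts question words that also appear in the passage.
function wordOverlapScore(question: string, passage: string): number {
  const passageWords = new Set(
    passage.toLowerCase().split(/\W+/).filter((w) => w.length > 0)
  );
  return question
    .toLowerCase()
    .split(/\W+/)
    .filter((w) => w.length > 0 && passageWords.has(w)).length;
}

// Re-score the broad candidate set and keep only the top `keep` passages.
function rerank(
  question: string,
  candidates: string[],
  score: RelevanceScorer,
  keep: number
): string[] {
  return candidates
    .map((passage) => ({ passage, relevance: score(question, passage) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, keep)
    .map((scored) => scored.passage);
}
```

Keeping the scorer as a parameter means the same pipeline works whether you plug in a toy heuristic for tests or a hosted re-ranking model in production.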

🟩🟩🟩🟩 Agentic RAG

Instead of a single retrieval pass, give the model a “search” tool and let it decide how to query — multiple times, refining as it learns. The model becomes an active researcher. This is what Claude Code does when exploring a codebase.
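The control flow can be sketched as a loop: at each step the model either asks for another search or commits to an answer. Here the model's decision and the search tool are passed in as plain functions so the loop itself is visible. This is a simplified, hypothetical outline; the types and names are invented for the example and are not how any particular product implements it.

```typescript
// What the model can do at each step: search again, or give a final answer.
type AgentStep =
  | { kind: "search"; query: string }
  | { kind: "answer"; text: string };

// Stand-ins for the LLM call and the search tool.
type DecideNext = (question: string, notes: string[]) => AgentStep;
type SearchTool = (query: string) => string[];

function agenticAnswer(
  question: string,
  decide: DecideNext,
  search: SearchTool,
  maxSteps: number = 5
): string {
  const notes: string[] = []; // everything retrieved so far
  for (let step = 0; step < maxSteps; step++) {
    const next = decide(question, notes);
    if (next.kind === "answer") return next.text;
    notes.push(...search(next.query)); // feed results back into the next decision
  }
  // Step budget exhausted without an answer.
  return "I don't have that information.";
}
```

The maxSteps cap matters: without it, a model that keeps deciding to search would loop forever and run up cost.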

Multi-query

Generate several variants of the user’s question (rephrased differently), retrieve for each, deduplicate, then answer. Catches queries where the original phrasing missed the right chunks.
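The deduplication step is the only part with any subtlety: the same chunk can come back for several phrasings, so you keep one copy (here, the one with the best score). A minimal sketch, assuming each hit has an id and a similarity score; the ChunkHit shape and mergeHits name are illustrative.

```typescript
// Hypothetical shape for one retrieval result.
interface ChunkHit {
  id: string;
  text: string;
  score: number;
}

// Merge the results of several query variants, keeping each chunk once
// (with its highest score), then return the best `limit` overall.
function mergeHits(resultSets: ChunkHit[][], limit: number): ChunkHit[] {
  const best = new Map<string, ChunkHit>();
  for (const hits of resultSets) {
    for (const hit of hits) {
      const seen = best.get(hit.id);
      if (seen === undefined || hit.score > seen.score) {
        best.set(hit.id, hit);
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Generating the query variants themselves is an LLM call and is left out here.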

Query decomposition

Break a complex question into sub-questions, retrieve and answer each, then synthesize. Useful for multi-hop reasoning (“what’s the relationship between A and C?” → first find A→B, then B→C).


RAG vs longer context vs fine-tuning — when to use what

| When… | Use |
|---|---|
| Data is small (< 100K tokens), stable | Just put it all in the context window |
| Data is large or changes often | RAG |
| You want the model to be an expert in a domain (style, tone, format) | Fine-tuning |
| You want the model to know specific facts | RAG |
| Mix of both | Fine-tuning for tone + RAG for facts |

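Modern long-context models (Claude with 200K, Gemini with 2M) push the “just put it all in” frontier further. For codebases up to ~1M tokens, “load the whole thing” is increasingly viable on models with a large enough window. When your data fits comfortably in context, RAG is often overkill, though every query then pays the token cost and latency of the full context.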


What makes RAG quality vary so much

In practice, RAG systems range from “magic” to “useless” — and the underlying model is rarely the difference. What matters:

  • Chunking strategy. Bad chunks → bad retrieval. Try paragraph-based chunking, semantic chunking, sliding windows.
  • Embedding model quality. Newer embedding models (OpenAI’s text-embedding-3-large, Cohere’s embed-v3, Voyage AI’s models, which Anthropic’s docs recommend since Anthropic does not ship its own embedding model, and open-source alternatives like nomic-embed-text) are notably better than 2022-era models.
  • Number of chunks retrieved. Too few → missing context. Too many → token bloat + dilution.
  • Hybrid search. Adding keyword search to vector search often gives 10–30% better retrieval quality.
  • Re-ranking. A 200-million parameter re-ranker (much cheaper than the main LLM) can boost quality significantly.
  • Prompt engineering. “Use ONLY the context” vs “use the context as a guide” produces dramatically different outputs.
  • Source citations. Asking the model to cite which chunks it used surfaces hallucinations.
  • Evaluation. RAG systems need ongoing eval — sample queries, judge answers, iterate.
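For the evaluation point, even a very small harness helps. One simple retrieval metric is hit rate at k: for a set of test questions where you know which chunk should come back, what fraction of the time is it in the top k? The sketch below assumes you have labelled such cases by hand; EvalCase, ChunkRetriever, and hitRateAtK are names invented for the example.

```typescript
// A test question paired with the chunk that ought to be retrieved for it.
interface EvalCase {
  question: string;
  expectedChunkId: string;
}

// Stand-in for your retrieval step: returns chunk ids, best first.
type ChunkRetriever = (question: string, k: number) => string[];

// Fraction of cases where the expected chunk appears in the top k results.
function hitRateAtK(
  cases: EvalCase[],
  retrieve: ChunkRetriever,
  k: number
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    if (retrieve(c.question, k).indexOf(c.expectedChunkId) !== -1) {
      hits++;
    }
  }
  return hits / cases.length;
}
```

This only measures retrieval, not answer quality, but it is cheap to run after every change to chunking, embeddings, or search settings.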

Common gotchas

  • Garbage in, garbage out. RAG over a poorly-organized document base produces poor results. Cleaning and structuring source data is often the highest-leverage improvement.

  • Retrieval failures are silent. If the right chunk isn’t retrieved, the model often just makes something up confidently. Add evaluation and citation requirements to surface this.

  • Embedding cost adds up. Re-embedding a large corpus when you change embedding models is expensive. Choose an embedding model with longevity in mind.

  • Stale chunks. If your source data changes and you don’t re-embed, the index drifts. Build re-ingestion into your update flow.

  • Chunk boundaries can split context. A definition on one chunk and its usage on another. Overlap helps but isn’t perfect. Semantic chunking (split at section boundaries) often works better than fixed-size.

  • Vector search ignores metadata. “Only chunks from the 2026 employee handbook” requires combining vector search with metadata filtering — usually supported by vector DBs but you have to wire it up.

  • Re-asking the same question can give different answers. Different retrieval results, different model sampling. For consistency, cache retrieval results or use lower temperature.

  • Long contexts hide RAG failures. When you stuff 50K tokens of context in, the model may “find” answers in noise. Less is more — retrieve fewer, higher-quality chunks.

  • Hybrid > pure vector for most use cases. Don’t skip the keyword search step.

  • For codebases, just reading files is often better than vector RAG. Claude Code’s approach (give the agent Read, Grep, Glob tools and let it explore) outperforms naive vector RAG for code, because code has strong structural signals (file paths, function names, imports) that vector search dilutes.
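To illustrate the metadata gotcha above: the usual fix is to filter on metadata first and rank by similarity second, so an off-limits chunk can never appear in the results no matter how similar it is. A minimal in-memory sketch, assuming embeddings are already normalized so a dot product works as the similarity score. The IndexedChunk fields and function names are made up; a real vector database would apply the filter inside the query.

```typescript
// Hypothetical chunk record with metadata alongside the vector.
interface IndexedChunk {
  text: string;
  source: string;
  year: number;
  embedding: number[];
}

// Dot product; equals cosine similarity when both vectors have length 1.
function dotProduct(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Apply the metadata filter first, then rank what is left by similarity.
function filteredSearch(
  queryEmbedding: number[],
  chunks: IndexedChunk[],
  allow: (chunk: IndexedChunk) => boolean,
  k: number
): IndexedChunk[] {
  return chunks
    .filter(allow)
    .map((chunk) => ({ chunk, score: dotProduct(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Calling filteredSearch with `(c) => c.source === "handbook" && c.year === 2026` restricts results to the 2026 employee handbook before any similarity ranking happens.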

