Fine-tuning vs context (vs RAG vs prompt engineering)
Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Four ways to make an LLM “know” your stuff: long context, prompt engineering, RAG, and fine-tuning. Different costs, different fits. Almost always the answer is “not fine-tuning.”
In plain English
When you want an LLM to behave specifically for your use case — know your company’s docs, follow your style guide, use your jargon, answer in your tone — you have four main techniques:
- Prompt engineering — write a great prompt. Free. Instant. Limited.
- Long context — paste everything relevant into the prompt. Easy. Per-request cost. Limited by context window size.
- RAG (Retrieval Augmented Generation) — store your docs in a database; pull the relevant chunks at query time; include them in the prompt. Scales. Updates easily. More moving parts.
- Fine-tuning — retrain the model on your data. Permanent. Expensive upfront. Hard to update. Powerful when right.
People often reach for fine-tuning first because it sounds most impressive. It’s almost always the wrong starting point. Prompt engineering and RAG combined cover most use cases at lower cost and faster iteration.
This entry untangles the four, explains when each fits, and gives you a decision framework.
Why it matters
- The wrong technique wastes money and time. Fine-tuning a model when prompt engineering would work is overkill. Building RAG when context-stuffing would suffice adds complexity for no gain.
- The right technique unlocks capabilities cheaply. Knowing the trade-offs lets you reach for the simplest thing that works.
- The choice changes how you architect. Fine-tuning means a custom model. RAG means a vector database. Long context means a long prompt every call. These choices ripple through your system.
The four techniques in depth
1. Prompt engineering
You don’t change the model. You don’t add data. You just write a better prompt: clearer instructions, examples, formatting requirements, “respond as a senior engineer,” etc.
Pros:
- Free
- Instant — no setup
- Iterates in seconds
- Works with any model
Cons:
- Limited to what fits in the prompt
- No persistent memory across calls
- Can’t really teach the model new facts
Use when: the task is fundamentally about clearer instructions, formatting, or persona. See Prompt engineering.
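To make the ingredients above concrete, here is a minimal sketch of assembling a prompt from a persona, explicit rules, few-shot examples, and a format requirement. The `PromptSpec` class and `build_prompt` function are hypothetical names invented for this illustration and are not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for the pieces of an engineered prompt."""
    persona: str
    instructions: list[str]
    # (input, output) pairs used as few-shot examples
    examples: list[tuple[str, str]] = field(default_factory=list)
    output_format: str = ""


def build_prompt(spec: PromptSpec, question: str) -> str:
    """Join persona, rules, examples, format, and question into one prompt string."""
    parts = [f"You are {spec.persona}."]
    if spec.instructions:
        rules = "\n".join(f"- {rule}" for rule in spec.instructions)
        parts.append(f"Follow these rules:\n{rules}")
    for example_in, example_out in spec.examples:
        parts.append(f"Example input: {example_in}\nExample output: {example_out}")
    if spec.output_format:
        parts.append(f"Respond in this format: {spec.output_format}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


spec = PromptSpec(
    persona="a senior engineer",
    instructions=["Be concise.", "State assumptions explicitly."],
    examples=[("What is a mutex?", "A lock that lets one thread at a time enter a critical section.")],
    output_format="one short paragraph",
)
prompt = build_prompt(spec, "What is a semaphore?")
print(prompt)
```

Keeping the pieces separate like this makes it easy to change one element (say, the examples) and compare results, which is where the "iterates in seconds" advantage comes from.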
2. Long context
Modern LLMs have huge context windows — 200K tokens for Claude Opus, up to 1–2M tokens for some Gemini models. You can stuff the entire relevant material into the prompt.
System: You are a customer support agent. Use the company's full
support documentation below to answer the user's question accurately.
<documentation>
[Pages and pages of docs — say 50K tokens of them]
</documentation>
User: How do I cancel my subscription?
Pros:
- Trivial to implement
- Easy to update (just change the docs)
- Works with any model that has the context length
Cons:
- Costs scale with prompt size on every call (mitigated by prompt caching)
- Quality may degrade for very long contexts (“lost in the middle” effect)
- Limited by the actual context window
Use when: your reference material is under ~50–100K tokens, and you will either query it often (so prompt caching pays off) or query it rarely (so the full prompt cost is a one-off).
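A rough sketch of the long-context approach follows: check that the docs fit a token budget, then wrap them in the same `<documentation>` tags as the example prompt. The four-characters-per-token estimate is a crude assumption for English text, and the function names and the 100K default budget are hypothetical; a real system would use the provider's tokenizer and actual window size.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic (an assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_long_context_prompt(
    docs: list[str], question: str, max_doc_tokens: int = 100_000
) -> str:
    """Wrap every doc in <documentation> tags, refusing if the budget is exceeded."""
    joined = "\n\n".join(docs)
    needed = estimate_tokens(joined)
    if needed > max_doc_tokens:
        raise ValueError(
            f"docs need ~{needed} tokens but the budget is {max_doc_tokens}; consider RAG"
        )
    return (
        "You are a customer support agent. Use the documentation below "
        "to answer the user's question accurately.\n"
        f"<documentation>\n{joined}\n</documentation>\n"
        f"User: {question}"
    )


long_prompt = build_long_context_prompt(
    ["To cancel, open Settings, choose Billing, and click Cancel plan."],
    "How do I cancel my subscription?",
)
print(long_prompt)
```

Failing loudly when the material no longer fits is a deliberate choice here: it marks the point where the corpus has outgrown this technique.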
3. RAG (Retrieval Augmented Generation)
You store your documents (broken into chunks) in a vector database. At query time, find the most relevant chunks via semantic search, then include those chunks in the prompt.
Pros:
- Scales to arbitrarily large knowledge bases
- Cheap to update (re-index changed docs)
- Citations are natural (you know which chunks were retrieved)
- Cost per query is bounded by retrieval size, not corpus size
Cons:
- More infrastructure (vector DB, embedding model, retrieval logic)
- Quality depends on retrieval quality
- Bad chunks → bad answers
- Setup takes longer than long context
Use when: corpus exceeds the context window, OR you need to query specific subsets, OR you want easy updating.
Full deep-dive: RAG.
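The chunk, retrieve, and include flow described above can be sketched in a few lines. This is an illustration only: word-count vectors with cosine similarity stand in for a real embedding model, and a Python list stands in for the vector database. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into fixed-size word windows (a deliberately naive chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_rag_prompt(query: str, retrieved: list[str]) -> str:
    """Number the retrieved chunks so the answer can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return f"Answer using only the numbered sources below and cite them.\n{context}\nQuestion: {query}"


docs = [
    "To cancel your subscription, open Settings, choose Billing, and click Cancel plan.",
    "Password resets are sent by email and expire after one hour.",
    "Invoices are generated on the first day of each month.",
]
chunks = [piece for doc in docs for piece in chunk(doc)]
top = retrieve("How do I cancel my subscription?", chunks, k=1)
print(build_rag_prompt("How do I cancel my subscription?", top))
```

Note that the prompt size depends on `k`, not on how many documents are indexed, which is why per-query cost stays bounded as the corpus grows.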
4. Fine-tuning
You take a base model and continue training it on your specific data. The model’s weights actually change. After fine-tuning, the model intrinsically “knows” the patterns in your data.
Pros:
- Can teach the model new style, tone, voice
- Faster inference (no need to include training material in prompt)
- Can sometimes outperform RAG for stylistic / pattern tasks
- Lower per-query cost (smaller prompts)
Cons:
- Expensive upfront (training compute + data prep)
- Slow to iterate (each retraining is a project)
- Hard to update (new facts → retrain or supplement with RAG)
- Limited model availability (only some providers offer fine-tuning; Claude has limited fine-tuning availability as of 2026)
- Can degrade general performance
- Requires real expertise and data hygiene
Use when: style/tone matters more than facts, AND you have lots of high-quality training examples, AND you have the budget and expertise.
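Since most of the effort in fine-tuning goes into the training examples, a small sketch of the data hygiene step may help. It drops incomplete pairs and duplicates, then serializes to JSONL. The `input`/`output` field names are an assumption made for illustration; each provider defines its own training file schema, so check the relevant documentation before using a format like this.

```python
import json


def prepare_examples(raw: list[dict]) -> list[dict]:
    """Strip whitespace, then drop incomplete pairs and case-insensitive duplicates."""
    seen = set()
    clean = []
    for row in raw:
        prompt = str(row.get("input") or "").strip()
        completion = str(row.get("output") or "").strip()
        if not prompt or not completion:
            continue  # incomplete pair: nothing to learn from
        key = (prompt.lower(), completion.lower())
        if key in seen:
            continue  # duplicate: would over-weight this example
        seen.add(key)
        clean.append({"input": prompt, "output": completion})
    return clean


def to_jsonl(examples: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


raw = [
    {"input": "Ticket: app crashes on login", "output": "bug"},
    {"input": " ticket: app crashes on login ", "output": "Bug"},
    {"input": "Ticket: please add dark mode", "output": "feature_request"},
    {"input": "", "output": "bug"},
]
examples = prepare_examples(raw)
print(to_jsonl(examples))
```

Real pipelines usually add further checks (label consistency, length limits, a held-out evaluation split), but even this much catches common problems early.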
A decision framework
When you need an LLM to handle your specific case:
Start: Can prompt engineering alone get you to "good enough"?
├── Yes → Stop. Use prompt engineering.
└── No → Next question
Does your reference material fit in the context window
(with caching for cost)?
├── Yes → Use long context.
└── No → Next question
Is the corpus large but the per-query relevant subset small?
├── Yes → Use RAG.
└── No → Next question
Is the challenge mostly STYLE / TONE / PATTERN, not facts?
├── Yes → Consider fine-tuning.
└── No → Use RAG (perhaps with better retrieval).
Are you sure you need fine-tuning?
├── Probably not → Try RAG harder first.
└── Yes → Fine-tune (and probably also use RAG for facts).
The honest answer in 2026: most production AI products combine prompt engineering + RAG. Fine-tuning is for specific, mature use cases.
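The decision tree above can also be written as a small function, which is sometimes handy for documenting why a project chose a given approach. This is a direct transcription of the questions in the tree; the `UseCase` fields and the `recommend` name are hypothetical, and the answers still require human judgment.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Answers to the framework's questions, in the order they are asked."""
    prompt_alone_good_enough: bool
    material_fits_context: bool
    large_corpus_small_relevant_subset: bool
    mostly_style_not_facts: bool
    sure_fine_tuning_needed: bool = False


def recommend(case: UseCase) -> str:
    """Walk the questions top to bottom and return the first technique that fits."""
    if case.prompt_alone_good_enough:
        return "prompt engineering"
    if case.material_fits_context:
        return "long context"
    if case.large_corpus_small_relevant_subset:
        return "RAG"
    if not case.mostly_style_not_facts:
        return "RAG (perhaps with better retrieval)"
    if case.sure_fine_tuning_needed:
        return "fine-tuning (probably with RAG for facts)"
    return "try RAG harder first"


support_bot = UseCase(
    prompt_alone_good_enough=False,
    material_fits_context=False,
    large_corpus_small_relevant_subset=True,
    mostly_style_not_facts=False,
)
print(recommend(support_bot))
```

The ordering matters: cheaper techniques are checked first, so fine-tuning is only reached after every simpler option has been ruled out.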
Real-world examples
Customer support bot
- Knowledge base → RAG over your docs
- Tone/voice → prompt engineering (“You are a friendly senior support agent”)
- Routing/escalation → tool use
- No fine-tuning needed for most cases
Code assistant for your specific codebase
- Codebase context → file tools (Claude Code-style “explore the code”) OR RAG over symbols
- Coding style → prompt engineering (“Follow these conventions: …”) plus a few-shot example file
- No fine-tuning needed
Style-specific content writer (mimicking a specific author’s voice)
- Persona → prompt engineering may suffice
- If style is highly specific and prompting falls short → fine-tuning might pay off
- Facts → long context or RAG
- One of the rare fine-tuning fits
Translating into a specific dialect
- Examples in prompt → may work
- For consistency at volume → fine-tuning makes sense
- Another reasonable fine-tuning fit
Medical / legal / domain expert assistant
- Authoritative knowledge → RAG over verified sources
- Domain reasoning → frontier model + good prompt
- Citations → RAG with source tracking
- Fine-tuning unnecessary if you trust the base model’s reasoning
Combining techniques
These aren’t mutually exclusive. Many serious systems combine:
- Prompt engineering + RAG — the standard production combo
- Long context + RAG — fine-grained retrieval, broad context for orientation
- Fine-tuning + RAG — fine-tuned for style/format, RAG for facts
- All four — for the most demanding production systems
The trick is using each for what it’s best at:
- Facts → RAG
- Style/tone → prompt or fine-tuning
- Orientation → long context (e.g. a project overview always in prompt)
- Instructions → prompt engineering
Cost comparison (rough)
For a customer support bot answering 10,000 queries/month:
| Approach | Monthly cost (rough) | Setup time |
|---|---|---|
| Prompt engineering only | $30-100 | Hours |
| Prompt + long context (cached) | $50-200 | A day |
| Prompt + RAG | $50-300 + vector DB cost | A week |
| Fine-tuned model | $5000+ training, then per-query | Weeks-months |
Numbers vary wildly. But the order is consistent: prompt < long context < RAG < fine-tuning.
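To see where figures like these come from, here is a back-of-the-envelope calculation for the long-context row. The price of $3 per million input tokens, the 50K-token static documentation, and the 200-token question are all illustrative assumptions, not quotes from any provider. The sketch counts input tokens only and ignores output tokens, any surcharge for writing to the cache, and cache expiry.

```python
def monthly_input_cost(
    queries: int,
    static_tokens: int,
    dynamic_tokens: int,
    price_per_million: float,
    cached_read_multiplier: float = 1.0,
) -> float:
    """Input-token cost per month; the static part may be billed at a cached rate."""
    static_cost = queries * static_tokens / 1_000_000 * price_per_million * cached_read_multiplier
    dynamic_cost = queries * dynamic_tokens / 1_000_000 * price_per_million
    return static_cost + dynamic_cost


# Assumed workload: 10,000 queries/month, 50K tokens of docs, 200-token questions.
uncached = monthly_input_cost(10_000, 50_000, 200, price_per_million=3.0)
cached = monthly_input_cost(10_000, 50_000, 200, price_per_million=3.0, cached_read_multiplier=0.1)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumptions the uncached bill is about $1,506 per month and the cached one about $156, which is why the table lists long context as cached.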
When fine-tuning IS the right answer
Genuinely valid fine-tuning use cases:
- You need a smaller, cheaper model to perform like a bigger one on a specific task (distillation).
- The task is highly stylistic and prompting can’t capture the patterns. (Lyric generation in a specific artist’s voice. Medical SOAP notes in a specific format.)
- The task is repetitive at huge volume and fine-tuning amortizes. (Categorizing millions of support tickets.)
- Compliance requires it. (On-prem fine-tuned model that never sends data out.)
- You have thousands of high-quality examples showing exactly the input→output you want.
If your use case doesn’t clearly fit one of these, start with RAG.
Common gotchas
- “Fine-tune to give the model new facts.” Mostly doesn’t work well. Fine-tuning teaches patterns and styles, not facts. Facts go in RAG.
- Fine-tuning a small model to compete with a frontier model. Sometimes works for narrow tasks; rarely for broad reasoning. The frontier model’s general capabilities are hard to beat.
- Underestimating data prep. Fine-tuning is 10% training, 90% getting clean labeled examples. Most fine-tuning projects fail at data, not training.
- RAG with no chunking strategy. Throwing whole documents at the vector DB gives bad retrievals. Chunk thoughtfully.
- Long context with bad organization. Stuffing 100K tokens of disorganized docs into the prompt is not helpful. Structure matters.
- Not measuring quality. Without evaluation, you can’t tell if your fine-tuning / RAG / prompts are actually better than the baseline.
- Fine-tuning then never updating. Models go stale. Your docs change. A fine-tuned model knows nothing about the new material unless you retrain.
- Choosing fine-tuning for “personalization.” Per-user personalization is much better done via context (load the user’s data into the prompt) than per-user fine-tuning.
- Ignoring prompt caching for long context. Without caching, you pay full input price every time. With caching, the static portion costs about 10% of the normal input price.
- Mistaking faster inference for better outcomes. A fine-tuned smaller model is faster but may produce worse answers. Measure end-to-end quality, not just latency.
- “My fine-tuning is overfitting / over-confident.” Common. A model fine-tuned only on positive examples becomes excessively positive. Include negative examples too.
- Hybrid systems’ complexity. Combining all four techniques in one system is powerful but creates a lot of moving parts. Make sure the wins justify the complexity.
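On the "not measuring quality" point, even a tiny evaluation harness beats guessing. The sketch below scores two configurations on the same labeled cases using exact-match accuracy. The `baseline` and `candidate` functions are hypothetical stand-ins for real systems (for example, a plain prompt versus prompt plus RAG), and exact match is only suitable for tasks with short, unambiguous answers such as ticket categories.

```python
from typing import Callable


def accuracy(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the system's answer equals the expected label."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    hits = sum(
        1 for question, expected in cases
        if system(question).strip().lower() == expected.lower()
    )
    return hits / len(cases)


# Hypothetical stand-ins for the two configurations being compared.
def baseline(question: str) -> str:
    return "billing"  # always guesses the same category


def candidate(question: str) -> str:
    return "bug" if "crash" in question.lower() else "billing"


cases = [
    ("The app crashes when I log in", "bug"),
    ("I was charged twice", "billing"),
    ("Crash on startup after update", "bug"),
    ("Where is my invoice?", "billing"),
]
print(f"baseline: {accuracy(baseline, cases):.2f}, candidate: {accuracy(candidate, cases):.2f}")
```

Running every change against the same fixed cases is what makes the comparison meaningful; free-form answers would need a richer scoring method than exact match.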
See also
- What is an LLM? 🟩
- Tokens & context windows 🟩
- Prompt engineering 🟩
- RAG — Retrieval Augmented Generation 🟩
- Embeddings 🟩 — underlying RAG
- The Claude API 🟩 🟦
- Claude models 🟩 🟦
- Multimodal 🟩 — what about images / audio?
- Tool use 🟩 — orthogonal but often combined
- Agents 🟩
- Glossary: RAG, LLM
Sources
- Anthropic — When to use RAG vs long context vs fine-tuning — Contextual Retrieval announcement covers the trade-offs
- OpenAI — Fine-tuning guide
- Hugging Face — PEFT (Parameter Efficient Fine-Tuning) — modern fine-tuning techniques
- LangChain — Conceptual guide on RAG
- Prompt engineering vs fine-tuning vs RAG — comparison articles — running discussion in the field