Fine-tuning vs context (vs RAG vs prompt engineering)

Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Four ways to make an LLM “know” your stuff. Long context, prompt engineering, RAG, fine-tuning. Different costs, different fits. Almost always the answer is “not fine-tuning.”


In plain English

When you want an LLM to behave specifically for your use case — know your company’s docs, follow your style guide, use your jargon, answer in your tone — you have four main techniques:

  1. Prompt engineering — write a great prompt. Free. Instant. Limited.
  2. Long context — paste everything relevant into the prompt. Easy. Per-request cost. Limited by context window size.
  3. RAG (Retrieval Augmented Generation) — store your docs in a database; pull the relevant chunks at query time; include them in the prompt. Scales. Updates easily. More moving parts.
  4. Fine-tuning — retrain the model on your data. Permanent. Expensive upfront. Hard to update. Powerful when right.

People often reach for fine-tuning first because it sounds most impressive. It’s almost always the wrong starting point. Prompt engineering and RAG combined cover most use cases at lower cost and faster iteration.

This entry untangles the four, explains when each fits, and gives you a decision framework.


Why it matters

  • The wrong technique wastes money and time. Fine-tuning a model when prompt engineering would work is overkill. RAG when context-stuffing would suffice adds complexity for no win.
  • The right technique unlocks capabilities cheaply. Knowing the trade-offs lets you reach for the simplest thing that works.
  • The choice changes how you architect. Fine-tuning means a custom model. RAG means a vector database. Long context means a long prompt every call. These choices ripple through your system.

The four techniques in depth

1. Prompt engineering

You don’t change the model. You don’t add data. You just write a better prompt: clearer instructions, examples, formatting requirements, “respond as a senior engineer,” etc.

Pros:

  • Free
  • Instant — no setup
  • Iterates in seconds
  • Works with any model

Cons:

  • Limited to what fits in the prompt
  • No persistent memory across calls
  • Can’t really teach the model new facts

Use when: the task is fundamentally about clearer instructions, formatting, or persona. See Prompt engineering.
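
As a rough illustration of what "clearer instructions, examples, formatting requirements" can look like in practice, here is a minimal sketch that assembles a prompt from a task description, a list of rules, and a few worked examples. The function name `build_prompt` and the "Input:/Output:" layout are assumptions made for this example, not a required format.

```python
def build_prompt(task: str, rules: list[str], examples: list[tuple[str, str]], user_input: str) -> str:
    """Assemble instructions, formatting rules, and few-shot examples into one prompt."""
    lines = [task, "", "Rules:"]
    lines += [f"- {rule}" for rule in rules]
    # Each example shows the model one input and the output we want for it.
    for sample_in, sample_out in examples:
        lines += ["", f"Input: {sample_in}", f"Output: {sample_out}"]
    # End with the real input and an open "Output:" for the model to complete.
    lines += ["", f"Input: {user_input}", "Output:"]
    return "\n".join(lines)


prompt = build_prompt(
    task="You are a senior engineer rewriting commit messages.",
    rules=["Reply in one sentence.", "Use the imperative mood."],
    examples=[("fixed the bug in login", "Fix login redirect bug")],
    user_input="added tests for the parser",
)
print(prompt)
```

Nothing here touches the model or adds data: changing a rule or swapping an example is a one-line edit, which is why this technique iterates in seconds.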

2. Long context

Modern LLMs have huge context windows — 200K tokens for Claude Opus, up to 1–2M tokens for some Gemini models. You can stuff the entire relevant material into the prompt.

System: You are a customer support agent. Use the company's full
support documentation below to answer the user's question accurately.

<documentation>
[Pages and pages of docs — say 50K tokens of them]
</documentation>

User: How do I cancel my subscription?

Pros:

  • Trivial to implement
  • Easy to update (just change the docs)
  • Works with any model that has the context length

Cons:

  • Costs scale with prompt size on every call (mitigated by prompt caching)
  • Quality may degrade for very long contexts (“lost in the middle” effect)
  • Limited by the actual context window

Use when: your reference material is under ~50–100K tokens, and you either query it often (cache it so repeat calls are cheap) or query it rarely (the full input cost is paid only a handful of times).
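
A minimal sketch of the "does it fit?" check, assuming a crude heuristic of about 4 characters per token for English text. A real tokenizer will give different counts, and the names `estimate_tokens` and `build_long_context_prompt` are made up for this example.

```python
CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English text, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; use the provider's tokenizer for real budgeting."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_long_context_prompt(docs: list[str], question: str, budget_tokens: int = 100_000) -> str:
    """Wrap all docs in <documentation> tags, refusing if they exceed the budget."""
    body = "\n\n".join(docs)
    needed = estimate_tokens(body) + estimate_tokens(question)
    if needed > budget_tokens:
        raise ValueError(f"~{needed} tokens exceeds budget of {budget_tokens}; consider RAG")
    return (
        "Use the documentation below to answer the question.\n\n"
        f"<documentation>\n{body}\n</documentation>\n\n"
        f"Question: {question}"
    )


print(build_long_context_prompt(["To cancel, open Settings > Billing."], "How do I cancel?"))
```

Failing loudly when the material outgrows the budget is a deliberate choice in this sketch: it marks the point where moving to retrieval is worth considering.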

3. RAG (Retrieval Augmented Generation)

You store your documents (broken into chunks) in a vector database. At query time, find the most relevant chunks via semantic search, then include those chunks in the prompt.

Pros:

  • Scales to arbitrarily large knowledge bases
  • Cheap to update (re-index changed docs)
  • Citations are natural (you know which chunks were retrieved)
  • Cost per query is bounded by retrieval size, not corpus size

Cons:

  • More infrastructure (vector DB, embedding model, retrieval logic)
  • Quality depends on retrieval quality
  • Bad chunks → bad answers
  • Setup takes longer than long context

Use when: corpus exceeds the context window, OR you need to query specific subsets, OR you want easy updating.

Full deep-dive: RAG.
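
The chunk-then-retrieve loop described above can be sketched in a few lines. To stay dependency-free, this example swaps in bag-of-words cosine similarity where a real system would use an embedding model and a vector database, so treat it as a toy stand-in for semantic search. All function names are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words (assumes overlap < size)."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]


def vectorize(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)[:k]


docs = [
    "To cancel your subscription, open Settings, choose Billing, and click Cancel plan.",
    "Invoices are emailed on the first day of each month.",
    "Password resets are available from the login page.",
]
top = retrieve("How do I cancel my subscription?", docs, k=1)
print(top[0])
```

Only the retrieved chunks go into the prompt, which is why per-query cost tracks `k` and the chunk size rather than the size of the corpus.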

4. Fine-tuning

You take a base model and continue training it on your specific data. The model’s weights actually change. After fine-tuning, the model intrinsically “knows” the patterns in your data.

Pros:

  • Can teach the model new style, tone, voice
  • Faster inference (no need to include training material in prompt)
  • Can sometimes outperform RAG for stylistic / pattern tasks
  • Lower per-query cost (smaller prompts)

Cons:

  • Expensive upfront (training compute + data prep)
  • Slow to iterate (each retraining is a project)
  • Hard to update (new facts → retrain or supplement with RAG)
  • Limited model availability (only some providers offer fine-tuning; Claude has limited fine-tuning availability as of 2026)
  • Can degrade general performance
  • Requires real expertise and data hygiene

Use when: style/tone matters more than facts, AND you have lots of high-quality training examples, AND you have the budget and expertise.
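
Most of the work happens before any training run. Below is a minimal data-preparation sketch: it strips whitespace, drops empty and duplicate examples, and holds out a validation split. The `prompt`/`completion` field names and the JSONL layout are assumptions for illustration. Each provider documents its own training-file format, so check that before reusing this.

```python
import json
import random


def prepare_examples(
    pairs: list[tuple[str, str]], val_fraction: float = 0.2, seed: int = 0
) -> tuple[list[dict], list[dict]]:
    """Clean, deduplicate, and split (input, output) pairs into train and validation sets."""
    seen, clean = set(), []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target or (source, target) in seen:
            continue  # skip empty or duplicate examples
        seen.add((source, target))
        clean.append({"prompt": source, "completion": target})  # hypothetical field names
    random.Random(seed).shuffle(clean)
    n_val = max(1, int(len(clean) * val_fraction)) if len(clean) > 1 else 0
    return clean[n_val:], clean[:n_val]


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


raw = [("a", "A"), ("a", "A"), (" ", "B"), ("b", "B"), ("c", "C"), ("d", "D"), ("e", "E")]
train, val = prepare_examples(raw)
print(to_jsonl(train))
```

The held-out validation set is what lets you tell whether a training run actually helped.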


A decision framework

When you need an LLM to handle your specific case:

Start: Can prompt engineering alone get you to "good enough"?
├── Yes → Stop. Use prompt engineering.
└── No → Next question

Does your reference material fit in the context window
(with caching for cost)?
├── Yes → Use long context.
└── No → Next question

Is the corpus large but the per-query relevant subset small?
├── Yes → Use RAG.
└── No → Next question

Is the challenge mostly STYLE / TONE / PATTERN, not facts?
├── Yes → Consider fine-tuning.
└── No → Use RAG (perhaps with better retrieval).

Are you sure you need fine-tuning?
├── Probably not → Try RAG harder first.
└── Yes → Fine-tune (and probably also use RAG for facts).

The honest answer in 2026: most production AI products combine prompt engineering + RAG. Fine-tuning is for specific, mature use cases.
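
The same framework, written as a function so the order of the questions is explicit. The `UseCase` field names are invented for this sketch, and each boolean stands in for a judgment call you would make about your own project.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    prompt_alone_good_enough: bool
    material_fits_context: bool
    relevant_subset_small: bool
    mostly_style_not_facts: bool
    sure_about_fine_tuning: bool = False


def choose_technique(case: UseCase) -> str:
    """Walk the questions in order and return the first technique that fits."""
    if case.prompt_alone_good_enough:
        return "prompt engineering"
    if case.material_fits_context:
        return "long context"
    if case.relevant_subset_small:
        return "RAG"
    if case.mostly_style_not_facts and case.sure_about_fine_tuning:
        return "fine-tuning (plus RAG for facts)"
    # Either the problem is about facts, or fine-tuning is not yet justified.
    return "RAG (improve retrieval first)"


print(choose_technique(UseCase(False, False, True, False)))
```

Fine-tuning is only reachable after every cheaper option has been ruled out, which mirrors the ordering of the questions above.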


Real-world examples

Customer support bot

  • Knowledge base → RAG over your docs
  • Tone/voice → prompt engineering (“You are a friendly senior support agent”)
  • Routing/escalation → tool use
  • No fine-tuning needed for most cases

Code assistant for your specific codebase

  • Codebase context → file tools (Claude Code-style “explore the code”) OR RAG over symbols
  • Coding style → prompt engineering (“Follow these conventions: …”) plus a few-shot example file
  • No fine-tuning needed

Style-specific content writer (mimicking a specific author’s voice)

  • Persona → prompt engineering may suffice
  • If style is highly specific and prompting falls short → fine-tuning might pay off
  • Facts → long context or RAG
  • One of the rare fine-tuning fits

Translating into a specific dialect

  • Examples in prompt → may work
  • For consistency at volume → fine-tuning makes sense
  • Another reasonable fine-tuning fit

Domain-expert assistant (answers grounded in authoritative sources)

  • Authoritative knowledge → RAG over verified sources
  • Domain reasoning → frontier model + good prompt
  • Citations → RAG with source tracking
  • Fine-tuning unnecessary if you trust the base model’s reasoning

Combining techniques

These aren’t mutually exclusive. Many serious systems combine:

  • Prompt engineering + RAG — the standard production combo
  • Long context + RAG — fine-grained retrieval, broad context for orientation
  • Fine-tuning + RAG — fine-tuned for style/format, RAG for facts
  • All four — for the most demanding production systems

The trick is using each for what it’s best at:

  • Facts → RAG
  • Style/tone → prompt or fine-tuning
  • Orientation → long context (e.g. a project overview always in prompt)
  • Instructions → prompt engineering
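
One way to picture the division of labor is a single request assembled from layers. In this sketch the instructions and the project overview are static, while the retrieved chunks change per query. The `system`/`user` dictionary is a generic shape chosen for illustration and not any particular provider's API.

```python
def assemble_request(instructions: str, overview: str, retrieved: list[str], question: str) -> dict[str, str]:
    """Combine the layers: instructions and overview are static, chunks change per query."""
    # Static part: prompt engineering (instructions) plus long context (orientation).
    system = f"{instructions}\n\n<overview>\n{overview}\n</overview>"
    # Dynamic part: RAG results, numbered so the answer can cite them.
    sources = "\n".join(f"[{i}] {text}" for i, text in enumerate(retrieved, start=1))
    user = f"<sources>\n{sources}\n</sources>\n\nQuestion: {question}\nCite sources by number."
    return {"system": system, "user": user}


request = assemble_request(
    instructions="You are a friendly senior support agent.",
    overview="Acme sells a monthly subscription with three plans.",
    retrieved=["To cancel, open Settings > Billing.", "Refunds take 5 business days."],
    question="How do I cancel?",
)
print(request["user"])
```

Keeping the static material together and ahead of the per-query material is also what makes it a natural candidate for prompt caching.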

Cost comparison (rough)

For a customer support bot answering 10,000 queries/month:

Approach                          Monthly cost (rough)                Setup time
Prompt engineering only           $30-100                             Hours
Prompt + long context (cached)    $50-200                             A day
Prompt + RAG                      $50-300 + vector DB cost            A week
Fine-tuned model                  $5000+ training, then per-query     Weeks-months

Numbers vary wildly. But the order is consistent: prompt < long context < RAG < fine-tuning.
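
To see where figures like these come from, here is a back-of-the-envelope cost model. Every price in it is a placeholder chosen for the example, not a quote from any provider. It also ignores cache-write surcharges and assumes cached input is billed at 10% of the normal input price.

```python
def monthly_cost(
    queries: int,
    static_tokens: int,
    dynamic_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 1.00,   # hypothetical USD per million input tokens
    output_price_per_mtok: float = 5.00,  # hypothetical USD per million output tokens
    cached: bool = False,
    cache_read_multiplier: float = 0.10,  # assumption: cached input billed at 10% of normal
) -> float:
    """Monthly spend = static input + per-query input + output, all priced per million tokens."""
    static_rate = input_price_per_mtok * (cache_read_multiplier if cached else 1.0)
    static = queries * static_tokens / 1e6 * static_rate
    dynamic = queries * dynamic_tokens / 1e6 * input_price_per_mtok
    output = queries * output_tokens / 1e6 * output_price_per_mtok
    return static + dynamic + output


# 10,000 queries/month, 50K tokens of docs in every prompt, short question and answer.
uncached = monthly_cost(10_000, static_tokens=50_000, dynamic_tokens=200, output_tokens=300)
cached = monthly_cost(10_000, static_tokens=50_000, dynamic_tokens=200, output_tokens=300, cached=True)
print(f"uncached: ${uncached:,.2f}  cached: ${cached:,.2f}")
```

With these placeholder inputs the script reports $517.00 uncached and $67.00 cached. The absolute numbers mean little, but the gap shows why caching decides whether long context is affordable at volume.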


When fine-tuning IS the right answer

Genuinely valid fine-tuning use cases:

  1. You need a smaller, cheaper model to perform like a bigger one on a specific task. This is distillation: train the small model on the larger model's outputs.
  2. The task is highly stylistic and prompting can’t capture the patterns. (Lyric generation in a specific artist’s voice. Medical SOAP notes in a specific format.)
  3. The task is repetitive at huge volume and fine-tuning amortizes. (Categorizing millions of support tickets.)
  4. Compliance requires it. (On-prem fine-tuned model that never sends data out.)
  5. You have thousands of high-quality examples showing exactly the input→output you want.

If your use case doesn’t clearly fit one of these, start with RAG.
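
For the "amortizes at huge volume" case, the question is how many queries it takes for the per-query saving to repay the upfront cost. A minimal sketch, with every dollar figure in the usage example being a hypothetical input:

```python
import math


def break_even_queries(upfront_cost: float, cost_per_query_before: float, cost_per_query_after: float) -> float:
    """Number of queries after which the fine-tuned model becomes cheaper overall."""
    saving = cost_per_query_before - cost_per_query_after
    if saving <= 0:
        return math.inf  # fine-tuning never pays for itself on cost alone
    return upfront_cost / saving


# Hypothetical: $5,000 to train, $0.010 per query today, $0.002 per query afterwards.
queries_needed = break_even_queries(5_000, 0.010, 0.002)
months_at_10k = queries_needed / 10_000
print(f"break-even after ~{queries_needed:,.0f} queries (~{months_at_10k:.1f} months at 10,000/month)")
```

With these made-up inputs the break-even is about 625,000 queries, which is more than five years at 10,000 queries per month but a matter of weeks for a workload that classifies millions of tickets. The arithmetic covers cost only, not quality or maintenance.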


Common gotchas

  • “Fine-tune to give the model new facts.” Mostly doesn’t work well. Fine-tuning teaches patterns and styles, not facts. Facts go in RAG.

  • Fine-tuning a small model to compete with a frontier model. Sometimes works for narrow tasks; rarely for broad reasoning. The frontier model’s general capabilities are hard to beat.

  • Underestimating data prep. Fine-tuning is 10% training, 90% getting clean labeled examples. Most fine-tuning projects fail at data, not training.

  • RAG with no chunking strategy. Throwing whole documents at the vector DB gives bad retrievals. Chunk thoughtfully.

  • Long context with bad organization. Stuffing 100K tokens of disorganized docs ≠ helpful. Structure matters.

  • Not measuring quality. Without evaluation, you can’t tell if your fine-tuning / RAG / prompts are actually better than the baseline.

  • Fine-tuning then never updating. Models go stale. Your docs change. A fine-tuned model knows nothing about changes made after its training data was collected unless you retrain it or supply the new material through RAG.

  • Choosing fine-tuning for “personalization.” Per-user personalization is much better done via context (load the user’s data into the prompt) than per-user fine-tuning.

  • Ignoring prompt caching for long context. Without caching, you pay full input price every time. With caching, reads of the static portion are billed at a fraction of the normal input price (about 10% on some providers).

  • Mistaking faster inference for better outcomes. A fine-tuned smaller model is faster but may produce worse answers. Measure end-to-end quality, not just latency.

  • “My fine-tuned model is overfitting / over-confident.” This is common. A model fine-tuned only on positive examples learns to be excessively positive. Include negative examples in the training set too.

  • Hybrid systems’ complexity. Combining all four techniques in one system is powerful but creates a lot of moving parts. Make sure the wins justify the complexity.
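
On the measurement point above: even a tiny evaluation harness beats guessing. The sketch below scores two systems on the same test cases using exact-match accuracy, which is a deliberately crude metric chosen to keep the example short. The stub "systems" are plain functions standing in for real model calls.

```python
from typing import Callable


def exact_match_rate(system: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the system's answer equals the expected answer."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    hits = sum(system(q).strip().lower() == expected.strip().lower() for q, expected in cases)
    return hits / len(cases)


def compare(baseline: Callable[[str], str], candidate: Callable[[str], str],
            cases: list[tuple[str, str]]) -> dict:
    """Score both systems on the same cases so the comparison is like-for-like."""
    base, cand = exact_match_rate(baseline, cases), exact_match_rate(candidate, cases)
    return {"baseline": base, "candidate": cand, "improved": cand > base}


# Stub systems for illustration only; replace with calls to your actual pipelines.
cases = [("2+2", "4"), ("capital of France", "Paris")]
answers = {"2+2": "4", "capital of France": "paris"}
report = compare(baseline=lambda q: "4", candidate=lambda q: answers[q], cases=cases)
print(report)
```

Running the baseline and every candidate against the same fixed case list is the part that matters. The scoring function can be swapped for whatever fits the task.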

