Temperature & sampling
Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: The dials that control how creative vs predictable the model’s output is. Same prompt, different temperature, different answer.
In plain English
When an LLM produces a response, it doesn’t deterministically pick “the next token.” At each step, it computes a probability distribution over every possible next token — "the" 30%, "a" 12%, "some" 4%, etc., for every word and partial word in its vocabulary.
How the model chooses which token to actually output from that distribution is called sampling. The main control over sampling is temperature.
- Temperature 0 — always pick the most probable token. Deterministic. Same prompt → same output. (Well, approximately — see gotchas.)
- Temperature 1 — sample exactly according to the model’s computed probabilities. Some variation. Default for most APIs.
- Temperature > 1 — flatten the distribution; make lower-probability tokens more likely. More creative, more chaotic, more error-prone.
A handful of related controls (top_p, top_k, seed) refine the sampling further. Most of the time you only touch temperature.
Why it matters
Three reasons:
- You can tune the model’s behavior — predictable for structured tasks, creative for writing.
- You can reproduce results when needed (e.g. debugging, testing).
- You can explain variance when the same prompt gives different answers.
If you don’t know about temperature, the LLM’s randomness looks mysterious. Once you do, you can control it.
How temperature actually works
The model produces logits (raw scores) for each possible next token. To turn those into probabilities, it divides each logit by the temperature T and then applies the softmax function:
probability(token i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
Plain English:
- T = 0 → taken literally this divides by zero, so implementations treat it as the limit: the most probable token gets probability 1, all others get 0 (greedy / argmax)
- T = 1 → standard softmax, sample naturally
- T = 2 → distribution gets flattened — rare tokens become more likely
- T = 0.5 → distribution gets sharpened — top tokens become even more dominant
You can visualize it as a thermostat on the model’s confidence. Cold (low T) → “I’ll pick what I’m sure about.” Hot (high T) → “let me try something unexpected.”
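The formula above can be sketched in a few lines of Python. This is a minimal illustration, not any provider's implementation: the function name and the three example logits are made up, and T = 0 is rejected because real systems handle it as a separate argmax path.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after dividing each by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

Running it shows the top token's share shrinking as T rises, while the least likely token's share grows.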
When to use which temperature
| Task | Temperature | Why |
|---|---|---|
| Code generation | 0 to 0.2 | You want correct, conventional code. Variation introduces bugs. |
| Math, calculation | 0 | One right answer. |
| Structured extraction (JSON output) | 0 to 0.3 | Reliable parsing. |
| Factual Q&A | 0 to 0.3 | Reduce hallucination risk. |
| Summarization | 0.3 to 0.5 | Some variation in phrasing is fine. |
| Editing / rewriting prose | 0.3 to 0.7 | Want a coherent, slightly varied output. |
| Creative writing | 0.7 to 1.0 | Variation is the point. |
| Brainstorming alternatives | 0.8 to 1.2 | You want diverse ideas. |
| “Surprise me” generation | 1.0+ | Maximally explore. |
API defaults vary: Anthropic’s is 1.0, and some other tools use lower values such as 0.7. Override based on the task.
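One way to keep these choices out of scattered call sites is a small lookup, sketched below. The task labels and the `temperature_for` helper are hypothetical; the ranges are copied from the table above, and taking the midpoint is just one reasonable starting point.

```python
# Hypothetical task labels mapped to the (low, high) ranges from the table.
TEMPERATURE_RANGES = {
    "code": (0.0, 0.2),
    "math": (0.0, 0.0),
    "extraction": (0.0, 0.3),
    "factual_qa": (0.0, 0.3),
    "summarization": (0.3, 0.5),
    "rewriting": (0.3, 0.7),
    "creative": (0.7, 1.0),
    "brainstorming": (0.8, 1.2),
}

def temperature_for(task):
    """Return the midpoint of the task's range as a starting temperature."""
    low, high = TEMPERATURE_RANGES[task]
    return round((low + high) / 2, 2)

print(temperature_for("code"))           # midpoint of 0 to 0.2
print(temperature_for("brainstorming"))  # midpoint of 0.8 to 1.2
```

Check the result against your provider's allowed range before sending it; some APIs reject values above their maximum.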
Other sampling controls
top_p (nucleus sampling)
Restricts sampling to the smallest set of most-probable tokens whose cumulative probability reaches p. Setting top_p = 0.9 means: sort tokens by probability, take the top ones until their probabilities sum to at least 90%, then sample from that subset.
Acts as a guard against very low-probability outputs even with high temperature. Often used alongside temperature: temperature=0.9, top_p=0.95.
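The sort-and-accumulate procedure can be sketched as follows. The function name and the four-token distribution are illustrative assumptions; real implementations work on tensors over the full vocabulary.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose total reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    return {token: p / cumulative for token, p in kept.items()}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "some": 0.15, "zebra": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)  # "zebra" falls outside the nucleus and is dropped
```

Note that the size of the kept set adapts to the distribution: a confident model keeps few tokens, an uncertain one keeps many.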
top_k
Restricts sampling to the k most probable tokens. top_k = 50 means “only consider the 50 most probable tokens.” Less common in modern APIs; top_p has largely supplanted it.
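For comparison with top_p, here is a minimal top_k sketch. The function name and example distribution are assumptions for illustration only.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize them."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "some": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))  # only "the" and "a" remain
```

Unlike top_p, the cutoff here is a fixed count, so it keeps the same number of tokens whether the model is confident or not.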
Repetition penalty
Reduces the probability of recently output tokens. Prevents the model from getting stuck repeating (“the the the the”). Some APIs expose this as a parameter (for example, frequency or presence penalties); others handle it internally.
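One simple form of this idea subtracts a fixed amount from a token's logit for each recent occurrence. The sketch below assumes that subtractive form; actual implementations differ between providers and libraries, and the names here are hypothetical.

```python
from collections import Counter

def apply_frequency_penalty(logits, recent_tokens, penalty):
    """Lower each token's logit by penalty times its count in recent_tokens."""
    counts = Counter(recent_tokens)  # missing tokens count as 0
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}

# Hypothetical logits; "the" has already been emitted twice.
logits = {"the": 3.0, "cat": 2.5}
adjusted = apply_frequency_penalty(logits, ["the", "the"], penalty=0.5)
print(adjusted)  # "the" now scores below "cat"
```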
Stop sequences
Lists of strings that, if generated, immediately stop the output. Useful for constraining the format (e.g. stop=["\n\n"] to stop at a paragraph break).
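The effect of a stop sequence can be approximated on the client side by cutting the text at the first match. This is only an illustration with a hypothetical helper; an API that supports stop sequences halts generation on the server, which also saves tokens.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

output = "First paragraph.\n\nSecond paragraph."
print(truncate_at_stop(output, ["\n\n"]))  # keeps only the first paragraph
```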
Seed
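If supported, fixes the random-number generator’s seed. With the same seed, prompt, and parameters, sampled outputs (temperature above 0) become repeatable across runs; at temperature 0 there is no random draw for the seed to control. (See gotcha: even with seed, some non-determinism can remain.)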
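The role of a seed is easy to see with Python's standard `random` module. This is a toy sampler over a made-up distribution, not how any provider implements seeding.

```python
import random

def sample_tokens(probs, n, seed):
    """Draw n tokens from probs using a private, seeded generator."""
    rng = random.Random(seed)  # the seed fully determines the sequence of draws
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "some": 0.2}

run_1 = sample_tokens(probs, 10, seed=42)
run_2 = sample_tokens(probs, 10, seed=42)
print(run_1 == run_2)  # same seed, same draws
```

Hosted models add other sources of variation on top of this, so a seed narrows the variance rather than guaranteeing identical output.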
A concrete example
Same prompt: "Give me a one-line product tagline for a coffee shop."
Temperature 0:
“Where every cup tells a story.”
(Same answer every time. Safe. Generic.)
Temperature 0.7:
“Bold beans, brighter mornings.” “Coffee that meets you where you are.” “Beans first, conversations second.”
(Different each time. Reasonable, on-brand.)
Temperature 1.5:
“Espresso for renegades.” “Caffeine therapy, no appointment required.” “Beans go in, vibes come out.”
(Wilder. Some hits, some misses.)
Temperature 2.5:
“Coffee, but make it electric.” “Brewing with intention since whenever.” “[grammatically odd output increases]”
(Often unhinged. Sometimes brilliant.)
You’d pick the temperature based on what you want. For a polished marketing brief, 0.7. For a brainstorm where you’ll cherry-pick, 1.2.
Reasoning models and “extended thinking”
A different mechanism, separate from temperature, is the model’s reasoning budget. Modern reasoning models (Claude with extended thinking, OpenAI’s o-series) can spend extra tokens “thinking” internally before producing the final answer. The visible output uses normal sampling; the thinking step is governed by a separate budget setting rather than by a sampling dial.
Extended thinking is the right knob for “I want more accurate output.” Temperature is the right knob for “I want more varied output.” They’re orthogonal.
Common gotchas
- Temperature 0 isn’t strictly deterministic. On the same model version, with the same prompt, T=0 almost always gives the same output, but floating-point and batching effects on the serving hardware can occasionally flip a near-tie between tokens. Across model versions, even patch versions, outputs can differ slightly. For the closest thing to reproducibility, pin model version + seed (where available).
- Mixing temperature and structured output can backfire. If you’re using tool use to force JSON output and the temperature is high, the JSON values can become bizarre. Keep temperature low (0 to 0.3) for structured tasks.
- High temperature isn’t smarter; it’s more random. “Be creative” isn’t a setting you flip; it’s an instruction you give in the prompt. Use temperature to add variance, not to add intelligence.
- Low temperature can produce stuck loops, especially with long outputs. If the model keeps repeating itself, lowering the temperature further won’t help; sometimes you need to raise it slightly or use a repetition penalty.
- Different providers have different defaults and ranges. Anthropic’s API default is 1.0 (range 0 to 1); OpenAI’s API default is also 1.0 (range 0 to 2); Google’s varies by model. Always specify temperature explicitly for production code.
- top_p and temperature interact. Setting both is fine but tune one at a time. Most teams use temperature alone; reach for top_p when temperature alone doesn’t give you the variability profile you want.
- Streaming and temperature are independent. Sampling happens token-by-token whether or not you stream the output. Streaming just affects when you receive the tokens.
- Cached responses don’t sample. If you cache the response to a prompt, the temperature setting was already “used” when the original generation happened. Replays return the same output.
- For agents, low temperature is usually right. You want predictable, reliable action selection. Save high temperature for end-user-facing generation steps.
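The caching gotcha in the list above can be made concrete with a toy sketch. Everything here is hypothetical: `generate` is a stand-in for a model call that picks from canned taglines, and the cache is a plain dictionary.

```python
import random

_cache = {}

def generate(prompt, temperature, rng):
    """Stand-in for a model call: random pick when temperature > 0."""
    options = [
        "Where every cup tells a story.",
        "Bold beans, brighter mornings.",
        "Espresso for renegades.",
    ]
    return options[0] if temperature == 0 else rng.choice(options)

def cached_generate(prompt, temperature, rng):
    """Return the stored response if present; sample only on a cache miss."""
    key = (prompt, temperature)
    if key not in _cache:
        _cache[key] = generate(prompt, temperature, rng)  # sampling happens once
    return _cache[key]

rng = random.Random()
first = cached_generate("coffee tagline", 1.0, rng)
replays = {cached_generate("coffee tagline", 1.0, rng) for _ in range(20)}
print(replays == {first})  # every replay returns the first sample
```

If you want fresh variation on each request, bypass or key the cache differently rather than raising the temperature.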
See also
- What is an LLM? 🟩
- How LLMs work 🟩 — softmax mechanics
- Tokens & context windows 🟩
- Prompt engineering 🟩
- Claude models 🟩 🟦
- The Claude API 🟩 🟦
- Tool use 🟩 — low temperature recommended for tool use
- Agents 🟩
- Glossary: Token