Reasoning Models — AI That Thinks Before It Answers
Status: 🟩 COMPLETE 🟦 LIVING Tags: reasoning-models, chain-of-thought, o1, o3, claude-thinking, deep-reasoning, AI-inference
What it is
“Reasoning models” are a category of AI language models that are specifically designed to think through problems step by step before giving a final answer, rather than generating responses immediately. This “thinking” happens inside the model and takes additional time — often seconds to minutes — but produces significantly better results on complex problems.
The breakthrough came in September 2023 when OpenAI released “o1” — a model that could reason through difficult math, science, and logical problems far better than GPT-4 by using an internal chain-of-thought process. This spawned a new generation of reasoning models across all the major labs.
Why this matters: a plain English explanation
Standard language models (like GPT-4 or Claude 3.5 Sonnet) generate text token by token — the next word follows the previous one, using patterns from training. This works brilliantly for most tasks but can fail on problems that require:
- Multi-step logical reasoning (“if A then B, and if B then C, what follows?”)
- Mathematical calculation with many steps
- Complex code debugging
- Scientific problems requiring careful analysis
- Planning tasks with multiple constraints
Reasoning models take a different approach: Before generating the final answer, they “think” — generating an internal scratchpad of intermediate reasoning. This might be: “Let me break this problem down. First, I need to figure out X. Given X, I can determine Y. But wait, Y contradicts Z, so let me reconsider…”
This internal thinking is usually shown to the user (as a collapsible “Thinking…” section) and then a final, more accurate response follows.
It’s the difference between:
- Standard model: “The answer is 42.” (immediate, sometimes wrong)
- Reasoning model: “Let me work through this step by step. [3 minutes of thinking] The answer is 42, because…” (slower, significantly more reliable on complex problems)
The major reasoning models (mid-2026)
OpenAI
| Model | Notes |
|---|---|
| o1 (Sep 2023) | The breakthrough; outperformed GPT-4 on STEM problems dramatically |
| o1-mini | Smaller, faster, cheaper reasoning model |
| o3 (Dec 2024) | Major improvement; new high on most benchmarks |
| o3-mini | Efficient version |
| o4-mini | Mid-2025; best efficiency/performance ratio |
| o3 Pro | Top of the range; slower; most powerful |
Anthropic
| Model | Notes |
|---|---|
| Claude 3.5 Sonnet (extended thinking) | “Extended thinking” mode; shows thinking tokens |
| Claude 3.7 Sonnet | 2025; deeper thinking capability |
| Claude 4 Opus | 2026; most powerful Claude; extensive reasoning |
Google DeepMind
| Model | Notes |
|---|---|
| Gemini 2.0 Flash Thinking | Thinking mode in Flash |
| Gemini 2.5 Pro | Deep Think mode; strong on math and science |
| Gemini 2.5 Flash | Budget reasoning option |
Other notable reasoning models
| Model | Provider | Notes |
|---|---|---|
| DeepSeek R1 | 🇨🇳 ⛔ DeepSeek | Excellent reasoning quality — but Chinese; avoid |
| QwQ-32B | 🇨🇳 ⛔ Alibaba | Good reasoning — but Chinese; avoid |
| Mistral Large 2 | 🇫🇷 Mistral | Extended reasoning capabilities |
| Llama 4 | 🇺🇸 Meta | Open-weights reasoning capabilities |
When to use reasoning models vs standard models
Use reasoning models when:
- Complex maths: Multi-step calculations, algebra, calculus, statistics
- Difficult coding: Debugging complex issues, architecting systems, writing tricky algorithms
- Scientific problems: Chemistry, physics, biology problems requiring step-by-step analysis
- Logic puzzles: Problems with multiple constraints and conditions
- Research analysis: Synthesising information from multiple sources with nuanced conclusions
- Legal/medical analysis: Complex document interpretation requiring careful reasoning
- Strategic planning: Multi-factor decisions with many interdependencies
Use standard models when:
- Writing: Emails, essays, creative content, summaries — standard models are excellent and much faster
- Simple Q&A: Factual questions, explanations, quick lookups
- Code generation for simple tasks: Straightforward functions, scripts
- Conversation: Chat interfaces where speed matters
- High-volume tasks: API tasks where cost matters; reasoning models are more expensive
Rule of thumb: If a smart human would need more than 30 seconds of careful thought to solve the problem, try a reasoning model.
How thinking tokens work
Reasoning models have a concept of thinking tokens — the computational budget for the internal thinking process. You typically can:
- Set thinking budget: Allow more thinking for harder problems (more accurate but slower and more expensive); less thinking for simpler ones
- See the thinking: Most implementations show you the model’s intermediate reasoning (often in a collapsible section)
- Compare thinking quality: Longer, more systematic thinking generally correlates with better answers, but not always
Thinking is billed differently from output tokens in most API implementations. Check pricing carefully when building applications that use extended thinking.
Benchmarks and performance
On the AIME (American Invitational Mathematics Examination) — a notoriously difficult high school competition:
- GPT-4 (before reasoning): ~13% accuracy
- o1: ~83% accuracy
- o3 and Claude 4 Opus: >90% accuracy on various AIME problems
On HumanEval (coding benchmark):
- Standard models: 85–90%
- Reasoning models: 95–99%
On PhD-level science (GPQA Diamond):
- o3: ~80% (human PhD students average ~70%)
- Claude 3.5 Sonnet extended thinking: ~75%
These are significant improvements that unlock use cases previously impossible for AI.
Cost considerations
Reasoning models are significantly more expensive than standard models:
| Model type | Approximate cost (output tokens) |
|---|---|
| Standard (e.g., GPT-4o Mini) | ~$0.60/million |
| Standard frontier (GPT-4o, Claude 3.5 Sonnet) | ~$15/million |
| Reasoning (o3-mini, Claude thinking) | ~$60/million |
| Heavy reasoning (o3, o4, Claude 4 Opus) | ~$150–600/million |
For interactive consumer use: cost is abstracted by subscription. For API/developer use: budget carefully.
The speed tradeoff
Standard models respond in 1–5 seconds. Reasoning models may take:
- o3-mini / Gemini Flash Thinking: 5–30 seconds
- o3 / Claude extended thinking: 30 seconds – 3 minutes
- Most intensive reasoning: Up to 5+ minutes for very hard problems
This is acceptable for asynchronous research tasks but noticeable in interactive chat.
Gotchas
- Reasoning doesn’t eliminate hallucinations. Better reasoning reduces errors but doesn’t eliminate them. Verify factual claims from reasoning models just as you would from standard ones.
- Thinking that looks right can still be wrong. The model’s visible thinking process is convincing but can contain subtle errors. Judge by the final answer, not the confidence of the thinking.
- Not always worth the extra cost. For most everyday tasks (writing, summarising, simple coding), standard models are faster and cheaper with no quality loss. Save reasoning models for genuinely hard problems.
- Chinese reasoning models are excellent but remain off-limits. DeepSeek R1 and QwQ-32B are impressive — and Chinese; the same privacy/political concerns apply as to other Chinese AI. Use Western alternatives.
- “Thinking” tokens are expensive in the API. If you’re building applications, carefully evaluate whether your use case benefits enough from reasoning to justify the cost and latency.
See also
- claude-models — includes Claude’s extended thinking capabilities
- gpt-models — includes o1/o3/o4 reasoning models
- prompt-engineering — prompting techniques that work with reasoning models
- hallucinations — reasoning models are better but not immune
- deep-research-mode — uses reasoning combined with web search
Sources
- OpenAI o1 System Card (2023)
- OpenAI o3 announcement and benchmarks (Dec 2024)
- Anthropic “Extended Thinking” documentation (2024–2026)
- Google DeepMind Gemini 2.5 Pro Technical Report (2025)
- AIME 2024 benchmark results from multiple labs
- GPQA Diamond benchmark (Rein et al., 2023)
- Scale AI HELM benchmarks (2024)