Reasoning Models — AI That Thinks Before It Answers

Status: 🟩 COMPLETE 🟦 LIVING Tags: reasoning-models, chain-of-thought, o1, o3, claude-thinking, deep-reasoning, AI-inference


What it is

“Reasoning models” are a category of AI language models that are specifically designed to think through problems step by step before giving a final answer, rather than generating responses immediately. This “thinking” happens inside the model and takes additional time — often seconds to minutes — but produces significantly better results on complex problems.

The breakthrough came in September 2023 when OpenAI released “o1” — a model that could reason through difficult math, science, and logical problems far better than GPT-4 by using an internal chain-of-thought process. This spawned a new generation of reasoning models across all the major labs.


Why this matters: a plain English explanation

Standard language models (like GPT-4 or Claude 3.5 Sonnet) generate text token by token — the next word follows the previous one, using patterns from training. This works brilliantly for most tasks but can fail on problems that require:

  • Multi-step logical reasoning (“if A then B, and if B then C, what follows?”)
  • Mathematical calculation with many steps
  • Complex code debugging
  • Scientific problems requiring careful analysis
  • Planning tasks with multiple constraints

Reasoning models take a different approach: Before generating the final answer, they “think” — generating an internal scratchpad of intermediate reasoning. This might be: “Let me break this problem down. First, I need to figure out X. Given X, I can determine Y. But wait, Y contradicts Z, so let me reconsider…”

This internal thinking is usually shown to the user (as a collapsible “Thinking…” section) and then a final, more accurate response follows.

It’s the difference between:

  • Standard model: “The answer is 42.” (immediate, sometimes wrong)
  • Reasoning model: “Let me work through this step by step. [3 minutes of thinking] The answer is 42, because…” (slower, significantly more reliable on complex problems)

The major reasoning models (mid-2026)

OpenAI

ModelNotes
o1 (Sep 2023)The breakthrough; outperformed GPT-4 on STEM problems dramatically
o1-miniSmaller, faster, cheaper reasoning model
o3 (Dec 2024)Major improvement; new high on most benchmarks
o3-miniEfficient version
o4-miniMid-2025; best efficiency/performance ratio
o3 ProTop of the range; slower; most powerful

Anthropic

ModelNotes
Claude 3.5 Sonnet (extended thinking)“Extended thinking” mode; shows thinking tokens
Claude 3.7 Sonnet2025; deeper thinking capability
Claude 4 Opus2026; most powerful Claude; extensive reasoning

Google DeepMind

ModelNotes
Gemini 2.0 Flash ThinkingThinking mode in Flash
Gemini 2.5 ProDeep Think mode; strong on math and science
Gemini 2.5 FlashBudget reasoning option

Other notable reasoning models

ModelProviderNotes
DeepSeek R1🇨🇳 ⛔ DeepSeekExcellent reasoning quality — but Chinese; avoid
QwQ-32B🇨🇳 ⛔ AlibabaGood reasoning — but Chinese; avoid
Mistral Large 2🇫🇷 MistralExtended reasoning capabilities
Llama 4🇺🇸 MetaOpen-weights reasoning capabilities

When to use reasoning models vs standard models

Use reasoning models when:

  • Complex maths: Multi-step calculations, algebra, calculus, statistics
  • Difficult coding: Debugging complex issues, architecting systems, writing tricky algorithms
  • Scientific problems: Chemistry, physics, biology problems requiring step-by-step analysis
  • Logic puzzles: Problems with multiple constraints and conditions
  • Research analysis: Synthesising information from multiple sources with nuanced conclusions
  • Legal/medical analysis: Complex document interpretation requiring careful reasoning
  • Strategic planning: Multi-factor decisions with many interdependencies

Use standard models when:

  • Writing: Emails, essays, creative content, summaries — standard models are excellent and much faster
  • Simple Q&A: Factual questions, explanations, quick lookups
  • Code generation for simple tasks: Straightforward functions, scripts
  • Conversation: Chat interfaces where speed matters
  • High-volume tasks: API tasks where cost matters; reasoning models are more expensive

Rule of thumb: If a smart human would need more than 30 seconds of careful thought to solve the problem, try a reasoning model.


How thinking tokens work

Reasoning models have a concept of thinking tokens — the computational budget for the internal thinking process. You typically can:

  • Set thinking budget: Allow more thinking for harder problems (more accurate but slower and more expensive); less thinking for simpler ones
  • See the thinking: Most implementations show you the model’s intermediate reasoning (often in a collapsible section)
  • Compare thinking quality: Longer, more systematic thinking generally correlates with better answers, but not always

Thinking is billed differently from output tokens in most API implementations. Check pricing carefully when building applications that use extended thinking.


Benchmarks and performance

On the AIME (American Invitational Mathematics Examination) — a notoriously difficult high school competition:

  • GPT-4 (before reasoning): ~13% accuracy
  • o1: ~83% accuracy
  • o3 and Claude 4 Opus: >90% accuracy on various AIME problems

On HumanEval (coding benchmark):

  • Standard models: 85–90%
  • Reasoning models: 95–99%

On PhD-level science (GPQA Diamond):

  • o3: ~80% (human PhD students average ~70%)
  • Claude 3.5 Sonnet extended thinking: ~75%

These are significant improvements that unlock use cases previously impossible for AI.


Cost considerations

Reasoning models are significantly more expensive than standard models:

Model typeApproximate cost (output tokens)
Standard (e.g., GPT-4o Mini)~$0.60/million
Standard frontier (GPT-4o, Claude 3.5 Sonnet)~$15/million
Reasoning (o3-mini, Claude thinking)~$60/million
Heavy reasoning (o3, o4, Claude 4 Opus)~$150–600/million

For interactive consumer use: cost is abstracted by subscription. For API/developer use: budget carefully.


The speed tradeoff

Standard models respond in 1–5 seconds. Reasoning models may take:

  • o3-mini / Gemini Flash Thinking: 5–30 seconds
  • o3 / Claude extended thinking: 30 seconds – 3 minutes
  • Most intensive reasoning: Up to 5+ minutes for very hard problems

This is acceptable for asynchronous research tasks but noticeable in interactive chat.


Gotchas

  • Reasoning doesn’t eliminate hallucinations. Better reasoning reduces errors but doesn’t eliminate them. Verify factual claims from reasoning models just as you would from standard ones.
  • Thinking that looks right can still be wrong. The model’s visible thinking process is convincing but can contain subtle errors. Judge by the final answer, not the confidence of the thinking.
  • Not always worth the extra cost. For most everyday tasks (writing, summarising, simple coding), standard models are faster and cheaper with no quality loss. Save reasoning models for genuinely hard problems.
  • Chinese reasoning models are excellent but remain off-limits. DeepSeek R1 and QwQ-32B are impressive — and Chinese; the same privacy/political concerns apply as to other Chinese AI. Use Western alternatives.
  • “Thinking” tokens are expensive in the API. If you’re building applications, carefully evaluate whether your use case benefits enough from reasoning to justify the cost and latency.

See also


Sources

  • OpenAI o1 System Card (2023)
  • OpenAI o3 announcement and benchmarks (Dec 2024)
  • Anthropic “Extended Thinking” documentation (2024–2026)
  • Google DeepMind Gemini 2.5 Pro Technical Report (2025)
  • AIME 2024 benchmark results from multiple labs
  • GPQA Diamond benchmark (Rein et al., 2023)
  • Scale AI HELM benchmarks (2024)