Reasoning Models — AI That Thinks Before It Answers

Status: 🟩 COMPLETE 🟦 LIVING Tags: reasoning-models, chain-of-thought, o1, o3, claude-thinking, deep-reasoning, AI-inference

What it is

“Reasoning models” are a category of AI language models that are specifically designed to think through problems step by step before giving a final answer, rather than generating responses immediately. This “thinking” happens inside the model and takes additional time — often seconds to minutes — but produces significantly better results on complex problems.

The breakthrough came in September 2023 when OpenAI released “o1” — a model that could reason through difficult math, science, and logical problems far better than GPT-4 by using an internal chain-of-thought process. This spawned a new generation of reasoning models across all the major labs.

Why this matters: a plain English explanation

Standard language models (like GPT-4 or Claude 3.5 Sonnet) generate text token by token — the next word follows the previous one, using patterns from training. This works brilliantly for most tasks but can fail on problems that require:

Multi-step logical reasoning (“if A then B, and if B then C, what follows?”)
Mathematical calculation with many steps
Complex code debugging
Scientific problems requiring careful analysis
Planning tasks with multiple constraints

Reasoning models take a different approach: Before generating the final answer, they “think” — generating an internal scratchpad of intermediate reasoning. This might be: “Let me break this problem down. First, I need to figure out X. Given X, I can determine Y. But wait, Y contradicts Z, so let me reconsider…”

This internal thinking is usually shown to the user (as a collapsible “Thinking…” section) and then a final, more accurate response follows.

It’s the difference between:

Standard model: “The answer is 42.” (immediate, sometimes wrong)
Reasoning model: “Let me work through this step by step. [3 minutes of thinking] The answer is 42, because…” (slower, significantly more reliable on complex problems)

The major reasoning models (mid-2026)

OpenAI

Model	Notes
o1 (Sep 2023)	The breakthrough; outperformed GPT-4 on STEM problems dramatically
o1-mini	Smaller, faster, cheaper reasoning model
o3 (Dec 2024)	Major improvement; new high on most benchmarks
o3-mini	Efficient version
o4-mini	Mid-2025; best efficiency/performance ratio
o3 Pro	Top of the range; slower; most powerful

Anthropic

Model	Notes
Claude 3.5 Sonnet (extended thinking)	“Extended thinking” mode; shows thinking tokens
Claude 3.7 Sonnet	2025; deeper thinking capability
Claude 4 Opus	2026; most powerful Claude; extensive reasoning

Google DeepMind

Model	Notes
Gemini 2.0 Flash Thinking	Thinking mode in Flash
Gemini 2.5 Pro	Deep Think mode; strong on math and science
Gemini 2.5 Flash	Budget reasoning option

Other notable reasoning models

Model	Provider	Notes
DeepSeek R1	🇨🇳 ⛔ DeepSeek	Excellent reasoning quality — but Chinese; avoid
QwQ-32B	🇨🇳 ⛔ Alibaba	Good reasoning — but Chinese; avoid
Mistral Large 2	🇫🇷 Mistral	Extended reasoning capabilities
Llama 4	🇺🇸 Meta	Open-weights reasoning capabilities

When to use reasoning models vs standard models

Use reasoning models when:

Complex maths: Multi-step calculations, algebra, calculus, statistics
Difficult coding: Debugging complex issues, architecting systems, writing tricky algorithms
Scientific problems: Chemistry, physics, biology problems requiring step-by-step analysis
Logic puzzles: Problems with multiple constraints and conditions
Research analysis: Synthesising information from multiple sources with nuanced conclusions
Legal/medical analysis: Complex document interpretation requiring careful reasoning
Strategic planning: Multi-factor decisions with many interdependencies

Use standard models when:

Writing: Emails, essays, creative content, summaries — standard models are excellent and much faster
Simple Q&A: Factual questions, explanations, quick lookups
Code generation for simple tasks: Straightforward functions, scripts
Conversation: Chat interfaces where speed matters
High-volume tasks: API tasks where cost matters; reasoning models are more expensive

Rule of thumb: If a smart human would need more than 30 seconds of careful thought to solve the problem, try a reasoning model.

How thinking tokens work

Reasoning models have a concept of thinking tokens — the computational budget for the internal thinking process. You typically can:

Set thinking budget: Allow more thinking for harder problems (more accurate but slower and more expensive); less thinking for simpler ones
See the thinking: Most implementations show you the model’s intermediate reasoning (often in a collapsible section)
Compare thinking quality: Longer, more systematic thinking generally correlates with better answers, but not always

Thinking is billed differently from output tokens in most API implementations. Check pricing carefully when building applications that use extended thinking.

Benchmarks and performance

On the AIME (American Invitational Mathematics Examination) — a notoriously difficult high school competition:

GPT-4 (before reasoning): ~13% accuracy
o1: ~83% accuracy
o3 and Claude 4 Opus: >90% accuracy on various AIME problems

On HumanEval (coding benchmark):

Standard models: 85–90%
Reasoning models: 95–99%

On PhD-level science (GPQA Diamond):

o3: ~80% (human PhD students average ~70%)
Claude 3.5 Sonnet extended thinking: ~75%

These are significant improvements that unlock use cases previously impossible for AI.

Cost considerations

Reasoning models are significantly more expensive than standard models:

Model type	Approximate cost (output tokens)
Standard (e.g., GPT-4o Mini)	~$0.60/million
Standard frontier (GPT-4o, Claude 3.5 Sonnet)	~$15/million
Reasoning (o3-mini, Claude thinking)	~$60/million
Heavy reasoning (o3, o4, Claude 4 Opus)	~$150–600/million

For interactive consumer use: cost is abstracted by subscription. For API/developer use: budget carefully.

The speed tradeoff

Standard models respond in 1–5 seconds. Reasoning models may take:

o3-mini / Gemini Flash Thinking: 5–30 seconds
o3 / Claude extended thinking: 30 seconds – 3 minutes
Most intensive reasoning: Up to 5+ minutes for very hard problems

This is acceptable for asynchronous research tasks but noticeable in interactive chat.

Gotchas

Reasoning doesn’t eliminate hallucinations. Better reasoning reduces errors but doesn’t eliminate them. Verify factual claims from reasoning models just as you would from standard ones.
Thinking that looks right can still be wrong. The model’s visible thinking process is convincing but can contain subtle errors. Judge by the final answer, not the confidence of the thinking.
Not always worth the extra cost. For most everyday tasks (writing, summarising, simple coding), standard models are faster and cheaper with no quality loss. Save reasoning models for genuinely hard problems.
Chinese reasoning models are excellent but remain off-limits. DeepSeek R1 and QwQ-32B are impressive — and Chinese; the same privacy/political concerns apply as to other Chinese AI. Use Western alternatives.
“Thinking” tokens are expensive in the API. If you’re building applications, carefully evaluate whether your use case benefits enough from reasoning to justify the cost and latency.

Sources

OpenAI o1 System Card (2023)
OpenAI o3 announcement and benchmarks (Dec 2024)
Anthropic “Extended Thinking” documentation (2024–2026)
Google DeepMind Gemini 2.5 Pro Technical Report (2025)
AIME 2024 benchmark results from multiple labs
GPQA Diamond benchmark (Rein et al., 2023)
Scale AI HELM benchmarks (2024)

Tech & AI, Explained

Explorer

reasoning-models