Tokens & context windows

Status: 🟩 COMPLETE Last updated: 2026-06-19 Plain-English tagline: Tokens are the chunks LLMs actually see. The context window is the model’s working memory. Almost every cost, every limit, every “the AI forgot what we were doing” trace back to these two ideas.

In plain English

When you send text to an LLM, the model doesn’t see “words.” It sees tokens — small chunks, sometimes whole words, sometimes parts of words, sometimes single characters, sometimes punctuation.

“Hello world” → 2 tokens
“The unbelievable” → 3 tokens ("The", " un", "believable")
“supercalifragilisticexpialidocious” → ~7 tokens (broken into manageable chunks)
”🎉” → 1 token (modern tokenizers handle emoji)
Chinese, Arabic, Cyrillic text → often more tokens per character than English

Rule of thumb in English:

~4 characters ≈ 1 token
~¾ of a word ≈ 1 token
1 page of dense text ≈ 500 tokens
1 hour of a podcast transcript ≈ ~5,000 tokens

The context window is the maximum number of tokens the model can see in one go — the prompt + the conversation so far + the response it’s about to generate. For Claude Opus 4.7 in 2026, this is ~200,000 tokens (about 500 pages of text).

Past the window, the model can’t see anything. Things “fall off the back” of the conversation as new content comes in.

Why it matters

Three reasons:

You pay per token. Every API call is billed in tokens — both input (what you send) and output (what the model produces). Long prompts cost more. Verbose outputs cost more.
Limits are in tokens, not words. “200K context” means tokens, not English words. The conversion isn’t huge, but it matters.
The model “forgets” past the window. In a very long conversation, early messages effectively disappear from the model’s view. This is why Claude Code uses memory files — to retain context across sessions instead of trying to keep everything in one conversation.

If you don’t think in tokens, you’ll be surprised by costs, confused by why long conversations degrade, and bad at structuring prompts for caching.

How tokenization actually works

Tokenizers are trained alongside the model. Common words get their own token; rare words or character sequences get broken into pieces.

Common English words: usually 1 token each (the, is, but, code)
Less common: 2–3 tokens (tokenization → token, ization)
Made-up or non-English words: many tokens (zxcvbnm → 4+ tokens, character by character)
Code: variable. function is 1 token; a long identifier might be 4+
Whitespace: usually attached to the next word (" code" is often one token, distinct from "code")

You can experiment with Anthropic’s tokenizer playground (OpenAI’s tokenizer differs from Anthropic’s but the principle is the same) to see exactly how text becomes tokens.

Inputs vs outputs (and the cost difference)

API pricing usually splits input tokens (what you send) and output tokens (what the model produces).

Token type	Typical price (Claude Opus, 2026)
Input	~$15 per million tokens
Output	~$75 per million tokens
Cached input	~$1.50 per million tokens (10× discount)

Output tokens are roughly 5× more expensive than input tokens. This is because generating output requires running the model billions of operations per token; processing input is cheaper.

Practical takeaways:

Long outputs are expensive. A 10,000-token essay costs ~75¢. A summary costs ~7¢.
Caching input is huge. If you’re sending the same system prompt + retrieved docs across many requests, marking them cacheable saves 90% of the input cost.
For high-volume use, optimize prompts. Trim unnecessary instructions, use shorter examples, structure output to be concise.

The context window in detail

The window is shared between:

System prompt (instructions, persona, rules) — typically 500–5000 tokens
Conversation history (all prior messages in this chat) — grows over time
Retrieved context (RAG docs, file contents) — varies
The next message (the user’s input) — typically small
Reserved for output (the model’s next response) — up to max_tokens you set

If the total approaches the limit, you have to do something:

Truncate old messages
Summarize the conversation history
Move details to retrieval (RAG)
Use a model with a larger window

Modern frontier models:

Model	Context window
Claude Opus 4.7 / 4.8	~200,000 tokens (~500 pages)
Claude Sonnet 4.6	~200,000 tokens
Claude Haiku 4.5	~200,000 tokens
GPT-4-class	128,000–1,000,000 tokens (varies)
Gemini Pro	1,000,000–2,000,000 tokens

A few experimental models offer multi-million-token windows. The practical limit isn’t just “what fits” — it’s also that quality degrades at the extreme end of long contexts (the “lost in the middle” effect — details in the middle of a very long prompt get less attention than the start or end).

Prompt caching — the production lever

Modern Claude supports prompt caching. You mark portions of your prompt as cacheable; the model stores the computed state of those tokens for a short period (typically 5 minutes); subsequent requests reuse the cache at ~10% of the normal token cost.

Best candidates for caching:

System prompts (the same instructions every call)
Long retrieved context (RAG docs you’ll query against multiple times)
Conversation history in agents (the front of the conversation stays stable)
Knowledge bases, codebases, or reference docs included in every call

Caching can reduce API costs by 70–95% for agent workloads. For Claude Code specifically, caching means subsequent turns in the same conversation pay much less for the increasingly long context.

A concrete example: where the tokens go in a Claude Code session

Here’s a rough breakdown of token consumption in a typical Claude Code conversation about a feature:

Component	Approximate tokens
System prompt (built-in instructions for Claude Code)	~3,500
Your global CLAUDE.md	~500
Project CLAUDE.md + MEMORY.md index	~600
Tool definitions (Read, Edit, Bash, etc.)	~5,000
The first user message	~50
Claude’s first response (planning)	~300
Tool calls + results (file reads)	~2,000–10,000 (varies wildly)
Subsequent user messages	~50 each
Subsequent Claude responses	~100–800 each
Total for a 30-minute session	~30,000–100,000 tokens

Caching helps enormously here — the system prompt, CLAUDE.md, and tool definitions stay the same across turns and get cached. The marginal cost of a follow-up turn is much lower than the initial one.

Common gotchas

The token count isn’t predictable from word count. Counting “200 words” doesn’t tell you the token count. Use a tokenizer for precision when it matters.
Non-English text uses more tokens per character. Chinese, Japanese, Arabic, Cyrillic — each character is often 2–3 tokens, vs English where 4 characters ≈ 1 token. Costs and limits scale accordingly.
Whitespace and formatting count. A JSON object with a lot of whitespace uses more tokens than the same data compact. For very token-sensitive uses, minify.
Code costs more than prose per visible character. Lots of unique identifiers, symbols, indentation. Be aware when sending big files.
Repeated content doesn’t compress. Sending the same paragraph twice costs twice the tokens. Use references / IDs / pointers if you can.
max_tokens limits the output, not the input. Set this to control the maximum response length. Setting it too low can cut off responses; setting it too high doesn’t cost more (you pay for what’s generated, not what’s reserved).
Streaming doesn’t reduce cost. You pay for the tokens whether you stream them or wait for the full response. Streaming just affects perceived latency.
“Lost in the middle.” Models attend less carefully to content in the middle of very long prompts than to the start and end. Put critical info near the start (system prompt) or end (most recent context).
Token counts vary slightly between Anthropic and OpenAI. Different tokenizers. A 1000-token prompt in Claude might be a 1100-token prompt in GPT. Don’t assume parity.
The context window is total, not per-side. A 200K window means input + output combined. If your prompt is already 195K, you have only 5K left for the response.

Sources

Anthropic — Prompt Caching docs
Anthropic — Models and pricing
OpenAI Tokenizer — visual tool to see tokenization
Lost in the Middle paper — research on long-context attention degradation
tiktoken — OpenAI’s tokenizer, useful for estimating

Tech & AI, Explained

Explorer

tokens-and-context