Tokens & context windows
Status: 🟩 COMPLETE Last updated: 2026-06-19 Plain-English tagline: Tokens are the chunks LLMs actually see. The context window is the model’s working memory. Almost every cost, every limit, every “the AI forgot what we were doing” trace back to these two ideas.
In plain English
When you send text to an LLM, the model doesn’t see “words.” It sees tokens — small chunks, sometimes whole words, sometimes parts of words, sometimes single characters, sometimes punctuation.
- “Hello world” → 2 tokens
- “The unbelievable” → 3 tokens (
"The"," un","believable") - “supercalifragilisticexpialidocious” → ~7 tokens (broken into manageable chunks)
- ”🎉” → 1 token (modern tokenizers handle emoji)
- Chinese, Arabic, Cyrillic text → often more tokens per character than English
Rule of thumb in English:
- ~4 characters ≈ 1 token
- ~¾ of a word ≈ 1 token
- 1 page of dense text ≈ 500 tokens
- 1 hour of a podcast transcript ≈ ~5,000 tokens
The context window is the maximum number of tokens the model can see in one go — the prompt + the conversation so far + the response it’s about to generate. For Claude Opus 4.7 in 2026, this is ~200,000 tokens (about 500 pages of text).
Past the window, the model can’t see anything. Things “fall off the back” of the conversation as new content comes in.
Why it matters
Three reasons:
- You pay per token. Every API call is billed in tokens — both input (what you send) and output (what the model produces). Long prompts cost more. Verbose outputs cost more.
- Limits are in tokens, not words. “200K context” means tokens, not English words. The conversion isn’t huge, but it matters.
- The model “forgets” past the window. In a very long conversation, early messages effectively disappear from the model’s view. This is why Claude Code uses memory files — to retain context across sessions instead of trying to keep everything in one conversation.
If you don’t think in tokens, you’ll be surprised by costs, confused by why long conversations degrade, and bad at structuring prompts for caching.
How tokenization actually works
Tokenizers are trained alongside the model. Common words get their own token; rare words or character sequences get broken into pieces.
- Common English words: usually 1 token each (
the,is,but,code) - Less common: 2–3 tokens (
tokenization→token,ization) - Made-up or non-English words: many tokens (
zxcvbnm→ 4+ tokens, character by character) - Code: variable.
functionis 1 token; a long identifier might be 4+ - Whitespace: usually attached to the next word (
" code"is often one token, distinct from"code")
You can experiment with Anthropic’s tokenizer playground (OpenAI’s tokenizer differs from Anthropic’s but the principle is the same) to see exactly how text becomes tokens.
Inputs vs outputs (and the cost difference)
API pricing usually splits input tokens (what you send) and output tokens (what the model produces).
| Token type | Typical price (Claude Opus, 2026) |
|---|---|
| Input | ~$15 per million tokens |
| Output | ~$75 per million tokens |
| Cached input | ~$1.50 per million tokens (10× discount) |
Output tokens are roughly 5× more expensive than input tokens. This is because generating output requires running the model billions of operations per token; processing input is cheaper.
Practical takeaways:
- Long outputs are expensive. A 10,000-token essay costs ~75¢. A summary costs ~7¢.
- Caching input is huge. If you’re sending the same system prompt + retrieved docs across many requests, marking them cacheable saves 90% of the input cost.
- For high-volume use, optimize prompts. Trim unnecessary instructions, use shorter examples, structure output to be concise.
The context window in detail
The window is shared between:
- System prompt (instructions, persona, rules) — typically 500–5000 tokens
- Conversation history (all prior messages in this chat) — grows over time
- Retrieved context (RAG docs, file contents) — varies
- The next message (the user’s input) — typically small
- Reserved for output (the model’s next response) — up to
max_tokensyou set
If the total approaches the limit, you have to do something:
- Truncate old messages
- Summarize the conversation history
- Move details to retrieval (RAG)
- Use a model with a larger window
Modern frontier models:
| Model | Context window |
|---|---|
| Claude Opus 4.7 / 4.8 | ~200,000 tokens (~500 pages) |
| Claude Sonnet 4.6 | ~200,000 tokens |
| Claude Haiku 4.5 | ~200,000 tokens |
| GPT-4-class | 128,000–1,000,000 tokens (varies) |
| Gemini Pro | 1,000,000–2,000,000 tokens |
A few experimental models offer multi-million-token windows. The practical limit isn’t just “what fits” — it’s also that quality degrades at the extreme end of long contexts (the “lost in the middle” effect — details in the middle of a very long prompt get less attention than the start or end).
Prompt caching — the production lever
Modern Claude supports prompt caching. You mark portions of your prompt as cacheable; the model stores the computed state of those tokens for a short period (typically 5 minutes); subsequent requests reuse the cache at ~10% of the normal token cost.
Best candidates for caching:
- System prompts (the same instructions every call)
- Long retrieved context (RAG docs you’ll query against multiple times)
- Conversation history in agents (the front of the conversation stays stable)
- Knowledge bases, codebases, or reference docs included in every call
Caching can reduce API costs by 70–95% for agent workloads. For Claude Code specifically, caching means subsequent turns in the same conversation pay much less for the increasingly long context.
A concrete example: where the tokens go in a Claude Code session
Here’s a rough breakdown of token consumption in a typical Claude Code conversation about a feature:
| Component | Approximate tokens |
|---|---|
| System prompt (built-in instructions for Claude Code) | ~3,500 |
| Your global CLAUDE.md | ~500 |
| Project CLAUDE.md + MEMORY.md index | ~600 |
| Tool definitions (Read, Edit, Bash, etc.) | ~5,000 |
| The first user message | ~50 |
| Claude’s first response (planning) | ~300 |
| Tool calls + results (file reads) | ~2,000–10,000 (varies wildly) |
| Subsequent user messages | ~50 each |
| Subsequent Claude responses | ~100–800 each |
| Total for a 30-minute session | ~30,000–100,000 tokens |
Caching helps enormously here — the system prompt, CLAUDE.md, and tool definitions stay the same across turns and get cached. The marginal cost of a follow-up turn is much lower than the initial one.
Common gotchas
-
The token count isn’t predictable from word count. Counting “200 words” doesn’t tell you the token count. Use a tokenizer for precision when it matters.
-
Non-English text uses more tokens per character. Chinese, Japanese, Arabic, Cyrillic — each character is often 2–3 tokens, vs English where 4 characters ≈ 1 token. Costs and limits scale accordingly.
-
Whitespace and formatting count. A JSON object with a lot of whitespace uses more tokens than the same data compact. For very token-sensitive uses, minify.
-
Code costs more than prose per visible character. Lots of unique identifiers, symbols, indentation. Be aware when sending big files.
-
Repeated content doesn’t compress. Sending the same paragraph twice costs twice the tokens. Use references / IDs / pointers if you can.
-
max_tokenslimits the output, not the input. Set this to control the maximum response length. Setting it too low can cut off responses; setting it too high doesn’t cost more (you pay for what’s generated, not what’s reserved). -
Streaming doesn’t reduce cost. You pay for the tokens whether you stream them or wait for the full response. Streaming just affects perceived latency.
-
“Lost in the middle.” Models attend less carefully to content in the middle of very long prompts than to the start and end. Put critical info near the start (system prompt) or end (most recent context).
-
Token counts vary slightly between Anthropic and OpenAI. Different tokenizers. A 1000-token prompt in Claude might be a 1100-token prompt in GPT. Don’t assume parity.
-
The context window is total, not per-side. A 200K window means input + output combined. If your prompt is already 195K, you have only 5K left for the response.
See also
- What is an LLM? 🟩
- How LLMs work 🟩
- Prompt engineering 🟩
- Temperature & sampling 🟩
- The Claude API 🟩 🟦
- Claude models 🟩 🟦
- Tool use 🟩
- Agents 🟩 — burn tokens in their loops
- Fine-tuning vs context 🟩
- Multimodal (vision, audio) 🟩 🟦 — images count as tokens too
- Embeddings 🟩 — a different way to compress text into a representation
- Glossary: Token
Sources
- Anthropic — Prompt Caching docs
- Anthropic — Models and pricing
- OpenAI Tokenizer — visual tool to see tokenization
- Lost in the Middle paper — research on long-context attention degradation
- tiktoken — OpenAI’s tokenizer, useful for estimating