What is an LLM?

Status: 🟩 COMPLETE Last updated: 2026-06-19 Plain-English tagline: A statistical engine for predicting the next chunk of text — trained on so much data that doing this prediction extremely well ends up looking like intelligence.


In plain English

A Large Language Model (LLM) is a very large mathematical function that takes text in and produces text out. Specifically, it takes a sequence of text (your prompt) and predicts what should come next, one chunk at a time.

That’s it. There is no “thinking” happening inside in the human sense. There is no understanding, no reasoning the way you reason, no internal model of the world the way humans have. There is prediction. Very, very good prediction, at very large scale.

The “trick” is that prediction at this scale starts to look like reasoning. If a model has read enough text — books, Wikipedia, code, papers, conversations, every webpage on the internet — then “what should come next?” given a question like “explain photosynthesis in 3 paragraphs” produces a coherent 3-paragraph explanation. Not because the model understands photosynthesis. Because the patterns of how good explanations of photosynthesis look are baked into it from training.

Reasoning emerges from prediction, the way water emerges from H2O. Neither molecule of hydrogen “knows” how to be wet. But put enough of them together with oxygen, and you get water. LLMs are the same: enough good prediction at enough scale, and what comes out behaves like reasoning — even though no individual computation inside is reasoning.


Why it matters

This is the single most important concept to get right about LLMs. If you have a wrong mental model — if you think of them as “AI that understands” — you’ll be confused by their failures and over-impressed by their successes. If you have the right model — “very good next-token prediction” — both their power and their limits make sense.

Specifically:

  • Their power makes sense: if good prediction produces good output for “write a Python function that sorts a list,” then they can write code.
  • Their limits make sense: they hallucinate (predict plausible nonsense) because they’re optimizing for “what looks right,” not “what is true.” They struggle with exact arithmetic because nothing in next-token prediction enforces correctness. They have no memory of past conversations because each call is independent.

Holding this mental model also makes you better at using them. You learn to give the kinds of inputs that produce the kinds of outputs you want — because you understand the machine is pattern-matching, not deducing.


How the prediction actually works

Tokens, not words

The LLM doesn’t see your text as words. It sees it as tokens — small chunks that are often parts of words, common short words whole, or punctuation. “The unbelievable” might tokenize as [“The”, ” un”, “believable”]. Three tokens, not two words. See Tokens & context windows for the full picture.

One token at a time

When you ask Claude “What’s the capital of France?”, the model:

  1. Tokenizes your input
  2. Runs it through billions of mathematical operations
  3. Outputs a probability distribution over the next possible token. Maybe: "The" (40%), "Paris" (25%), "France" (10%), and so on for every possible token.
  4. Picks one token according to a sampling strategy (see Temperature & sampling)
  5. Adds that token to the context
  6. Repeats step 2

So the model outputs "The", then " capital", then " of", then " France", then " is", then " Paris", then ".", then a special “end of turn” token that says “I’m done.”

The whole answer is produced one token at a time, each one influenced by everything that came before (including the model’s own previous output).

Where the magic happens

The “billions of mathematical operations” in step 2 are the model’s parameters — billions of numbers (weights and biases) that were tuned during training. These numbers encode everything the model “knows.” There’s no database of facts; there’s no lookup table. There’s just an enormous matrix multiplication that takes your tokens in and produces probabilities for the next one.

Modern LLMs have hundreds of billions of parameters. The model files themselves are hundreds of gigabytes. Running inference on them requires specialized hardware (mostly NVIDIA GPUs).


Training — how the model gets its weights

Training is the process of figuring out what those billions of parameters should be. Two main phases:

Phase 1: Pre-training

The model is shown enormous amounts of text — trillions of tokens, basically the public internet plus books, code, and everything else that’s been licensed or scraped. For each chunk of text, the model is asked: “given the previous tokens, what’s the next one?” If the model gets it wrong, the weights are nudged in the direction that would have gotten it right.

Repeat this trillions of times, on thousands of GPUs in parallel, for weeks or months. At the end, you have a “base model” that can predict text well, but doesn’t necessarily follow instructions or behave helpfully.

Phase 2: Post-training

The base model is then fine-tuned to be useful as an assistant. This involves:

  • Supervised fine-tuning (SFT): showing it examples of helpful assistant responses
  • Reinforcement learning from human feedback (RLHF): humans rate model outputs; the model is tuned to produce more of the high-rated kind
  • Constitutional AI / RLAIF (Anthropic’s approach): instead of all-human feedback, the model is trained against a “constitution” of principles, with AI feedback supplementing human feedback

The result is a model that follows instructions, refuses harmful requests, admits when it’s unsure, and behaves like an “assistant.”


What LLMs are surprisingly good at

Things that work astonishingly well, given the next-token prediction frame:

  • Language tasks across many languages. Translation, summarization, rephrasing.
  • Code. Writing, explaining, debugging. Code is heavily represented in training data, and code has clear right answers.
  • Pattern matching. “Make this list of names into a TSV in this format.” Easy.
  • Conversational structure. Models are well-tuned for dialog turns.
  • Explanation. Models excel at “explain X like I’m a Y” because their training includes many such examples.
  • Drafting. Outlines, first drafts, brainstorming.

What LLMs are bad at (and why)

The failure modes are predictable from the architecture:

  • Exact arithmetic. Multiplying two seven-digit numbers correctly. Nothing in next-token prediction enforces arithmetic correctness. Workaround: give the model a calculator tool.
  • Strict logical chains. Long deductive proofs are fragile. The model can lose track of a constraint several steps in.
  • Knowing what they don’t know. Models hallucinate — produce confident-sounding nonsense. They were never trained to say “I’m not sure”; they were trained to produce plausible text.
  • Counting. “How many words have I written so far?” — surprisingly hard.
  • Strict instruction following over very long contexts. Detail at the bottom of a 100K-token prompt can get diluted.
  • Real-time information. A model’s training cuts off at some point; it doesn’t know what happened after. Tools (web search) fix this.
  • Stable identity / memory across sessions. Each API call is independent unless you build memory yourself.

A concrete example: predicting one token

Imagine the prompt: "The capital of France is "

The model runs its math and produces a probability distribution over the next token. It might look something like:

" Paris"    → 92%
" the"      → 4%
" a"        → 1%
" Madrid"   → 0.1%
" the largest" → 0.05%
... (every other possible token gets some tiny probability)

With temperature 0 (greedy / deterministic), the model picks " Paris" — the highest probability. Output complete.

With temperature 0.7 (some randomness), the model picks according to the probabilities — still " Paris" most of the time, occasionally something else. With very high temperature, the model might pick " Madrid" and continue from there, producing something coherent but wrong.

This is the whole story, applied recursively token by token, billions of parameters’ worth of times.


The most common misconceptions

”It looked stuff up on the internet”

It didn’t. The model has no internet access unless explicitly given a tool for it. Everything it “knows” is encoded in its weights at training time. That includes facts that may now be outdated.

”It’s just a stochastic parrot / fancy autocomplete”

The “parrot” framing captures the architecture but understates the emergent behavior. Yes, the mechanism is next-token prediction. But the behavior at scale includes: composing, reasoning, planning, code writing, math (with errors), translation, explanation, etc. These behaviors are real, useful, and not what a parrot does. The mechanism is humble; the result is not.

”It thinks”

Not in the human sense. There’s no inner narrative, no emotions, no subjective experience (as far as anyone can show). It performs sophisticated text manipulation that produces outputs that look thoughtful. Whether that constitutes “thinking” is a philosophical debate, but it’s not the same kind of thinking you do.

”Bigger models are always better”

For a while, yes. Going from 1B to 10B to 100B parameters produced step-changes. But returns are diminishing, and the field is finding that smarter training, longer thinking time at inference, better data, and tool use can outperform brute parameter scaling. The “frontier” in 2026 is a mix of bigger models and better techniques.

”Open-source / open-weight models are the future / not the future”

Both. The frontier is closed (Claude, GPT, Gemini). Open-weight models (Llama, Mistral, Qwen) are smaller but rapidly catching up and dominate self-hosted / on-device use cases. Both ecosystems are healthy and likely to coexist.


Common gotchas

  • Don’t trust facts blindly. Models hallucinate, especially for niche topics, recent events, or specific numbers. Verify anything important.

  • Same prompt, different answer. Unless you set temperature to 0, repeated runs give varied outputs. This is by design.

  • The context window is finite. Past a model’s limit (~200K tokens for Claude Opus), things fall out of view. Long conversations effectively forget early turns.

  • It’s not always right because it’s confident. The model’s tone is shaped by training to sound assured. Confidence in output is uncorrelated with correctness.

  • “As an AI language model…” disclaimers are largely a habit baked in during fine-tuning, not a meaningful signal.

  • Cost scales with tokens, not words. Verbose prompts cost more. Verbose outputs cost more. Compressing your prompts is a real lever.

  • They don’t learn from your conversation. Unless you build memory, the model has no recollection of past chats.

  • Bias. Models reflect their training data. They have implicit views, blind spots, cultural defaults. Pretending otherwise is naive.


See also


Sources