How LLMs work — under the hood

Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: The transformer architecture, in plain English. Optional reading (you don’t need it to use LLMs), but it makes everything else click.


In plain English

Behind every modern LLM is a single architectural innovation called the transformer, introduced in a 2017 paper called “Attention Is All You Need.” Before transformers, the AI field had been trying for decades to make machines understand language; the results were modest. After transformers, scaling them up to billions of parameters produced the LLMs we have today.

The transformer’s key trick is called attention — a mechanism that lets each token in the input “look at” every other token, weighted by relevance, when figuring out what to do next. This sounds boring; in practice it’s why LLMs can suddenly handle long-range dependencies (the subject of a sentence in paragraph one referenced by paragraph five), understand context, and produce coherent multi-page output.

You don’t need to understand transformers to use LLMs effectively. But once you do, several things click:

  • Why LLMs have a context window
  • Why prompt structure affects outputs so much
  • Why some tasks are surprisingly easy and others surprisingly hard
  • What “training” actually does
  • Why bigger models tend to be better at reasoning

This entry is a tour: not deep, but deep enough to make your intuitions reliable.


Why it matters

Three reasons:

  1. You’ll understand the failures. Why models hallucinate. Why they can’t count. Why long contexts get fuzzy. Each failure mode traces back to the architecture.
  2. You’ll write better prompts. When you understand how the model processes input, you write prompts that work with the grain.
  3. You’ll read AI news critically. “GPT-5 has emergent reasoning capabilities!” — knowing the architecture lets you guess whether that’s plausible or hype.

Step 1: Tokenization (input → numbers)

The model can’t read text. It reads numbers. So step 1 of every LLM call: tokenize your input.

A tokenizer splits text into chunks (tokens) and maps each one to an integer ID. “Hello world” might become [15043, 1917]. See Tokens & context windows for the full picture.
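To make the mapping concrete, here is a toy sketch of the text-to-IDs step. It assumes a tiny made-up vocabulary (TOY_VOCAB) and splits on whitespace; real tokenizers learn subword pieces from data and have vocabularies of tens of thousands of entries, so the IDs below are illustrative only.

```python
# Toy word-level tokenizer. The vocabulary below is invented for illustration;
# real LLM tokenizers learn subword units from data.
TOY_VOCAB = {"<unk>": 0, "hello": 1, "world": 2, "dog": 3, "bites": 4, "man": 5}


def tokenize(text: str) -> list[int]:
    """Split on whitespace and map each lowercased word to its integer ID."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]


def detokenize(ids: list[int]) -> str:
    """Map IDs back to words (the inverse lookup)."""
    id_to_word = {token_id: word for word, token_id in TOY_VOCAB.items()}
    return " ".join(id_to_word[token_id] for token_id in ids)


print(tokenize("Hello world"))    # [1, 2]
print(tokenize("Dog bites man"))  # [3, 4, 5]
print(tokenize("Man bites dog"))  # [5, 4, 3]
```

Words outside the vocabulary fall back to the `<unk>` ID here; subword tokenizers avoid that by breaking unknown words into smaller known pieces.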

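Each token ID is then converted to an embedding: a vector of a few thousand numbers (4096 is a common size; the exact width varies by model). The embeddings are looked up from a learned matrix with one row per vocabulary entry. Two semantically similar tokens (“happy” and “joyful”) will tend to have embeddings that are mathematically close, for example by cosine similarity.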

After this preprocessing, the input is a list of vectors. That’s what flows through the rest of the model.
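The lookup and the idea of “mathematically close” can be sketched as follows. The 4-number vectors and the three-word table are hypothetical values chosen by hand for the example; real embeddings are learned during training and are far wider.

```python
import math

# Hypothetical 4-dimensional embedding table, indexed by token ID.
# Real tables are learned and have thousands of columns.
EMBEDDINGS = [
    [0.9, 0.8, 0.1, 0.0],  # ID 0: "happy"
    [0.8, 0.9, 0.2, 0.1],  # ID 1: "joyful"
    [0.0, 0.1, 0.9, 0.8],  # ID 2: "tractor"
]


def embed(token_ids):
    """Look up one row of the table per token ID."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


happy, joyful, tractor = embed([0, 1, 2])
# The related pair scores much higher than the unrelated pair.
print(cosine_similarity(happy, joyful) > cosine_similarity(happy, tractor))  # True
```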


Step 2: Positional encoding

The model needs to know token order. “Dog bites man” and “Man bites dog” use the same tokens; meaning depends on order.

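To inject order info, each token’s embedding gets a positional encoding: a vector that encodes “I’m the 1st token / 2nd token / 137th token.” The original transformer simply added this vector to the embedding. Many modern models instead use rotary position embeddings (RoPE), which rotate the query and key vectors inside attention by an angle that depends on position, or other clever schemes; the principle is the same.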

After this step, each input position has a single combined vector containing both “what is this token” and “where is it in the sequence.”
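One concrete scheme is the fixed sinusoidal encoding from the original 2017 paper, sketched below under the assumption that the encoding is simply added to the embedding. It is not RoPE, which works differently, and the 4-number vectors are only for illustration.

```python
import math


def positional_encoding(position: int, dim: int) -> list[float]:
    """Sinusoidal encoding: even indices use sin, odd indices use cos."""
    encoding = []
    for i in range(dim):
        # Each pair of dimensions (2k, 2k+1) shares one wavelength.
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the encoding for position p to the embedding at position p."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, positional_encoding(position, dim))]
        for position, vector in enumerate(embeddings)
    ]


# The same token embedding placed at positions 0 and 1.
same_token = [0.5, 0.5, 0.5, 0.5]
combined = add_position([same_token, same_token])
print(combined[0])                 # [0.5, 1.5, 0.5, 1.5]
print(combined[0] != combined[1])  # True: same token, different position
```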


Step 3: The transformer block (the heart of it)

Now the magic happens. The input vectors flow through a stack of transformer blocks — typically 30 to 100 of them in a modern LLM. Each block does two main things in sequence:

Self-attention: each token looks at every other token

For each token position, the model computes three things from its current vector:

  • Query (Q) — “what am I looking for?”
  • Key (K) — “what do I represent?”
  • Value (V) — “what’s my actual content?”

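The model then computes attention scores: how much should token i’s query care about token j’s key? Each score is the dot product of the query and the key, scaled down by the square root of the key length. This produces a matrix of scores, every token vs every other token. (In GPT-style models that generate text left to right, a causal mask hides later tokens, so each token actually attends only to itself and the tokens before it.)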

The scores are softmaxed (turned into probabilities that sum to 1), and used to weight the values: each token’s output is a weighted sum of all tokens’ values.

In plain English: every token gets to “look at” every other token and pull in relevant information. The pronoun “she” in sentence 5 can attend to a name from sentence 1, pulling its meaning. The verb at the end of a sentence can attend to its subject at the start.
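The score, softmax, and weighted-sum steps can be sketched in plain Python. This is a minimal single-head version that takes the query, key, and value vectors as given; in a real model they come from multiplying each token’s vector by learned matrices, which is left out here. The `causal` flag is an addition for illustration: it restricts each token to itself and earlier tokens.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, query in enumerate(queries):
        # Score token i's query against every key (only keys 0..i if causal).
        visible = range(i + 1) if causal else range(len(keys))
        scores = [sum(a * b for a, b in zip(query, keys[j])) / scale for j in visible]
        weights = softmax(scores)
        # The output is the weighted sum of the visible tokens' value vectors.
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ])
    return outputs


# Two tokens with hand-picked vectors (hypothetical values).
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

full = attention(queries, keys, values)
print([round(x, 1) for x in full[0]])  # [6.7, 3.3]: mostly token 0's value
```

Token 0’s query matches its own key best, so its output leans toward its own value while still mixing in some of token 1’s.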

This is the main reason transformers are so good at language.

Multi-head attention: instead of one attention computation, the model does many in parallel (e.g. 32 “heads”), each learning to attend to different kinds of relationships. One head might learn to attend pronouns to their referents; another might attend verbs to their objects.
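One common way to implement the heads, assumed here, is to slice each token’s projected vector into equal pieces, run attention separately on each piece, and concatenate the results. The sketch below shows only the slicing and merging; the per-head attention is replaced by a placeholder, and the function names are hypothetical.

```python
def split_heads(vector, num_heads):
    """Slice one vector into num_heads equal-sized pieces."""
    if len(vector) % num_heads != 0:
        raise ValueError("vector length must be divisible by num_heads")
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def merge_heads(head_outputs):
    """Concatenate the per-head outputs back into one vector."""
    return [x for head in head_outputs for x in head]


vector = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
heads = split_heads(vector, num_heads=4)
print(heads)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Placeholder for "run attention on each head independently".
processed = [list(head) for head in heads]
print(merge_heads(processed) == vector)  # True
```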

Feed-forward network: each token does its own thinking

After attention, each token’s vector passes through a feed-forward network (a small dense neural network) that operates on each token independently. This adds “depth” — letting the model compute non-trivial transformations on each token’s information.
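A minimal sketch of such a network: expand the vector to a wider hidden layer, apply a nonlinearity, and project back down. ReLU is assumed here for simplicity (many current models use other activations), and the weights are hand-picked so the result is easy to verify; real weights are learned.

```python
def relu(x):
    return max(0.0, x)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def feed_forward(vector, w_in, b_in, w_out, b_out):
    """Expand, apply the nonlinearity, then project back to the input width."""
    hidden = [relu(h + b) for h, b in zip(matvec(w_in, vector), b_in)]
    return [o + b for o, b in zip(matvec(w_out, hidden), b_out)]


# Hypothetical weights: width 2 -> hidden width 4 -> width 2.
# With these values the network computes the absolute value of each input,
# something a single matrix multiplication cannot do.
W_IN = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B_IN = [0.0, 0.0, 0.0, 0.0]
W_OUT = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
B_OUT = [0.0, 0.0]

tokens = [[3.0, -2.0], [-1.0, 4.0]]
# Each token is processed on its own; no information moves between tokens.
outputs = [feed_forward(t, W_IN, B_IN, W_OUT, B_OUT) for t in tokens]
print(outputs)  # [[3.0, 2.0], [1.0, 4.0]]
```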

Residual connections + normalization

These are technical touches that make the gradient flow well during training. Inputs to each block are added to its outputs (residual), and normalized. You don’t need to think about these; just know they’re crucial for training to work at all.
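For the curious, both pieces fit in a few lines. The sketch assumes the “pre-norm” arrangement (normalize, apply the sublayer, add the input back), which is one common ordering, and it omits the learned scale and shift that real layer normalization includes.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def residual_block(vector, sublayer):
    """Normalize, apply the sublayer, then add the original input back."""
    update = sublayer(layer_norm(vector))
    return [x + u for x, u in zip(vector, update)]


# A sublayer that contributes nothing leaves the input untouched, which is
# why stacking many blocks does not wash out the original signal.
unchanged = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
print(unchanged)  # [1.0, 2.0, 3.0]
```

Here `sublayer` stands in for either attention or the feed-forward network.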

Stack of blocks

You then stack many of these blocks. Each block lets information flow further between tokens. After dozens of layers of attention + feed-forward, each position’s vector is a rich representation of that token in its full context.


Step 4: The output head

After all the blocks, you have a vector for each input position. To produce the next token:

  • Take the vector at the last position
  • Multiply it by a big “unembedding” matrix to produce logits — one score per token in the vocabulary (typically ~50,000 to ~200,000 tokens)
  • Apply softmax to turn logits into probabilities
  • Sample one token from the distribution (see Temperature & sampling)
  • Append the chosen token to the input

Then repeat — feed the now-longer sequence through the model again, get the next token, repeat. One token at a time, autoregressively, until the model emits a special “end of turn” token or hits a stopping condition.
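The loop above can be sketched with a stand-in for the model. Here `toy_logits` is a hypothetical rule (always favor the next vocabulary entry) that replaces the entire transformer, and the four-entry vocabulary is made up; only the softmax, pick, append, and stop logic mirrors the real procedure.

```python
import math
import random

VOCAB = ["<end>", "the", "cat", "sat"]  # made-up vocabulary; ID 0 ends the turn


def toy_logits(token_ids):
    """Stand-in for the model: one score per vocabulary entry.

    Hypothetical rule: strongly prefer the ID after the last token, wrapping
    around to the end-of-turn token.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10, greedy=True, seed=0):
    """Autoregressive decoding: predict, append, repeat until a stop."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]
        if next_id == 0:  # end-of-turn token: stop generating
            break
        ids.append(next_id)  # the output becomes part of the next input
    return ids


result = generate([1])
print([VOCAB[i] for i in result])  # ['the', 'cat', 'sat']
```

With `greedy=False` the loop samples from the distribution instead of always taking the top token, so lower-probability tokens occasionally appear.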


What “training” actually does

Pre-training is the process of figuring out the model’s billions of parameters: all those embedding matrices, attention weights, feed-forward weights.

The setup is brutally simple:

  1. Take a huge chunk of text (a paragraph, a page) and tokenize it.
  2. Feed all but the last token into the model.
  3. Ask: “what’s the next token?” (In practice the model is scored on its prediction at every position in the chunk at once, not only the last one.)
  4. Compare the model’s prediction to the actual next token.
  5. Compute the loss (how wrong was the model?).
  6. Backpropagate: adjust every parameter in the direction that would have made the model less wrong.
  7. Repeat across trillions of tokens of text.

Over months of training on thousands of GPUs, the model’s parameters converge to a configuration that does very well at predicting the next token in arbitrary text. That’s pre-training.
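The predict, compare, adjust cycle can be shown end to end on a deliberately tiny model. The sketch below is not a transformer: it assumes a lookup table of logits with one row per previous token, trained on a made-up nine-token corpus. The loss (cross-entropy) and the update rule (gradient descent) are the same ideas used at full scale.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits_table, context_id, target_id, learning_rate=0.5):
    """One gradient step on the row for context_id; returns the loss before it."""
    probs = softmax(logits_table[context_id])
    loss = -math.log(probs[target_id])  # cross-entropy: low if target was likely
    for i in range(len(probs)):
        # Gradient of the loss with respect to logit i is prob_i - one_hot_i.
        grad = probs[i] - (1.0 if i == target_id else 0.0)
        logits_table[context_id][i] -= learning_rate * grad
    return loss


corpus = [0, 1, 2, 0, 1, 2, 0, 1, 2]   # token IDs of a tiny repeating "text"
table = [[0.0] * 3 for _ in range(3)]  # starts out knowing nothing

losses = []
for epoch in range(50):
    total = 0.0
    for context, target in zip(corpus, corpus[1:]):
        total += train_step(table, context, target)
    losses.append(total / (len(corpus) - 1))

print(losses[0] > losses[-1])  # True: the model got less wrong over time
```

After training, the row for token 0 puts its highest score on token 1, because that is what always followed 0 in the corpus.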

Post-training then refines the base model:

  • Supervised fine-tuning (SFT): show the model examples of “helpful assistant” responses; nudge it toward producing similar ones.
  • Reinforcement Learning from Human Feedback (RLHF): human raters rank pairs of responses for quality; the model is then trained to produce more of the preferred ones.
  • Constitutional AI / RLAIF (Anthropic’s twist): define a written “constitution” of principles; use AI feedback to refine outputs against those principles, with less reliance on human labelers.
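To give a feel for how a ranking becomes a training signal, here is one common formulation, sketched under an assumption the text above does not spell out: that a separate scoring model assigns each response a number, and is trained so the preferred response scores higher. The function name and the scores are hypothetical.

```python
import math


def preference_loss(score_chosen, score_rejected):
    """Pairwise loss, -log(sigmoid(margin)): small when the preferred
    response already scores higher than the rejected one."""
    margin = score_chosen - score_rejected
    return math.log(1.0 + math.exp(-margin))


print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

A pair ranked correctly with a wide margin yields a loss near zero, so training pushes hardest on the pairs the scorer currently gets wrong.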

After post-training, you have an “assistant” model — Claude, GPT, etc. — that actually behaves usefully.


A concrete example: what happens when you ask “What’s 2+2?”

  1. Tokenize: [“What”, “’s”, “ 2”, “+”, “2”, “?”] → 6 token IDs.
  2. Embed + position-encode: 6 vectors, each ~4096 numbers.
  3. Run through all transformer blocks: attention lets the tokens “talk.” The “2”s attend to each other, the “+” attends to both numbers, and the “?” sets the expected output as an answer.
  4. Output head: the last position’s vector is processed, producing logits. The token for “4” has a much higher logit than other digits.
  5. Sample: at low temperature, “4” is picked.
  6. Repeat for more output: the next iteration produces an end-of-turn token (since 4 was the answer).
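Steps 4 and 5 can be sketched with made-up numbers. The candidate tokens and logits below are hypothetical, chosen only so that “4” leads; the point is how dividing the logits by a temperature sharpens or flattens the resulting distribution.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


CANDIDATES = ["4", "5", "3", "22"]
LOGITS = [8.0, 4.0, 4.0, 3.0]  # invented values; "4" has the highest logit

for temperature in (2.0, 1.0, 0.1):
    probs = softmax_with_temperature(LOGITS, temperature)
    print(temperature, CANDIDATES[0], round(probs[0], 3))
```

At a temperature of 0.1 nearly all the probability lands on “4”; at 2.0 the other candidates keep a meaningful share.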

The whole thing — for this short input — takes maybe 100ms. The forward pass through all the transformer blocks is the bulk of the work; everything else is fast.

The model didn’t compute 2+2 the way a calculator does. It pattern-matched: “questions of the form ‘what’s N+M?’ typically have answer tokens that look like sums of N and M.” It “knows” 2+2=4 because that pattern is extremely well-represented in its training data. For larger numbers, this breaks down — the model can fail arithmetic the way a child who memorized small sums fails at long multiplication.


What architectural details mean in practice

A few connections from architecture to behavior:

| Architecture fact | What it means in practice |
| --- | --- |
| Attention is O(n²) in context length | Long contexts are quadratically more expensive than short ones. Limits practical context window. |
| Every token attends to every other token | The model can connect ideas across the prompt, but quality degrades when the prompt gets very long. |
| Each forward pass produces one token | Generating 1000 tokens means 1000 forward passes. Long outputs cost much more than long inputs. |
| Training is autoregressive prediction | The model is fundamentally trained to predict what comes next, not to be truthful or helpful. Those come from post-training. |
| Token probabilities, not “knowledge” | The model doesn’t have a knowledge base; it has weighted probabilities. “Knowledge” is a useful fiction. |
| Models can’t update their weights | Models can’t learn from your conversation. Memory has to be built externally. |
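The quadratic cost of attention is easy to check with a few lines of arithmetic. The sketch counts only the entries in one attention score matrix (one layer, one head) and ignores every other cost, so treat the numbers as a scaling illustration rather than a performance estimate.

```python
def attention_score_count(context_length: int) -> int:
    """Entries in one score matrix: every token against every token."""
    return context_length * context_length


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):>15,} scores")

# A 10x longer context means 100x more scores to compute.
print(attention_score_count(10_000) // attention_score_count(1_000))  # 100
```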

What’s next in the architecture

The basic transformer has been remarkably stable since 2017. Newer models add:

  • Mixture of Experts (MoE) — instead of running every token through every parameter, route tokens to “expert” sub-networks. More efficient at scale.
  • Sparse attention — instead of every-token-attends-to-every-token, attend only to a structured subset. Reduces the n² cost; enables much longer context.
  • State-space models (Mamba, RWKV) — alternative architectures that don’t use attention. Faster for long sequences. Mostly research-stage in 2026 but advancing.
  • Multimodal architectures — same transformer principles, applied to images, audio, video. The same “tokens” framework, just different tokenizers.
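As an illustration of the Mixture of Experts idea, here is a minimal routing sketch. It assumes top-2 routing with renormalized weights, which is one common design, and it takes the router’s scores as an input; in a real model a small learned layer computes them from the token’s vector. The experts here are trivial stand-ins.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=2):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}


def moe_layer(token_vector, experts, router_logits, top_k=2):
    """Run only the chosen experts and blend their outputs by gate weight."""
    output = [0.0] * len(token_vector)
    for index, gate in route(router_logits, top_k).items():
        expert_output = experts[index](token_vector)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output


# Four hypothetical experts; expert k just multiplies its input by k.
experts = [lambda v, k=k: [k * x for x in v] for k in range(4)]
gates = route([0.0, 5.0, 1.0, 4.0])
print(sorted(gates))  # [1, 3]: only two of the four experts run
```

The efficiency gain comes from the experts that are skipped: their parameters exist in the model but cost nothing for this token.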

But for now, “frontier model” = “very large transformer, very well trained, with attention + feed-forward + careful tuning.” That’s the dominant paradigm, and it’s likely to remain for at least a few more years.


Common gotchas / misconceptions

  • The model doesn’t “look up” facts. It pattern-matches based on its training. Facts can be wrong, especially niche or recent ones.

  • The model doesn’t “search” the internet. It only knows what was in its training data, unless given a tool.

  • Attention isn’t human attention. It’s a mathematical operation that produces useful weighted sums. Anthropomorphizing it (“the model is focusing on…”) is fine for intuition but misleading if taken literally.

  • Bigger isn’t always smarter. Past a point, returns diminish. Better data, better training procedures, and tool use can all outperform raw parameter scaling.

  • Emergence is real but mis-described. Larger models do show capabilities small models lack (“emergent abilities”). The phenomenon is real; the mechanism is still debated.

  • The architecture isn’t sufficient for human-like intelligence. Transformers are excellent at language; they don’t have persistent goals, embodiment, or sustained reasoning across truly long horizons without external scaffolding. The gap between “very capable” and “human-like” remains.


See also


Sources