Multimodal (vision, audio)

Status: 🟩 COMPLETE (🟦 LIVING: modalities and quality improve fast)
Last updated: 2026-06-19
Plain-English tagline: LLMs that can do more than read and write text: they can see images, hear audio, and increasingly understand video. Same underlying architecture; different sensory inputs.

In plain English

A traditional LLM only handles text. You give it text, it gives you text back. A multimodal model handles multiple modalities — most commonly text + images, sometimes also audio and video.

Modern Claude can:

Read an image you paste and describe what’s in it
Look at a screenshot of a webpage and tell you what’s wrong with the layout
Read a chart and explain the trend
Look at a handwritten note and transcribe it

GPT-4o, Gemini, and others have similar capabilities. Some also do speech — listen to audio and respond in speech.

This isn’t magic glued onto an LLM. The underlying transformer architecture is the same. The model is trained to treat images (or audio, or video) as another kind of “token sequence.” A 224×224 image becomes a few hundred image tokens; from there the model processes it just like text.

What this unlocks is huge: anything you can show a model — receipts, diagrams, photos of code, x-rays, satellite images, signed forms — becomes input it can reason about. Combined with tool use and agents, this turns LLMs into systems that can perceive and act on the visual world.

Why it matters

Massively expands what AI can do. Vision-only data (charts, photos, screenshots, scanned documents, design mockups) is no longer locked away from LLM workflows.
Often the lowest-friction path for tasks that involve documents or UI. Screenshot a Slack thread and ask “summarize.” Paste an image of a receipt and ask “what was the total?”
Increasingly part of “agentic” systems. Claude Code can drive a browser via the claude-in-chrome MCP and look at screenshots of the page to know what’s there. Computer use (mcp__computer-use) takes screenshots of the desktop and decides what to click.
Frontier capability that’s getting cheaper. Vision used to be a premium feature; in 2026 it’s standard in most modern Claude / GPT / Gemini tiers.

What modalities models support (mid-2026)

Modality	Claude	GPT-4o family	Gemini 2.5	Open models
Text in/out	✅	✅	✅	✅
Image in	✅	✅	✅	Some (Llama vision, Qwen-VL)
Image out (generation)	❌ (separate models like Stable Diffusion, Imagen, DALL-E)	✅	✅	Many
Audio in (speech recognition)	Some via tool integrations	✅	✅	Whisper
Audio out (speech synthesis)	Some via tool integrations	✅	✅	Many TTS models
Video in	Limited / via frames	Limited	✅ (native)	Limited
Video out	❌ (separate models like Sora, Veo)	❌	❌	Many specialized

Specifics shift monthly. The general pattern: text in/out and image input are standard; image output and audio are increasingly built in; video is the cutting edge.

How vision works (briefly)

Conceptually:

The image is split into patches (small tiles, e.g. 16×16 pixels each)
Each patch is converted to a vector (“image embedding”) via a vision encoder
These vectors are inserted into the model’s token stream alongside text tokens
The model processes them with the same attention mechanism it uses for text

The model has been trained on (image, caption) pairs, (image, question, answer) datasets, and various visual reasoning tasks. After training, it can reason over the patch tokens just as it does text tokens.

With 16×16-pixel patches, a 224×224 image becomes a 14×14 grid, or 196 image tokens (one token per patch). Higher resolutions use more tokens, so a high-res image can cost more in input tokens than a paragraph of text.
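The patch-splitting step can be sketched in a few lines. This is a minimal illustration, assuming a grayscale image stored as a row-major byte array and a fixed 16×16 patch size; the function name `patchify` is hypothetical, and production models use their own (often more elaborate) tiling schemes.

```typescript
// Split a grayscale image (row-major pixel array) into square patches.
// Each patch is flattened into its own array; a real vision encoder would
// then project each flattened patch to an embedding vector.
function patchify(
  pixels: Uint8Array,
  width: number,
  height: number,
  patchSize: number,
): Uint8Array[] {
  if (pixels.length !== width * height) {
    throw new Error("pixel buffer does not match width × height");
  }
  if (width % patchSize !== 0 || height % patchSize !== 0) {
    throw new Error("image dimensions must be multiples of patchSize");
  }
  const patches: Uint8Array[] = [];
  for (let top = 0; top < height; top += patchSize) {
    for (let left = 0; left < width; left += patchSize) {
      const patch = new Uint8Array(patchSize * patchSize);
      for (let row = 0; row < patchSize; row++) {
        // Copy one row of the patch out of the full image.
        const start = (top + row) * width + left;
        patch.set(pixels.subarray(start, start + patchSize), row * patchSize);
      }
      patches.push(patch);
    }
  }
  return patches;
}

const image = new Uint8Array(224 * 224); // blank 224×224 image
const patches = patchify(image, 224, 224, 16);
console.log(patches.length);    // 196 patches (a 14 × 14 grid)
console.log(patches[0].length); // 256 pixel values per patch (16 × 16)
```

Doubling each side of the image quadruples the patch count, which is why resolution drives token cost so strongly.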

How audio works

Two distinct directions:

Speech-to-text (transcription)

Input: an audio file. Output: a text transcript. Examples: OpenAI Whisper, Google’s Speech-to-Text, Anthropic’s voice integrations.

Once transcribed, the rest is a normal text LLM call.

Text-to-speech (synthesis)

Input: text. Output: an audio file in a chosen voice. Examples: ElevenLabs, OpenAI’s TTS, Google’s Text-to-Speech.

True audio-in models

Some newer models accept raw audio without an intermediate transcription step — they process audio tokens directly. GPT-4o’s “advanced voice mode” works this way. Faster, more natural turn-taking, can pick up tone of voice and background sounds.

A concrete example: image input via the Claude API

import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";
 
const client = new Anthropic();
 
const imageData = fs.readFileSync("./receipt.jpg").toString("base64");
 
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/jpeg",
          data: imageData
        }
      },
      {
        type: "text",
        text: "Extract the date, total, and merchant from this receipt as JSON."
      }
    ]
  }]
});
 
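// The reply is a list of content blocks; check the block type before reading text.
const first = response.content[0];
if (first.type === "text") {
  console.log(first.text);
  // e.g. { "date": "2026-06-19", "total": "$42.50", "merchant": "Coffee Shop" }
}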

Plain text task on the surface; vision capability does the work.

Images can also be referenced by URL instead of base64 — saves you embedding the bytes in the request:

{
  "type": "image",
  "source": { "type": "url", "url": "https://example.com/receipt.jpg" }
}
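If a codebase sends images both ways, a small helper can keep the two shapes consistent. The sketch below is illustrative only: the `ImageSource` and `ImageBlock` types are local declarations that mirror the two request shapes shown above (they are not imported from the SDK), and `imageBlock` is a hypothetical helper name.

```typescript
// Local types mirroring the two request shapes shown above; these are
// illustrative declarations, not types imported from the SDK.
type ImageSource =
  | { type: "base64"; media_type: string; data: string }
  | { type: "url"; url: string };

interface ImageBlock {
  type: "image";
  source: ImageSource;
}

// Hypothetical helper: use the URL form for http(s) references,
// otherwise treat the input as already-encoded base64 data.
function imageBlock(ref: string, mediaType = "image/jpeg"): ImageBlock {
  if (/^https?:\/\//i.test(ref)) {
    return { type: "image", source: { type: "url", url: ref } };
  }
  return {
    type: "image",
    source: { type: "base64", media_type: mediaType, data: ref },
  };
}

console.log(JSON.stringify(imageBlock("https://example.com/receipt.jpg")));
// {"type":"image","source":{"type":"url","url":"https://example.com/receipt.jpg"}}
```

The resulting object can be placed in the `content` array of a user message, next to the text block.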

Common multimodal use cases

Vision in production today

Document extraction — receipts, invoices, ID cards → structured JSON
Visual QA — “what’s wrong with this CSS?” with a screenshot
Accessibility — describing images for visually impaired users
Content moderation — detecting prohibited content
Visual search — find products similar to this photo
OCR — handwritten notes → typed text
Chart understanding — extracting trends and numbers from charts
Diagram explanation — understanding architecture diagrams, flowcharts
Code from screenshots — pasted UI mockup → React component
Robotics / autonomous systems — vision-language models for embodied AI

Audio in production

Meeting transcription — recorded meetings → searchable transcripts
Voice assistants — Siri, Alexa-style interfaces
Real-time translation — speak in one language, hear another
Podcast / video summaries — long audio → short summary
Accessibility captions — auto-generated captions on video
Sound classification — what’s in this audio (music, speech, alarms, etc.)
Voice cloning — replicate a specific voice (significant ethics implications)

Video

Action recognition — what’s happening in a video
Summarization — long video → text summary
Searching video content — find the moment when X happened
Live captioning — real-time text overlay

Cost considerations

Image tokens cost real money. A high-res image might cost 1000–3000 tokens in input. At Sonnet’s $3 per 1M input tokens, that’s $0.003–$0.009 per image. Cheap for occasional use; adds up at scale.
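The arithmetic is easy to script when budgeting. This is a rough sketch that assumes the $3 per million input tokens rate quoted above and ignores output tokens and any accompanying text; the function name and the `imagesPerDay` parameter are illustrative.

```typescript
// Assumed price, taken from the text: $3 per 1M input tokens.
const USD_PER_MILLION_INPUT_TOKENS = 3;

// Input-token cost of sending images, in US dollars.
function imageCostUsd(tokensPerImage: number, imagesPerDay = 1): number {
  return (tokensPerImage * imagesPerDay * USD_PER_MILLION_INPUT_TOKENS) / 1_000_000;
}

console.log(imageCostUsd(1000).toFixed(3));          // 0.003
console.log(imageCostUsd(3000).toFixed(3));          // 0.009
console.log(imageCostUsd(3000, 100_000).toFixed(2)); // 900.00
```

At 100,000 high-res images a day, the per-image fraction of a cent becomes roughly $900 a day in input tokens alone.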

To control:

Resize images before sending. A 1024×1024 image is usually plenty for “look at this and tell me.” Don’t send 4K originals.
Lower quality JPEG. Visual reasoning rarely needs lossless.
Crop to what matters. Sending a screenshot? Crop out chrome and background.
Use prompt caching for the static visual context (e.g. a reference image that’s always the same).
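For the resize step, the target dimensions can be computed before handing the image to whatever image library you use. The sketch below only does the dimension math; the 1024-pixel default follows the suggestion above, and `fitWithin` is a hypothetical helper name. The actual pixel resampling is left to an image library and is not shown.

```typescript
// Compute dimensions that fit within maxSide while preserving aspect ratio.
// Never upscales: images that are already small enough come back unchanged.
function fitWithin(
  width: number,
  height: number,
  maxSide = 1024,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxSide) return { width, height };
  const scale = maxSide / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

console.log(fitWithin(3840, 2160)); // { width: 1024, height: 576 }
console.log(fitWithin(800, 600));   // { width: 800, height: 600 }
```

A 4K screenshot shrinks to about 7% of its original pixel count this way, with a corresponding drop in image tokens.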

For audio:

Transcription is typically billed per minute of audio.
TTS is billed per character of text.

Limitations and quirks

Spatial precision can be weak. “What’s at coordinates (450, 320)?” — current models struggle with exact pixel positions. They’re better at “the menu is in the top right.”
Counting. “How many people in this photo?” can be approximate, not exact.
Text in images. OCR has improved but isn’t always perfect. Small or stylized text is hard.
Photorealism vs cartoon. Models trained on real photos can handle drawings and cartoons but with less accuracy.
Faces. Models generally won’t identify specific people by name (for privacy / policy reasons).
Charts/graphs. Extracting exact numbers from a chart has improved but is still error-prone; verify the values if precision matters.
Animation / motion. A series of images implying motion is harder than a single image showing what’s happening.

Combining modalities

The most powerful patterns combine modalities:

Vision + tool use

The model looks at a screenshot and decides which button to click via a click_element tool. This is how computer-use agents work.

Text + image + structured output

Show an image, request JSON output, and define the schema via a tool definition. The model then returns the extracted fields in a predictable structure that your code can validate.
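As a sketch of this pattern, the snippet below pairs a tool definition with a validator for what comes back. Everything named here is an assumption for illustration: the tool name `record_receipt`, its fields, and the `parseReceipt` helper are hypothetical. The `name` / `description` / `input_schema` layout is the JSON Schema style used for tool definitions; the API call itself is omitted.

```typescript
// Hypothetical tool definition; the name and fields are illustrative.
const recordReceiptTool = {
  name: "record_receipt",
  description: "Record the fields extracted from a receipt image.",
  input_schema: {
    type: "object",
    properties: {
      date: { type: "string", description: "ISO 8601 date, e.g. 2026-06-19" },
      total: { type: "number", description: "Grand total as a number" },
      merchant: { type: "string" },
    },
    required: ["date", "total", "merchant"],
  },
} as const;

interface Receipt {
  date: string;
  total: number;
  merchant: string;
}

// Validate the tool input the model returns before trusting it.
function parseReceipt(input: unknown): Receipt | null {
  if (typeof input !== "object" || input === null) return null;
  const r = input as Record<string, unknown>;
  if (typeof r.date !== "string" || !/^\d{4}-\d{2}-\d{2}$/.test(r.date)) return null;
  if (typeof r.total !== "number" || !Number.isFinite(r.total)) return null;
  if (typeof r.merchant !== "string" || r.merchant.length === 0) return null;
  return { date: r.date, total: r.total, merchant: r.merchant };
}

console.log(recordReceiptTool.name); // record_receipt
console.log(parseReceipt({ date: "2026-06-19", total: 42.5, merchant: "Coffee Shop" }));
console.log(parseReceipt({ date: "yesterday", total: "42.50" })); // null
```

Validating on your side matters because a schema guides the model's output but does not guarantee that the values read from the image are correct.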

Audio + text + agents

A voice assistant: speech-in → transcription → text LLM → tool calls → text response → speech-out. Each step uses different models linked into one system.
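The chain described above might be wired together roughly as follows. This is a sketch under stated assumptions: the stage names (`transcribe`, `respond`, `synthesize`) and their signatures are hypothetical, each stage is injected so any provider can be plugged in, and the stub stages at the bottom only stand in for real services.

```typescript
// Each stage is injected, so any speech-to-text, LLM, or TTS provider can be
// plugged in. The stage names and signatures here are hypothetical.
interface VoiceStages {
  transcribe: (audio: Uint8Array) => Promise<string>;
  respond: (userText: string) => Promise<string>; // text LLM, including any tool calls
  synthesize: (replyText: string) => Promise<Uint8Array>;
}

interface VoiceTurn {
  transcript: string;
  reply: string;
  audio: Uint8Array;
}

// Run one conversational turn: speech in, speech out.
async function voiceTurn(stages: VoiceStages, audioIn: Uint8Array): Promise<VoiceTurn> {
  const transcript = await stages.transcribe(audioIn);
  if (transcript.trim() === "") {
    throw new Error("empty transcript: nothing to respond to");
  }
  const reply = await stages.respond(transcript);
  const audio = await stages.synthesize(reply);
  return { transcript, reply, audio };
}

// Stub stages standing in for real services.
const stubStages: VoiceStages = {
  transcribe: async () => "what time is it",
  respond: async (text) => `You asked: ${text}`,
  synthesize: async (text) => new Uint8Array(text.length), // fake audio bytes
};

voiceTurn(stubStages, new Uint8Array([1, 2, 3])).then((turn) => {
  console.log(turn.reply); // You asked: what time is it
});
```

Because the stages run strictly in sequence, their latencies add up, which is one reason audio-native models can feel more responsive than a chained pipeline.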

Vision + RAG

Retrieve relevant text docs based on a user’s image query — e.g. visual search for a product, returning docs about the product.
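One common way to implement this is to embed the image and the documents into a shared vector space and rank documents by cosine similarity. The sketch below assumes such embeddings already exist (the three-dimensional vectors and document ids are toy values made up for illustration) and shows only the ranking step.

```typescript
interface Doc {
  id: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Return the k documents most similar to the query embedding.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosine(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.doc);
}

// Toy 3-dimensional embeddings; real ones come from a multimodal encoder.
const docs: Doc[] = [
  { id: "running-shoes-manual", embedding: [0.9, 0.1, 0.0] },
  { id: "espresso-machine-faq", embedding: [0.0, 0.2, 0.9] },
  { id: "trail-shoes-review", embedding: [0.8, 0.3, 0.1] },
];
const imageQuery = [1.0, 0.2, 0.0]; // pretend embedding of a shoe photo

console.log(topK(imageQuery, docs, 2).map((d) => d.id));
// [ 'running-shoes-manual', 'trail-shoes-review' ]
```

The retrieved documents are then passed to the model as text context alongside the original image.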

Common gotchas

Image size affects cost a lot more than you expect. A 4K screenshot is enormous in tokens. Resize aggressively.
media_type matters. Wrong media type (PNG declared as JPEG) → silent failure or weird output.
Base64 encoding adds about 33% to the size. Sending a 5MB image base64-encoded means a ~6.7MB request body.
URL-based image references must be reachable from Anthropic’s servers. Localhost won’t work; the image must be publicly accessible.
Privacy of image content. Anything sent to the API is processed by the model. Sensitive images (medical, personal docs) should be handled accordingly.
The model can describe but not interpret. “This shows a man holding a phone” — accurate. “He looks angry” — possibly true, possibly projection. Don’t ask for inferences beyond what’s visible.
Mixing many images can degrade quality. Sending 20 images in one prompt can confuse the model. It is often better to process them one at a time or split the work into a workflow.
Caching images is harder than caching text. The hash matters; even tiny image differences invalidate cache. Reuse exact bytes.
PII in images. OCR a passport, get the passport number in the output. Be careful with what you ask to extract.
alt text vs OCR. Asking “describe this image” and “transcribe text in this image” give different outputs. Be explicit.
Audio transcription accuracy varies by accent and quality. A noisy recording of a strong accent in a non-mainstream language is hardest.
Real-time audio (speech in, speech out) has different latency characteristics than the request/response API model. Plan UX accordingly.
Video is just a series of frames to most models today. Don’t expect deep temporal reasoning unless you use a video-native model.
Vision models can hallucinate visual content. “This image shows a cat” — when there’s no cat. Verify if it matters.
Region-specific models. Some multimodal capabilities are unavailable in certain countries/regions. Check.
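Two of the gotchas above, the declared media type and the base64 overhead, can be checked mechanically before a request is sent. The sketch below is one possible pre-flight check; the function names are hypothetical, and the signature table covers only JPEG, PNG, GIF, and WebP.

```typescript
// Detect the media type from the file's leading "magic" bytes rather than
// trusting the file extension. Returns null for unrecognized formats.
function sniffImageType(bytes: Uint8Array): string | null {
  const startsWith = (sig: number[], offset = 0): boolean =>
    sig.every((b, i) => bytes[offset + i] === b);
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif"; // "GIF8"
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return "image/webp"; // "RIFF" at byte 0, "WEBP" at byte 8
  }
  return null;
}

// Base64 emits 4 output characters for every 3 input bytes (with padding).
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
console.log(sniffImageType(pngHeader)); // image/png
console.log(base64Length(5_000_000));   // 6666668
```

Comparing the sniffed type against the `media_type` you are about to declare catches the PNG-declared-as-JPEG case before the request leaves your machine.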

Sources

Anthropic — Vision documentation
OpenAI — Vision (GPT-4 with vision)
OpenAI Whisper — speech-to-text
CLIP paper (OpenAI, 2021) — foundational vision-language model
Google — Gemini multimodal capabilities — current frontier on video
Hugging Face — Multimodal models

Tech & AI, Explained
