Multimodal (vision, audio)

Status: 🟩 COMPLETE (🟦 LIVING: modalities and quality improve fast)
Last updated: 2026-06-19
Plain-English tagline: LLMs that can do more than just read and write text. They can see images, hear audio, and increasingly understand video. Same underlying architecture; different sensory inputs.


In plain English

A traditional LLM only handles text. You give it text, it gives you text back. A multimodal model handles multiple modalities — most commonly text + images, sometimes also audio and video.

Modern Claude can:

  • Read an image you paste and describe what’s in it
  • Look at a screenshot of a webpage and tell you what’s wrong with the layout
  • Read a chart and explain the trend
  • Look at a handwritten note and transcribe it

GPT-4o, Gemini, and others have similar capabilities. Some also do speech — listen to audio and respond in speech.

This isn’t magic glued onto an LLM. The underlying transformer architecture is the same. The model is trained to treat images (or audio, or video) as another kind of “token sequence.” A 224×224 image becomes a few hundred image tokens; from there the model processes it just like text.

What this unlocks is huge: anything you can show a model — receipts, diagrams, photos of code, x-rays, satellite images, signed forms — becomes input it can reason about. Combined with tool use and agents, this turns LLMs into systems that can perceive and act on the visual world.


Why it matters

  • Massively expands what AI can do. Vision-only data (charts, photos, screenshots, scanned documents, design mockups) is no longer locked away from LLM workflows.
  • Often the lowest-friction path for tasks that involve documents or UI. Screenshot a Slack thread, ask “summarize.” Paste an image of a receipt, ask “what was the total?”
  • Increasingly part of “agentic” systems. Claude Code can drive a browser via the claude-in-chrome MCP and look at screenshots of the page to know what’s there. Computer use (mcp__computer-use) takes screenshots of the desktop and decides what to click.
  • Frontier capability that’s getting cheaper. Vision used to be a premium feature; in 2026 it’s standard in most modern Claude / GPT / Gemini tiers.

What modalities models support (mid-2026)

Modality | Claude | GPT-4o family | Gemini 2.5 | Open models
Text in/out | ✅ | ✅ | ✅ | ✅
Image in | ✅ | ✅ | ✅ | Some (Llama vision, Qwen-VL)
Image out (generation) | ❌ (separate models like Stable Diffusion, Imagen, DALL-E) | ✅ | ✅ | Many
Audio in (speech recognition) | Some via tool integrations | ✅ | ✅ | Whisper
Audio out (speech synthesis) | Some via tool integrations | ✅ | ✅ | Many TTS models
Video in | Limited / via frames | Limited | ✅ (native) | Limited
Video out | ❌ (separate models like Sora, Veo) | ❌ | ❌ | Many specialized

Specifics shift monthly. The general pattern: text in/out and image input are standard; image output depends on the vendor; audio is increasingly built in; video is the cutting edge.


How vision works (briefly)

Conceptually:

  1. The image is split into patches (small tiles, e.g. 16×16 pixels each)
  2. Each patch is converted to a vector (“image embedding”) via a vision encoder
  3. These vectors are inserted into the model’s token stream alongside text tokens
  4. The model processes them with the same attention mechanism it uses for text
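
Step 1 can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not how any production vision encoder is implemented: the image is a plain 2D array of grayscale values, `splitIntoPatches` is a hypothetical helper, and the encoder of step 2 is left out (a real encoder would apply a learned projection to each flattened patch).

```typescript
// Toy sketch: split a grayscale image (rows of pixel values) into square patches.
// The array-based image format and all names are illustrative assumptions.
type GrayImage = number[][];

function splitIntoPatches(image: GrayImage, patchSize: number): number[][] {
  const height = image.length;
  const width = height > 0 ? image[0].length : 0;
  if (height % patchSize !== 0 || width % patchSize !== 0) {
    throw new Error("Image dimensions must be multiples of patchSize");
  }
  const patches: number[][] = [];
  // Walk the patch grid in row-major order (left to right, top to bottom).
  for (let top = 0; top < height; top += patchSize) {
    for (let left = 0; left < width; left += patchSize) {
      const patch: number[] = [];
      for (let y = top; y < top + patchSize; y++) {
        for (let x = left; x < left + patchSize; x++) {
          patch.push(image[y][x]);
        }
      }
      patches.push(patch); // one flattened patch corresponds to one image token
    }
  }
  return patches;
}

// A 4x4 image with 2x2 patches yields 4 patches of 4 values each.
const tinyImage: GrayImage = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
  [9, 10, 11, 12],
  [13, 14, 15, 16],
];
const tinyPatches = splitIntoPatches(tinyImage, 2);
console.log(tinyPatches.length); // 4
console.log(tinyPatches[0]); // the top-left patch: 1, 2, 5, 6
```

The row-major order matters: it is what lets the model relate a position in the token sequence back to a position in the image.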

The model has been trained on (image, caption) pairs, (image, question, answer) datasets, and various visual reasoning tasks. After training, it can reason over the patch tokens just as it does text tokens.

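The token count is the size of the patch grid. With 16×16-pixel patches, a 224×224 image becomes a 14×14 grid, or 196 image tokens; with 14×14-pixel patches it becomes a 16×16 grid, or 256. Higher resolutions use more tokens, so a high-res image can cost more input tokens than a paragraph of text.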
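
The arithmetic is easy to check. A minimal sketch, assuming square, non-overlapping patches and rounding up at the edges (an assumption; real models differ in how they resize, pad, or tile an image): a 224×224 image gives a 16×16 grid (256 tokens) with 14-pixel patches and a 14×14 grid (196 tokens) with 16-pixel patches.

```typescript
// Number of patch tokens for an image, assuming square, non-overlapping patches.
// Partial patches at the right and bottom edges are counted as full ones.
function patchTokenCount(width: number, height: number, patchSize: number): number {
  const columns = Math.ceil(width / patchSize);
  const rows = Math.ceil(height / patchSize);
  return columns * rows;
}

console.log(patchTokenCount(224, 224, 14)); // 256 (16 x 16 grid)
console.log(patchTokenCount(224, 224, 16)); // 196 (14 x 14 grid)
console.log(patchTokenCount(1024, 1024, 16)); // 4096 (64 x 64 grid)
```

Token count grows with the square of the edge length, which is why resolution dominates image cost.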


How audio works

Two distinct directions, plus a newer class of models that handle audio natively:

Speech-to-text (transcription)

Input: an audio file. Output: a text transcript. Examples: OpenAI Whisper, Google’s Speech-to-Text, Anthropic’s voice integrations.

Once transcribed, the rest is a normal text LLM call.

Text-to-speech (synthesis)

Input: text. Output: an audio file in a chosen voice. Examples: ElevenLabs, OpenAI’s TTS, Google’s Text-to-Speech.

True audio-in models

Some newer models accept raw audio without an intermediate transcription step; they process audio tokens directly. GPT-4o’s “advanced voice mode” works this way. The result is lower latency, more natural turn-taking, and the ability to pick up tone of voice and background sounds.


A concrete example: image input via the Claude API

import Anthropic from "@anthropic-ai/sdk";
import fs from "fs";
 
const client = new Anthropic();
 
const imageData = fs.readFileSync("./receipt.jpg").toString("base64");
 
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      {
        type: "image",
        source: {
          type: "base64",
          media_type: "image/jpeg",
          data: imageData
        }
      },
      {
        type: "text",
        text: "Extract the date, total, and merchant from this receipt as JSON."
      }
    ]
  }]
});
 
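// The first content block is expected to be text; check its type before reading it.
const first = response.content[0];
if (first?.type === "text") console.log(first.text);
// Illustrative output (actual model output will vary):
// { "date": "2026-06-19", "total": "$42.50", "merchant": "Coffee Shop" }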

Plain text task on the surface; vision capability does the work.

Images can also be referenced by URL instead of base64 — saves you embedding the bytes in the request:

{
  "type": "image",
  "source": { "type": "url", "url": "https://example.com/receipt.jpg" }
}
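
Since the two source forms differ only in shape, a small helper can build either one. A sketch, assuming the block shapes shown above; `imageBlock` and its argument names are hypothetical and not part of the SDK.

```typescript
// Image content block shapes, mirroring the base64 and URL examples above.
type ImageMediaType = "image/jpeg" | "image/png" | "image/gif" | "image/webp";

type ImageBlock =
  | { type: "image"; source: { type: "base64"; media_type: ImageMediaType; data: string } }
  | { type: "image"; source: { type: "url"; url: string } };

// Hypothetical helper: build a URL-based block or a base64 block from one argument.
function imageBlock(
  input: { url: string } | { base64: string; mediaType: ImageMediaType },
): ImageBlock {
  if ("url" in input) {
    return { type: "image", source: { type: "url", url: input.url } };
  }
  return {
    type: "image",
    source: { type: "base64", media_type: input.mediaType, data: input.base64 },
  };
}

const blockFromUrl = imageBlock({ url: "https://example.com/receipt.jpg" });
// "AAAA" is a placeholder string, not a real encoded image.
const blockFromBytes = imageBlock({ base64: "AAAA", mediaType: "image/png" });
console.log(JSON.stringify(blockFromUrl));
console.log(JSON.stringify(blockFromBytes));
```

Either result can be placed in the `content` array of a user message next to a text block.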

Common multimodal use cases

Vision in production today

  • Document extraction — receipts, invoices, ID cards → structured JSON
  • Visual QA — “what’s wrong with this CSS?” with a screenshot
  • Accessibility — describing images for visually impaired users
  • Content moderation — detecting prohibited content
  • Visual search — find products similar to this photo
  • OCR — handwritten notes → typed text
  • Chart understanding — extracting trends and numbers from charts
  • Diagram explanation — understanding architecture diagrams, flowcharts
  • Code from screenshots — pasted UI mockup → React component
  • Robotics / autonomous systems — vision-language models for embodied AI

Audio in production

  • Meeting transcription — recorded meetings → searchable transcripts
  • Voice assistants — Siri, Alexa-style interfaces
  • Real-time translation — speak in one language, hear another
  • Podcast / video summaries — long audio → short summary
  • Accessibility captions — auto-generated captions on video
  • Sound classification — what’s in this audio (music, speech, alarms, etc.)
  • Voice cloning — replicate a specific voice (significant ethics implications)

Video

  • Action recognition — what’s happening in a video
  • Summarization — long video → text summary
  • Searching video content — find the moment when X happened
  • Live captioning — real-time text overlay

Cost considerations

Image tokens cost real money. A high-res image might cost 1,000–3,000 input tokens. At Sonnet’s input price of $3 per million tokens, that is roughly $0.003–$0.009 per image. Cheap for occasional use; it adds up at scale.
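
A back-of-the-envelope estimator makes the scale visible. This is a sketch under two labeled assumptions: an input price of $3 per million tokens (the rate implied by the per-image range above), and the approximate rule of thumb from Anthropic's vision documentation that an image costs about width × height / 750 tokens. Check both against current documentation before relying on them.

```typescript
// Rough per-image input cost. The token formula and the price are assumptions
// to be replaced with current numbers from the provider's documentation.
function estimateImageTokens(width: number, height: number): number {
  // Assumed rule of thumb: tokens ≈ (width × height) / 750.
  return Math.ceil((width * height) / 750);
}

function inputCostUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens * usdPerMillionTokens) / 1_000_000;
}

const ASSUMED_PRICE_PER_MILLION = 3; // USD per million input tokens (assumption)

console.log(inputCostUsd(1000, ASSUMED_PRICE_PER_MILLION)); // 0.003
console.log(inputCostUsd(3000, ASSUMED_PRICE_PER_MILLION)); // 0.009

const tokensFor1024 = estimateImageTokens(1024, 1024);
console.log(tokensFor1024); // 1399
// Cost of 10,000 such images:
console.log(inputCostUsd(tokensFor1024 * 10_000, ASSUMED_PRICE_PER_MILLION));
```

Fractions of a cent per image become tens of dollars per ten thousand images, which is where resizing starts to pay off.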

To control:

  • Resize images before sending. A 1024×1024 image is usually plenty for “look at this and tell me.” Don’t send 4K originals.
  • Lower quality JPEG. Visual reasoning rarely needs lossless.
  • Crop to what matters. Sending a screenshot? Crop out chrome and background.
  • Use prompt caching for the static visual context (e.g. a reference image that’s always the same).
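
The resize step comes down to picking target dimensions. A minimal sketch, assuming a cap on the longest edge; `fitWithin` is a hypothetical helper, and the actual resampling would be done by whatever image library is already in use.

```typescript
// Compute target dimensions that fit within maxEdge while preserving aspect ratio.
// Images that already fit are returned unchanged (no upscaling).
function fitWithin(
  width: number,
  height: number,
  maxEdge: number,
): { width: number; height: number } {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height };
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
  };
}

console.log(fitWithin(3840, 2160, 1024)); // { width: 1024, height: 576 }
console.log(fitWithin(800, 600, 1024)); // { width: 800, height: 600 }
```

Scaling a 4K screenshot down to 1024×576 cuts the pixel count by a factor of about 14, and token cost falls roughly in proportion.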

For audio:

  • Transcription is typically billed per minute of audio.
  • TTS is billed per character of text.

Limitations and quirks

  • Spatial precision can be weak. “What’s at coordinates (450, 320)?” — current models struggle with exact pixel positions. They’re better at “the menu is in the top right.”
  • Counting. “How many people in this photo?” can be approximate, not exact.
  • Text in images. OCR has improved but isn’t always perfect. Small or stylized text is hard.
  • Photorealism vs cartoon. Models trained on real photos can handle drawings and cartoons but with less accuracy.
  • Faces. Models generally won’t identify specific people by name (for privacy / policy reasons).
  • Charts/graphs. Extracting exact numbers from a chart has improved but is still error-prone; verify if precision matters.
  • Animation / motion. A series of images implying motion is harder than a single image showing what’s happening.

Combining modalities

The most powerful patterns combine modalities:

Vision + tool use

The model looks at a screenshot and decides which button to click via a click_element tool. This is how computer-use agents work.
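
The control flow is a perceive, decide, act loop. A sketch with every interface hypothetical: `Environment` stands in for the automation layer, `decide` stands in for the vision model call that returns a `click_element` action or a final answer, and the step limit is an arbitrary safety cap.

```typescript
// Sketch of a screenshot -> decide -> act loop. All types here are illustrative;
// a real agent would call a model API inside decide().
type AgentAction =
  | { kind: "click_element"; elementId: string }
  | { kind: "done"; summary: string };

interface Environment {
  screenshot(): Promise<string>; // base64-encoded image of the current screen
  click(elementId: string): Promise<void>;
}

type Decide = (goal: string, screenshotBase64: string) => Promise<AgentAction>;

async function runAgent(
  goal: string,
  env: Environment,
  decide: Decide,
  maxSteps = 10,
): Promise<string> {
  for (let step = 0; step < maxSteps; step++) {
    const shot = await env.screenshot(); // perceive
    const action = await decide(goal, shot); // reason over the image
    if (action.kind === "done") return action.summary;
    await env.click(action.elementId); // act, then loop to look again
  }
  throw new Error(`Gave up after ${maxSteps} steps`);
}
```

Taking a fresh screenshot after every action is the important design choice: the model never assumes its click worked, it looks again.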

Text + image + structured output

Show an image, request JSON output, define the schema via tool definitions. The model returns extracted info reliably.
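
A sketch of what that looks like for receipt extraction. The `tools`, `input_schema`, and `tool_choice` fields and the `tool_use` response block follow the Anthropic Messages API; the tool name `record_receipt`, its fields, and the `extractToolInput` helper are illustrative assumptions.

```typescript
// A tool whose input schema describes the JSON we want back.
const receiptTool = {
  name: "record_receipt",
  description: "Record fields extracted from a receipt image.",
  input_schema: {
    type: "object",
    properties: {
      date: { type: "string", description: "ISO 8601 date, e.g. 2026-06-19" },
      total: { type: "number", description: "Grand total as a number" },
      merchant: { type: "string" },
    },
    required: ["date", "total", "merchant"],
  },
};

// Extra request fields: offer the tool and force the model to call it.
const extractionParams = {
  tools: [receiptTool],
  tool_choice: { type: "tool", name: receiptTool.name },
};

// Minimal shape of the response content blocks this sketch cares about.
type ResponseBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown };

// Pull the structured input out of the first matching tool_use block.
function extractToolInput(content: ResponseBlock[], toolName: string): unknown {
  for (const block of content) {
    if (block.type === "tool_use" && block.name === toolName) return block.input;
  }
  throw new Error(`No tool_use block for ${toolName}`);
}
```

The fields in `extractionParams` would be merged into the request alongside `model`, `max_tokens`, and the image message. The returned input should still be validated before use, since the schema guides the model's output rather than guaranteeing it.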

Audio + text + agents

A voice assistant: speech-in → transcription → text LLM → tool calls → text response → speech-out. Each step uses different models linked into one system.
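
The pipeline can be sketched as three injected stages, which keeps each provider swappable. All names below are hypothetical; tool calls are assumed to happen inside `respond`.

```typescript
// Each stage is an injected function, so any speech-to-text, LLM, or
// text-to-speech provider can be plugged in. Interfaces are illustrative.
interface VoiceStages {
  transcribe(audio: Uint8Array): Promise<string>;
  respond(userText: string): Promise<string>; // text LLM, possibly calling tools
  synthesize(replyText: string): Promise<Uint8Array>;
}

async function handleUtterance(audio: Uint8Array, stages: VoiceStages) {
  const userText = await stages.transcribe(audio); // speech in -> text
  const replyText = await stages.respond(userText); // text -> text
  const replyAudio = await stages.synthesize(replyText); // text -> speech out
  return { userText, replyText, replyAudio };
}
```

Because the stages run one after another, their latencies add up; that is the main practical difference from a model that takes audio in and produces audio out directly.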

Vision + RAG

Retrieve relevant text docs based on a user’s image query — e.g. visual search for a product, returning docs about the product.
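
The retrieval step is a nearest-neighbor search. A sketch, assuming the image query and the documents have already been embedded into the same vector space by a multimodal embedding model such as CLIP or SigLIP (not shown); the three-dimensional vectors and document ids are made up for illustration.

```typescript
// Nearest-neighbor retrieval over precomputed embeddings.
interface EmbeddedDoc {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k documents most similar to the query, best match first.
function topK(queryEmbedding: number[], docs: EmbeddedDoc[], k: number): EmbeddedDoc[] {
  return [...docs]
    .sort(
      (x, y) =>
        cosineSimilarity(queryEmbedding, y.embedding) -
        cosineSimilarity(queryEmbedding, x.embedding),
    )
    .slice(0, k);
}

const catalog: EmbeddedDoc[] = [
  { id: "red-sneaker-manual", embedding: [0.9, 0.1, 0.0] },
  { id: "blue-kettle-manual", embedding: [0.0, 0.2, 0.9] },
  { id: "red-boot-manual", embedding: [0.7, 0.3, 0.1] },
];
const photoEmbedding = [1.0, 0.0, 0.0]; // stand-in for an embedded product photo
const matches = topK(photoEmbedding, catalog, 2).map((doc) => doc.id);
console.log(matches); // red-sneaker-manual first, then red-boot-manual
```

The retrieved documents are then passed to the LLM as ordinary text context, exactly as in text-only RAG.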


Common gotchas

  • Image size affects cost a lot more than you expect. A 4K screenshot is enormous in tokens. Resize aggressively.

  • media_type matters. Wrong media type (PNG declared as JPEG) → silent failure or weird output.

  • Base64 encoding adds about 33% to the size. Sending a 5MB image base64-encoded means a ~6.7MB request body.

  • URL-based image references must be reachable from Anthropic’s servers. Localhost won’t work; the image must be publicly accessible.

  • Privacy of image content. Anything sent to the API is processed by the model. Sensitive images (medical, personal docs) should be handled accordingly.

  • Description is more reliable than interpretation. “This shows a man holding a phone” is a description and usually accurate. “He looks angry” may be true or may be projection. Treat inferences beyond what’s visible as guesses.

  • Mixing many images can degrade quality. Sending 20 images in one prompt makes it harder for the model to keep them apart. It is often better to process them one at a time or use a workflow.

  • Caching images is harder than caching text. The hash matters; even tiny image differences invalidate cache. Reuse exact bytes.

  • PII in images. OCR a passport, get the passport number in the output. Be careful with what you ask to extract.

  • alt text vs OCR. Asking “describe this image” and “transcribe text in this image” give different outputs. Be explicit.

  • Audio transcription accuracy varies by accent and quality. A noisy recording of a strong accent in a non-mainstream language is hardest.

  • Real-time audio (speech in, speech out) has different latency characteristics than the request/response API model. Plan UX accordingly.

  • Video is just a series of frames to most models today. Don’t expect deep temporal reasoning unless you use a video-native model.

  • Vision models can hallucinate visual content. “This image shows a cat” — when there’s no cat. Verify if it matters.

  • Region-specific models. Some multimodal capabilities are unavailable in certain countries/regions. Check.
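
Two of the gotchas above, a mismatched media_type and base64 overhead, can be checked before a request is sent. A sketch using the well-known file signatures for the four common web image formats; `sniffMediaType` and `base64Length` are hypothetical helper names.

```typescript
// Detect the image format from its leading "magic" bytes, so the declared
// media_type can be derived from the content instead of the file extension.
function sniffMediaType(bytes: Uint8Array): string | null {
  const startsWith = (signature: number[], offset = 0): boolean =>
    signature.every((value, i) => bytes[offset + i] === value);

  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return "image/png";
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return "image/gif"; // "GIF8"
  // WebP: "RIFF" at byte 0 and "WEBP" at byte 8.
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return "image/webp";
  }
  return null; // unknown format: better to fail early than to guess
}

// Length of the base64 text for n raw bytes: 4 output characters per 3 input
// bytes, with the final group padded.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

console.log(sniffMediaType(new Uint8Array([0xff, 0xd8, 0xff, 0xe0]))); // image/jpeg
console.log(base64Length(5_000_000)); // 6666668, about 6.7 MB for a 5 MB image
```

Deriving `media_type` from the bytes removes a whole class of mislabeled uploads, for example a PNG saved with a .jpg extension.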


See also

  • What is an LLM? 🟩: underlying tech
  • How LLMs work 🟩: transformers handle modalities similarly
  • The Claude API 🟩 🟦: image input via the API
  • Claude models 🟩 🟦: which models support what
  • Tokens & context windows 🟩: image tokens are real tokens
  • Tool use 🟩: combining vision with action
  • Agents 🟩
  • Embeddings 🟩: multimodal embeddings exist too (CLIP, SigLIP)
  • RAG 🟩: can include image retrieval
  • MCP 🟩 🟦: computer-use and claude-in-chrome are vision-based MCP servers
  • Glossary: LLM, Token

Sources