🇺🇸 United States · Anthropic Claude API

Status: 🟩 COMPLETE (🟦 LIVING: endpoints, pricing, and features evolve)
Last updated: 2026-06-28
Plain-English tagline: How to talk to Claude from your own code. You POST a JSON message; Claude returns a JSON response. Same model as claude.ai, just programmatic.

Vendor | Anthropic
Country/origin | 🇺🇸 United States (San Francisco)
Recommended for AUS? | ✅ Yes: strong privacy commitments; enterprise DPA addresses APP 8
Privacy summary | API content NOT used for model training; 30-day retention for abuse monitoring; Zero Data Retention available for enterprise; SOC 2 Type II; HIPAA capable
Free tier | ✅ $5 USD free credit on signup; pay-as-you-go after
Paid tiers | Pay-per-token; volume discounts; enterprise contracts
First released | March 2023 (Claude 1 API)
Last reviewed | June 2026
Official site | https://docs.anthropic.com

In plain English

The Claude API is how you call Claude from your own application instead of through claude.ai or Claude Code. You send an HTTP POST to Anthropic’s servers with your message and some config; Claude processes it; you get a response back.

This is what powers every Claude-based product — Claude Code is essentially a CLI wrapped around API calls. So is every customer support bot, research assistant, or “AI feature” inside another app that uses Claude underneath. If you ever want to build something with AI, the API is the door you’ll go through.

The good news: it’s a normal HTTP API. If you’ve ever called any REST API, you can call this one. You don’t need to know anything about machine learning. You write a POST request, you get JSON back. The hard part isn’t the API; it’s the prompt engineering and the surrounding system.

Anthropic provides official SDKs for Python, TypeScript, Java, Go, Ruby, and others — wrappers that handle the HTTP boilerplate for you. You can also just use fetch.


Why it matters

  • Building AI features. Any time your app needs Claude’s reasoning (chat, summarize, classify, extract, code-generate), you call the API.
  • Understanding what Claude Code is doing. Every tool call you see in Claude Code is an API call underneath. Knowing the shape helps you reason about cost, latency, and limits.
  • Reading any AI-product source code. Most “AI product” projects are wrappers around API calls. Understanding the wrapper layer lets you read those projects intelligently.
  • Cost intuitions. Once you’ve seen the request/response cycle, you understand why long prompts and long outputs cost what they cost — and can architect for low cost.

The endpoint and authentication

POST https://api.anthropic.com/v1/messages

Authentication is via an API key in headers:

x-api-key: sk-ant-...
anthropic-version: 2023-06-01
content-type: application/json

You generate API keys in the Anthropic Console (console.anthropic.com). Never put API keys in client-side code — they’d be visible to anyone. API calls go from your server, not the browser. (Same pattern as the Supabase service_role key.)
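As a rough illustration of the endpoint and headers above, the sketch below assembles a raw request without sending it. The helper name buildMessagesRequest and its types are hypothetical and not part of any SDK; only the URL and header names come from this section.

```typescript
// Hypothetical helper: assembles the URL, headers, and body for a raw call
// to the Messages endpoint. It sends nothing, so it is safe to run anywhere.
interface MessagesRequestBody {
  model: string;
  max_tokens: number;
  messages: { role: "user" | "assistant"; content: string }[];
}

interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildMessagesRequest(apiKey: string, body: MessagesRequestBody): PreparedRequest {
  return {
    url: "https://api.anthropic.com/v1/messages",
    init: {
      method: "POST",
      headers: {
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// Server-side usage (the key comes from an environment variable, never from
// client code):
//   const { url, init } = buildMessagesRequest(process.env.ANTHROPIC_API_KEY!, body);
//   const res = await fetch(url, init);
//   const json = await res.json();
```

Keeping request construction separate from sending makes the headers easy to unit test, and keeps the key handling in one server-side place.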


The Messages API — the request shape

The minimum useful request body:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "messages": [
    { "role": "user", "content": "Hello, Claude." }
  ]
}

Fields:

  • model — exact model version. Always pin in production. (claude-opus-4-8, claude-sonnet-4-6, claude-haiku-4-5, etc. — see Claude models.)
  • max_tokens — maximum length of the model’s response, in tokens. Required.
  • messages — the conversation as a list of role-tagged turns.

Optional but very common:

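  • system — a system prompt setting persona, rules, and context. It applies to every turn in the request, but because the API is stateless you must re-send it with each call.
  • temperature — randomness, from 0 (most predictable, though not strictly deterministic) to 1 (the default). See Temperature & sampling.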
  • tools — tool definitions, when you want Claude to be able to call tools. See Tool use.
  • stream — if true, response streams incrementally.

A richer request:

{
  "model": "claude-opus-4-8",
  "max_tokens": 2048,
  "system": "You are a friendly customer support agent. Be concise and warm.",
  "messages": [
    { "role": "user", "content": "How do I reset my password?" },
    { "role": "assistant", "content": "Sure! Go to..." },
    { "role": "user", "content": "What if I don't have access to my email?" }
  ],
  "temperature": 0.7
}

The messages array is your conversation history. The API is stateless — every call you make includes the entire prior conversation. The model doesn’t remember between calls; your code keeps the history and re-sends it.
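One way to picture the statelessness described above is a small in-memory history that is re-sent in full on every call. The Conversation class below is a hypothetical sketch, not an SDK type; a real application would likely persist the turns in a database.

```typescript
type Role = "user" | "assistant";
interface Turn { role: Role; content: string }

// Hypothetical in-memory conversation store (illustrative only).
class Conversation {
  private turns: Turn[] = [];

  addUser(text: string): void {
    this.turns.push({ role: "user", content: text });
  }

  addAssistant(text: string): void {
    this.turns.push({ role: "assistant", content: text });
  }

  // Every request body carries the whole history, because the API keeps none.
  toRequestBody(model: string, maxTokens: number, system?: string) {
    return {
      model,
      max_tokens: maxTokens,
      ...(system !== undefined ? { system } : {}),
      messages: this.turns.map((t) => ({ ...t })), // copies, so callers cannot mutate history
    };
  }
}

// Usage: record each turn, then build the body for the next call.
const chat = new Conversation();
chat.addUser("How do I reset my password?");
chat.addAssistant("Sure! Go to...");
chat.addUser("What if I don't have access to my email?");
const nextBody = chat.toRequestBody("claude-opus-4-8", 2048, "Be concise and warm.");
```

After each response, the assistant's reply is appended with addAssistant so the following call includes it.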


The response shape

{
  "id": "msg_01ABC...",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-6",
  "content": [
    { "type": "text", "text": "Hello! Glad to help — what's on your mind?" }
  ],
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 12,
    "output_tokens": 14
  }
}

Fields:

  • content — an array of content blocks (text, tool calls, etc.). For a plain text response, it’s just one text block.
  • stop_reason — "end_turn" (model finished naturally), "max_tokens" (hit your limit), "tool_use" (model wants to call a tool), "stop_sequence" (hit a configured stop string).
  • usage — exact token counts for this call. This is what you’re billed on.
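The stop_reason values listed above usually drive what the calling code does next. The sketch below maps each one to an application-level action; the action names are hypothetical, and what each one triggers is up to your application.

```typescript
// The four stop reasons described in this section.
type StopReason = "end_turn" | "max_tokens" | "tool_use" | "stop_sequence";

// Hypothetical action labels for the calling code.
type NextStep = "done" | "warn_truncated" | "run_tools";

function nextStepFor(stopReason: StopReason): NextStep {
  switch (stopReason) {
    case "end_turn":
    case "stop_sequence":
      return "done"; // the response is complete
    case "max_tokens":
      return "warn_truncated"; // output was cut off at the cap
    case "tool_use":
      return "run_tools"; // execute the requested tool and call again
  }
}
```

Treating "max_tokens" separately matters because a truncated answer can look complete at a glance.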

Cost = input_tokens × input_price + output_tokens × output_price. See Tokens & context windows and Claude models for prices.
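The cost formula can be sketched as a small function over the usage field. The prices below are placeholders chosen only to make the arithmetic visible, not real list prices, and the names estimateCostUsd and Pricing are hypothetical.

```typescript
interface Usage { input_tokens: number; output_tokens: number }
interface Pricing { inputPerMillionUsd: number; outputPerMillionUsd: number }

// cost = input_tokens × input_price + output_tokens × output_price,
// with prices expressed per million tokens.
function estimateCostUsd(usage: Usage, pricing: Pricing): number {
  const inputCost = (usage.input_tokens / 1e6) * pricing.inputPerMillionUsd;
  const outputCost = (usage.output_tokens / 1e6) * pricing.outputPerMillionUsd;
  return inputCost + outputCost;
}

// PLACEHOLDER prices for illustration only; look up the real ones.
const examplePricing: Pricing = { inputPerMillionUsd: 3, outputPerMillionUsd: 15 };

// The usage from the sample response above: 12 input tokens, 14 output tokens.
const cost = estimateCostUsd({ input_tokens: 12, output_tokens: 14 }, examplePricing);
```

With these placeholder prices the output side dominates even though the token counts are similar, which is typical when output tokens are priced higher than input tokens.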


A concrete TypeScript example

The simplest possible:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from env

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "What is the capital of France?" }]
});

// content is an array of typed blocks, so check the type before reading .text
const first = response.content[0];
if (first?.type === "text") {
  console.log(first.text);
}
// → "The capital of France is Paris."

That’s it. With the SDK, the API call is one function. Without the SDK, it’s a fetch with the right headers and body — also straightforward.


Streaming responses

For chat interfaces, you don’t want to wait for the entire response before showing anything. Streaming sends each chunk as it’s generated:

const stream = await client.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Write me a haiku about programming." }]
});
 
for await (const chunk of stream) {
  if (chunk.type === "content_block_delta" && chunk.delta.type === "text_delta") {
    process.stdout.write(chunk.delta.text);
  }
}

Streaming doesn’t reduce cost — you pay for the tokens whether you stream them or buffer. It just improves perceived latency.
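To show the event filtering in isolation, the sketch below accumulates text from a list of stream events without any network call. The event shapes are simplified, hypothetical stand-ins that include only the fields this example reads; the SDK's real event types carry more.

```typescript
// Simplified event shapes (assumption: only the fields used here).
type StreamEvent =
  | { type: "message_start" }
  | {
      type: "content_block_delta";
      delta:
        | { type: "text_delta"; text: string }
        | { type: "input_json_delta"; partial_json: string };
    }
  | { type: "message_delta" }
  | { type: "message_stop" };

// Concatenate only the text deltas; every other event is ignored.
function collectText(events: readonly StreamEvent[]): string {
  let text = "";
  for (const event of events) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      text += event.delta.text;
    }
  }
  return text;
}

// Usage with a hand-written event sequence:
const sampleEvents: StreamEvent[] = [
  { type: "message_start" },
  { type: "content_block_delta", delta: { type: "text_delta", text: "Hel" } },
  { type: "content_block_delta", delta: { type: "text_delta", text: "lo" } },
  { type: "message_delta" },
  { type: "message_stop" },
];
const streamed = collectText(sampleEvents);
```

A chat UI would append each delta to the screen as it arrives rather than collecting them first; the filtering condition is the same.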


Tool use in API calls

To let the model call tools, pass tools and handle the response loop. Outline:

// Define the tools once so both calls can share them.
// (Anthropic and client come from the earlier example.)
const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description: "Get current weather for a city",
    input_schema: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"]
    }
  }
];

const question = "What's the weather in Sydney?";

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: question }]
});

// Find the tool_use block, if the model asked for a tool
const call = response.content.find(
  (b): b is Anthropic.ToolUseBlock => b.type === "tool_use"
);

if (response.stop_reason === "tool_use" && call) {
  const result = await getWeather(call.input); // your function

  // Continue the conversation with the result
  const followUp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    tools,
    messages: [
      { role: "user", content: question },
      { role: "assistant", content: response.content },
      {
        role: "user",
        content: [{ type: "tool_result", tool_use_id: call.id, content: JSON.stringify(result) }]
      }
    ]
  });
}

Full pattern in Tool use.
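The outline above handles a single tool round. A more general loop might look like the sketch below, which repeats until the model stops asking for tools. It is written against an injected callModel function so it runs without network access; in a real app that function would wrap client.messages.create. All type and function names here are hypothetical simplifications, not SDK types.

```typescript
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };
type Block = TextBlock | ToolUseBlock;
type Message =
  | { role: "user"; content: string | ToolResultBlock[] }
  | { role: "assistant"; content: Block[] };
interface ModelReply { stop_reason: string; content: Block[] }

// Injected dependencies (assumptions of this sketch).
type CallModel = (messages: Message[]) => Promise<ModelReply>;
type ToolHandler = (input: unknown) => Promise<unknown>;

async function runToolLoop(
  callModel: CallModel,
  handlers: Record<string, ToolHandler>,
  messages: Message[],
  maxRounds = 5, // guard against loops that never terminate
): Promise<ModelReply> {
  const history = [...messages];
  for (let round = 0; round < maxRounds; round++) {
    const reply = await callModel(history);
    if (reply.stop_reason !== "tool_use") return reply;

    // Echo the assistant turn back, then answer every tool_use block in it.
    history.push({ role: "assistant", content: reply.content });
    const results: ToolResultBlock[] = [];
    for (const block of reply.content) {
      if (block.type !== "tool_use") continue;
      const handler = handlers[block.name];
      try {
        if (!handler) throw new Error(`Unknown tool: ${block.name}`);
        const output = await handler(block.input);
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: JSON.stringify(output) ?? "null",
        });
      } catch (err) {
        // Report failures to the model instead of throwing.
        results.push({ type: "tool_result", tool_use_id: block.id, content: String(err), is_error: true });
      }
    }
    history.push({ role: "user", content: results });
  }
  throw new Error(`Tool loop did not finish within ${maxRounds} rounds`);
}
```

The maxRounds cap is a design choice rather than an API feature: it bounds cost if the model keeps requesting tools.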


Vision (image input)

Pass image content blocks in user messages:

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: [
      { type: "image", source: { type: "base64", media_type: "image/png", data: "..." } },
      { type: "text", text: "What's in this image?" }
    ]
  }]
});

Images count toward input tokens (roughly proportional to resolution). See Multimodal.
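To get a feel for how resolution drives image cost, the sketch below estimates tokens from pixel dimensions and shows the effect of downscaling. The divisor of 750 pixels per token is an assumption based on commonly cited guidance and may not match current billing; treat the result as a rough estimate and check the official vision documentation. Both function names are hypothetical.

```typescript
// ASSUMPTION: tokens ≈ (width × height) / 750. This is a rule of thumb only.
const ASSUMED_PIXELS_PER_TOKEN = 750;

function estimateImageTokens(widthPx: number, heightPx: number): number {
  return Math.ceil((widthPx * heightPx) / ASSUMED_PIXELS_PER_TOKEN);
}

// Scale dimensions down (never up) so the longest side fits maxSidePx.
function fitWithin(
  widthPx: number,
  heightPx: number,
  maxSidePx: number,
): { width: number; height: number } {
  const scale = Math.min(1, maxSidePx / Math.max(widthPx, heightPx));
  return { width: Math.round(widthPx * scale), height: Math.round(heightPx * scale) };
}

// Usage: compare the estimate before and after halving a 2000×1000 image.
const original = estimateImageTokens(2000, 1000);
const resized = fitWithin(2000, 1000, 1000);
const afterResize = estimateImageTokens(resized.width, resized.height);
```

Because the estimate scales with area, halving both sides cuts the token count to roughly a quarter.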


Prompt caching

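For repeated long prompts (system prompt + retrieved docs), mark the stable parts as cacheable. Cached portions that are read back on later calls are billed at roughly 10% of the normal input price, a saving of about 90%. The call that first writes the cache costs somewhat more than a normal call, so caching pays off only when the same prefix is reused before the cache entry expires.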

const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are an expert legal researcher...",
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: "Question about Australian contract law..." }]
});

Read Tokens & context windows for the caching strategy section.
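A back-of-the-envelope comparison can show when caching pays off. The multipliers below (1.25× for a cache write, 0.1× for a cache read) and the price are assumptions for illustration; check the current pricing page before relying on them. The function name prefixCostUsd is hypothetical.

```typescript
// ASSUMED multipliers, for illustration only.
const CACHE_WRITE_MULTIPLIER = 1.25; // writing a prefix to the cache costs a bit more
const CACHE_READ_MULTIPLIER = 0.1; // reading it back costs about 90% less

// Cost of sending the same prefix `calls` times, with or without caching.
function prefixCostUsd(
  prefixTokens: number,
  calls: number,
  inputPerMillionUsd: number,
  cached: boolean,
): number {
  const base = (prefixTokens / 1e6) * inputPerMillionUsd; // one uncached send
  if (!cached || calls === 0) return base * calls;
  // First call writes the cache; the remaining calls read from it.
  // Assumes every later call arrives before the cache entry expires.
  return base * CACHE_WRITE_MULTIPLIER + base * CACHE_READ_MULTIPLIER * (calls - 1);
}

// Usage: a 10,000-token prefix sent 10 times at a placeholder $3 per million.
const uncachedCost = prefixCostUsd(10000, 10, 3, false);
const cachedCost = prefixCostUsd(10000, 10, 3, true);
```

Under these assumptions a single call is more expensive with caching than without, so the benefit only appears from the second reuse onward.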


Rate limits and error handling

The API has tier-based rate limits — requests per minute, tokens per minute, tokens per day. New keys start at lower tiers; you graduate as your usage builds up history.

Common error responses:

HTTP code | Meaning | What to do
400 | Bad request: malformed JSON, missing field, invalid model | Fix the request
401 | Unauthorized: bad API key | Check key
403 | Forbidden: billing issue, blocked region | Check account
404 | Endpoint not found | Check URL/version
429 | Rate limited: too many requests | Back off and retry
500 | Server error on Anthropic’s side | Retry with exponential backoff
529 | Overloaded: temporary capacity issue | Retry with backoff

For production, wrap calls with retry-with-backoff for 429s and 5xx errors. The official SDKs include this by default.
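For code that bypasses the SDK, retry-with-backoff could be sketched as below. This is a generic pattern, not the SDK's implementation: the names withBackoff and isRetryable, the attempt count, and the delays are all assumptions, and the error is assumed to expose a numeric status field.

```typescript
// Assumed error shape: an object with an optional HTTP status.
interface HttpLikeError { status?: number }

// Retry on rate limits (429) and server-side errors (5xx, including 529).
function isRetryable(status: number | undefined): boolean {
  return status === 429 || (status !== undefined && status >= 500);
}

async function withBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
  // sleep is injectable so tests can run without waiting
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const status = (err as HttpLikeError | null)?.status;
      // Give up on the last attempt or on errors that a retry cannot fix.
      if (attempt >= maxAttempts || !isRetryable(status)) throw err;
      // Exponential backoff with jitter: 500ms, 1s, 2s, ... plus up to 100ms.
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await sleep(delay);
    }
  }
}

// Usage (sendRequest is a hypothetical function that throws { status } on failure):
//   const json = await withBackoff(() => sendRequest(body));
```

The jitter spreads out retries from many clients so they do not all hit the API again at the same instant.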


Best practices for production

A few patterns that come up consistently:

1. Pin model versions

Use exact versions (claude-opus-4-8), never aliases (claude-opus), so that a model swap cannot silently change your app’s behavior.

2. Cap max_tokens thoughtfully

Setting max_tokens: 4096 for tasks that need 50 tokens doesn’t cost extra (you pay for what’s generated), but the cap should match a reasonable upper bound for your task.

3. Use the system prompt for stable instructions

Static rules go in system. They’re cached if you use prompt caching. The model treats system content with higher importance.

4. Keep conversation history short

For long chats, periodically summarize older messages and replace them with the summary. The context window is finite.

5. Stream for interactive UX

Any UI showing the response should stream. Don’t make users stare at a spinner for 8 seconds.

6. Log usage for monitoring

The usage field tells you cost per call. Log it. You’ll quickly see which features are expensive and which aren’t.

7. Handle tool errors gracefully

When a tool fails, return the error as a tool_result. Don’t throw. Let the model decide whether to retry, try differently, or report.

8. Use the SDK

Even for simple calls. SDKs handle retries, streaming, type safety, edge cases.
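Practice 4 (keeping history short) can be sketched as a function that swaps older turns for a summary. The summary text is assumed to come from a separate summarization step, for example another model call, which is not shown; compactHistory and Turn are hypothetical names.

```typescript
interface Turn { role: "user" | "assistant"; content: string }

// Replace everything except the last `keepLast` turns with one summary turn.
// `summary` is supplied by the caller (assumption: produced elsewhere).
function compactHistory(turns: readonly Turn[], keepLast: number, summary: string): Turn[] {
  if (turns.length <= keepLast) return [...turns]; // nothing to compact
  const recent = turns.slice(turns.length - keepLast);
  // The summary goes in as a user turn so the history still starts with "user".
  // This sketch assumes back-to-back user turns are acceptable to the API.
  const summaryTurn: Turn = {
    role: "user",
    content: `Summary of the earlier conversation: ${summary}`,
  };
  return [summaryTurn, ...recent];
}

// Usage: keep the last two turns of a six-turn chat.
const longChat: Turn[] = [
  { role: "user", content: "q1" },
  { role: "assistant", content: "a1" },
  { role: "user", content: "q2" },
  { role: "assistant", content: "a2" },
  { role: "user", content: "q3" },
  { role: "assistant", content: "a3" },
];
const compacted = compactHistory(longChat, 2, "User asked q1 and q2; both were answered.");
```

How many turns to keep is a trade-off: more recent turns preserve detail, fewer turns save input tokens on every call.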


Alternative interfaces — when not to use the API directly

  • Claude.ai — for casual use, document analysis, writing help. No code.
  • Claude Code — for coding work. Wraps the API with file system tools, memory, slash commands, hooks.
  • Anthropic SDK + cloud orchestration — for production agents at scale.
  • Bedrock / Vertex AI — Claude is also available through AWS Bedrock and Google Vertex AI, which can simplify enterprise procurement and compliance.

For “build a custom AI feature in my app,” the API is the path. For “I want to use AI to help me do X today,” claude.ai or Claude Code is usually faster.


Common gotchas

  • API key in client code. Catastrophic. Anyone visiting your site can read it. API calls go server-side only.

  • Forgetting max_tokens. It’s required. Omitting it returns an error.

  • Model alias drift. Using "claude-opus" (alias) instead of "claude-opus-4-8" (exact) means Anthropic can swap you to a newer model silently. Fine for casual use; bad for production reproducibility.

  • Forgetting to send the full history. The API is stateless. If your second call only includes message 2, the model has no memory of message 1. Build the full conversation each time.

  • Confusing system prompt with first user message. A system prompt and a first user message produce very different behaviors. The system prompt is “what you always do”; user messages are “the actual conversation.”

  • Streaming gotcha: parse the deltas correctly. Streaming yields different event types (message_start, content_block_delta, message_delta, message_stop). Only content_block_delta events with delta.type === "text_delta" contain text tokens.

  • max_tokens is the maximum, not a target. The model may produce shorter responses if it finishes naturally.

  • Context window includes both input and output. A 200K window means input + reserved output ≤ 200K. If your prompt is 195K, you only have 5K for the response.

  • Vision tokens are real and significant. A high-res image can cost hundreds of tokens, sometimes more than a thousand, before any text. Resize before sending if quality allows.

  • Tool use is multi-round. A single API call may not “complete” the task. You handle the loop yourself: call → tool_use response → execute tool → call again with results → repeat.

  • Rate limits surprise. Suddenly hitting 429s in production usually means traffic exceeded a tier limit. Plan for graduated tiers as you scale.

  • Cost surprises. A subtle change (longer system prompt, more retrieved context, agent loops that don’t terminate) can multiply costs. Monitor.

  • Hardcoded prompts in code. Putting prompts directly in source means changing the prompt requires a deploy. Consider moving prompts to a config file or DB.

  • No built-in conversation memory. If you want chat persistence across sessions, you store the conversation yourself (in a DB, in localStorage, etc.) and re-send on each call.

  • Tool schemas only constrain what you spell out. The model follows the numeric ranges, enum values, and required fields you define, and nothing beyond them. Vague schemas produce vague outputs, so validate tool inputs in your own code before acting on them.

  • response.content[0].text assumes the response has text content. With tool use, the first content block may be a tool_use. Iterate or check types.
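For the last gotcha above, one defensive approach is to collect text from every text block rather than assuming the first block is text. The sketch uses a simplified, hypothetical ContentBlock type that covers only text and tool_use blocks; the SDK's real union has more members.

```typescript
// Simplified block types (assumption: only the two kinds discussed here).
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown };

// Join the text of all text blocks; returns "" when there are none.
function extractText(content: readonly ContentBlock[]): string {
  let out = "";
  for (const block of content) {
    if (block.type === "text") out += block.text;
  }
  return out;
}

// Usage: a response whose first block is a tool call, not text.
const mixed: ContentBlock[] = [
  { type: "tool_use", id: "toolu_1", name: "get_weather", input: { city: "Sydney" } },
  { type: "text", text: "Checking now." },
];
const visibleText = extractText(mixed);
```

Returning an empty string for tool-only responses lets the caller decide whether that is an error or simply a cue to run the tool loop.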

