🇺🇸 USA · Fireworks AI

Status: 🟩 COMPLETE 🟦 LIVING
Last updated: 2026-06-26
Plain-English tagline: Fast, cheap inference for open-weight AI models. Together AI's closest competitor: Llama / Mistral / DeepSeek / Qwen on Western infrastructure with sub-second response times.


Front-matter facts

| Field | Value |
|---|---|
| Vendor | Fireworks AI (San Francisco, USA) |
| Country / origin | 🇺🇸 USA |
| Recommended for Australian users? | ✅ Yes, fully accessible from Australia |
| Privacy summary | No training on customer data; HIPAA-eligible options |
| Free tier | Sign-up credit |
| Paid tiers | Pay-per-token; Enterprise quoted; dedicated deployment available |
| First released | 2022 |
| Last reviewed | 2026-06-26 |
| Official site | https://fireworks.ai |

What it is

Fireworks AI is a fast, cheap inference platform for open-weight AI models and Together AI's closest direct competitor. Founded by ex-Meta PyTorch engineers, Fireworks specialises in optimised inference (low latency and high throughput) for the most-used open-weight models.

Hosted models:

  • Llama (4 / 5 family + Code + Guard + Vision variants)
  • Mistral (Large, Codestral, Mixtral)
  • DeepSeek (open weights hosted on Western infrastructure, which is safer than calling the vendor's own service directly)
  • Qwen (likewise hosted on Western infrastructure)
  • Phi (Microsoft small models)
  • Yi, Gemma, Granite, plus 50+ others

Distinguishing features:

  • FireOptimizer: auto-tuning for fast inference per model
  • Speculative decoding for speed
  • Function calling support, standardised across models
  • Fine-tuning via LoRA or full fine-tunes
  • Dedicated deployment with reserved GPUs for enterprise
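
As a rough illustration of the function calling feature, the sketch below builds a request in the OpenAI-style `tools` schema. It assumes the OpenAI-compatible endpoint described under "How to use from Australia" accepts that schema unchanged, which this page does not confirm. The `get_weather` tool and both helper functions are hypothetical names invented for the example.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
# Assumption: the Fireworks endpoint accepts this format as-is.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function name
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request(model: str, user_message: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }


def parse_tool_arguments(arguments_json: str) -> dict:
    """Decode the JSON string a model returns as a tool call's arguments."""
    return json.loads(arguments_json)


request = build_tool_request(
    "accounts/fireworks/models/llama4-70b-instruct",
    "What's the weather in Sydney?",
)
print(json.dumps(request, indent=2))
```

Because the request is a plain dictionary, the same tool definition can be reused with a different `model` value, which is the practical benefit of a standardised function calling format.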

What you'd use it for

  • Production open-weight inference with low latency
  • Multi-model strategy: switch between Llama / Mistral / Qwen easily
  • Cheaper alternative to frontier APIs when open-weight quality is sufficient
  • Fine-tuning open-weight models on your data
  • Running Chinese open weights safely (DeepSeek, Qwen via US infrastructure)
  • Function calling with open-weight models in production
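
One way to picture the multi-model strategy: keep the request shape fixed and change only the model ID. This is a minimal sketch, not a recommended design. Only the Llama ID comes from this page; the Mistral and Qwen entries are placeholders to be replaced with real IDs from the Fireworks catalog, and `chat_request` is a hypothetical helper.

```python
# Short aliases mapped to full model IDs. Only the Llama ID appears on this
# page; the other two are placeholders, not real model names.
MODEL_IDS = {
    "llama": "accounts/fireworks/models/llama4-70b-instruct",
    "mistral": "accounts/fireworks/models/YOUR-MISTRAL-MODEL",  # placeholder
    "qwen": "accounts/fireworks/models/YOUR-QWEN-MODEL",  # placeholder
}


def chat_request(alias: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    if alias not in MODEL_IDS:
        raise ValueError(f"Unknown model alias: {alias!r}")
    return {
        "model": MODEL_IDS[alias],
        "messages": [{"role": "user", "content": prompt}],
    }


# The same prompt prepared for each model; only the "model" field differs.
for alias in MODEL_IDS:
    print(chat_request(alias, "Summarise this contract clause.")["model"])
```

With an OpenAI-compatible client already configured, a call would look like `client.chat.completions.create(**chat_request("llama", "Hello"))`.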

How to use from Australia

  1. Sign up at fireworks.ai. Free credit on sign-up.
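  2. Get an API key.
  3. Use the OpenAI-compatible endpoint:
    from openai import OpenAI

    # Point the standard OpenAI client at the Fireworks endpoint.
    client = OpenAI(
        api_key="...",  # your Fireworks API key
        base_url="https://api.fireworks.ai/inference/v1",
    )
    response = client.chat.completions.create(
        model="accounts/fireworks/models/llama4-70b-instruct",
        messages=[{"role": "user", "content": "Hello"}],
    )
    # Print the text of the first reply.
    print(response.choices[0].message.content)
  4. Pay by card; Australian cards are accepted.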

What it costs

Pay-per-token

  • Llama 4 70B: ~US$0.30 per million input/output tokens
  • Llama 4 405B: ~US$0.90 per million
  • DeepSeek: similar rates
  • Mistral Large: ~US$6 per million

Generally competitive with Together AI; both are significantly cheaper than frontier closed APIs.
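
To turn the per-token rates into a budget figure, the arithmetic is total tokens divided by one million, multiplied by the rate. The sketch below uses the approximate rates listed above, read as US dollars, and assumes input and output tokens are billed at the same rate. Actual pricing may differ, and `estimate_cost_usd` is a hypothetical helper.

```python
# Approximate rates from the list above, in US dollars per million tokens.
# Assumption: input and output tokens are billed at the same rate.
RATES_USD_PER_MILLION = {
    "Llama 4 70B": 0.30,
    "Llama 4 405B": 0.90,
    "Mistral Large": 6.00,
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in US dollars for a given token volume."""
    rate = RATES_USD_PER_MILLION[model]
    total_tokens = input_tokens + output_tokens
    return total_tokens / 1_000_000 * rate


# Example workload: 50 million input and 10 million output tokens per month.
for model in RATES_USD_PER_MILLION:
    cost = estimate_cost_usd(model, 50_000_000, 10_000_000)
    print(f"{model}: ~US${cost:.2f} per month")
```

At that volume the gap between the smaller Llama model and Mistral Large is a factor of twenty, so the choice of model usually matters more to the bill than the choice of provider.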

Dedicated deployments

  • Reserved GPUs for guaranteed latency
  • Monthly commitment

Fine-tuning

  • Per-token training pricing
  • LoRA or full fine-tuning

How it compares to alternatives

| Aspect | Fireworks AI | Together AI | Groq | Replicate |
|---|---|---|---|---|
| Model catalog | Curated open-weight | Broader open-weight | Curated (Llama-focused) | Vast (community) |
| Speed | Fast (FireOptimizer) | Fast | Fastest (LPU) | Variable |
| Price | Cheap | Cheap | Premium for speed | Per-second |
| Fine-tuning | Strong | Strong | Limited | Limited |
| Function calling | Strong standardisation | Yes | Yes | Yes |
| Best for | Production with fine-tuning | Production breadth | Speed-critical | Niche / experimental |

For production open-weight workloads with fine-tuning needs, Fireworks is a strong default.


Privacy / data handling

  • No training on customer data (a stated commitment)
  • HIPAA-eligible deployments for healthcare workloads
  • US data centres
  • Running Chinese open weights on Fireworks does not send your data to China (same as Together AI)

Recent changes

  • 2026: FireOptimizer 2 + Llama 5 family
  • 2025: HIPAA tier launched
  • 2024: Major catalog expansion

Gotchas

  • For frontier closed models, Fireworks can't help; use Anthropic / OpenAI / Google directly
  • For the widest model catalog, Hugging Face and Replicate are broader
  • For the fastest inference, Groq beats Fireworks
  • Fine-tuning costs scale with token count β€” budget carefully
