🇺🇸 USA · Fireworks AI

Status: 🟩 COMPLETE 🟦 LIVING
Last updated: 2026-06-26
Plain-English tagline: Fast, cheap inference for open-weight AI models. Together AI’s closest competitor: Llama / Mistral / DeepSeek / Qwen on Western infrastructure with sub-second response times.

Front-matter facts

Field	Value
Vendor	Fireworks AI (San Francisco, USA)
Country / origin	🇺🇸 USA
Recommended for Australian users?	✅ Yes — fully accessible from AUS
Privacy summary	No training on customer data; HIPAA-eligible options
Free tier	Sign-up credit
Paid tiers	Pay-per-token; Enterprise quoted; dedicated deployment available
First released	2022
Last reviewed	2026-06-26
Official site	https://fireworks.ai

What it is

Fireworks AI is a fast, cheap inference platform for open-weight AI models — Together AI’s closest direct competitor. Founded by ex-Meta PyTorch engineers, Fireworks specialises in optimised inference (low-latency + high-throughput) for the most-used open-weight models.

Hosted models:

Llama (4 / 5 family + Code + Guard + Vision variants)
Mistral (Large, Codestral, Mixtral)
DeepSeek (open weights hosted on Western infrastructure, so prompts stay off DeepSeek’s own servers)
Qwen (likewise hosted on Western infrastructure)
Phi (Microsoft small models)
Yi, Gemma, Granite, plus 50+ others

Distinguishing features:

FireOptimizer — auto-tuning for fast inference per model
Speculative decoding for speed
Function calling support standardised across models
Fine-tuning with LoRA / full fine-tunes
Dedicated deployment with reserved GPUs for enterprise
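
To make the function-calling feature concrete, here is a minimal Python sketch. It assumes Fireworks accepts the OpenAI-style `tools` schema through its OpenAI-compatible endpoint; the `get_weather` tool, the helper names, and the mocked response are all hypothetical, and no network call is made.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    """Stand-in implementation; a real tool would query a weather service."""
    return f"Sunny in {city}"


# Map tool names to local Python callables.
HANDLERS = {"get_weather": get_weather}


def build_request(model: str, user_message: str) -> dict:
    """Build a chat-completions payload that advertises the tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
    }


def run_tool_calls(assistant_message: dict) -> list[str]:
    """Execute each tool call the model requested and collect the results."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        name = call["function"]["name"]
        # Arguments arrive as a JSON-encoded string in the OpenAI-style format.
        args = json.loads(call["function"]["arguments"])
        results.append(HANDLERS[name](**args))
    return results


# Mocked assistant message, shaped like an OpenAI-style tool-call response.
mock_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Sydney"}'},
        }
    ],
}
print(run_tool_calls(mock_message))
```

Because the request and response shapes stay the same across models, the dispatch code should not need to change when the model does, assuming the chosen model supports tool calls.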

What you’d use it for

Production open-weight inference with low latency
Multi-model strategy — switch between Llama / Mistral / Qwen easily
Cheaper alternative to frontier APIs when open-weight quality is sufficient
Fine-tuning open-weight models on your data
Running Chinese open weights safely (DeepSeek, Qwen via US infrastructure)
Function calling with open-weight models in production
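
A multi-model strategy can be as simple as a lookup table, since only the model field of the request changes. The sketch below is illustrative: the task names and helper functions are hypothetical, and two of the three model IDs are placeholders to be replaced with real IDs from the Fireworks catalog.

```python
# Task -> model ID. Only the Llama ID comes from this page's example;
# the other two are placeholders to replace with IDs from the Fireworks catalog.
MODEL_FOR_TASK = {
    "chat": "accounts/fireworks/models/llama4-70b-instruct",
    "code": "accounts/fireworks/models/<your-code-model-id>",
    "multilingual": "accounts/fireworks/models/<your-qwen-model-id>",
}
DEFAULT_TASK = "chat"


def pick_model(task: str) -> str:
    """Return the model ID for a task, falling back to the default task."""
    return MODEL_FOR_TASK.get(task, MODEL_FOR_TASK[DEFAULT_TASK])


def build_request(task: str, prompt: str) -> dict:
    """Build a chat-completions payload; only the "model" field varies by task."""
    return {
        "model": pick_model(task),
        "messages": [{"role": "user", "content": prompt}],
    }


for task in ("chat", "code", "unknown-task"):
    print(task, "->", build_request(task, "Hello")["model"])
```

Keeping the mapping in one place means a model swap is a one-line change rather than an edit at every call site.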

How to use from Australia

Sign up at fireworks.ai. Free credit on sign-up.
Get API key

Use OpenAI-compatible endpoint:

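import os

from openai import OpenAI

# Point the standard OpenAI client at Fireworks' OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],  # assumed env var name; holds your key
    base_url="https://api.fireworks.ai/inference/v1",
)
response = client.chat.completions.create(
    model="accounts/fireworks/models/llama4-70b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
)
# Print the text of the first completion.
print(response.choices[0].message.content)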

Australian-issued cards are accepted

What it costs

Pay-per-token

Llama 4 70B: ~US$0.20 / $0.30 per million input / output tokens
Llama 4 405B: ~US$0.90 / $0.90 per million input / output tokens
DeepSeek: similar rates
Mistral Large: ~US$2 / $6 per million input / output tokens

Generally competitive with Together AI — both significantly cheaper than frontier closed APIs.
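
To turn per-million rates into a budget, multiply token counts by the rate and divide by one million. The sketch below uses the approximate rates quoted in this section; the dictionary keys are informal labels rather than API model IDs, and the workload figures are made up for illustration.

```python
# Approximate US$ per million tokens (input, output), as quoted in this section.
# Keys are informal labels, not Fireworks model IDs.
RATES_USD_PER_MILLION = {
    "llama-4-70b": (0.20, 0.30),
    "llama-4-405b": (0.90, 0.90),
    "mistral-large": (2.00, 6.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate spend in US$ for the given input and output token counts."""
    in_rate, out_rate = RATES_USD_PER_MILLION[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Illustrative workload: 10,000 requests a day, 1,500 input + 500 output tokens each.
requests_per_day = 10_000
daily_input = requests_per_day * 1_500
daily_output = requests_per_day * 500

for model in RATES_USD_PER_MILLION:
    cost = estimate_cost_usd(model, daily_input, daily_output)
    print(f"{model}: ~US${cost:.2f} per day")
```

For this workload the estimate comes to about US$4.50 a day on the 70B rate and about US$60 a day on the Mistral Large rate, so model choice dominates the bill.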

Dedicated deployments

Reserved GPUs for guaranteed latency
Monthly commitment

Fine-tuning

Per-token training pricing
LoRA or full fine-tuning
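
A rough fine-tuning budget can be sketched the same way. This assumes billed training tokens equal dataset tokens multiplied by the number of epochs, which may not match how Fireworks actually meters training; the rate below is a hypothetical placeholder, not a published price.

```python
def estimate_finetune_cost_usd(
    dataset_tokens: int, epochs: int, usd_per_million_tokens: float
) -> float:
    """Estimate training spend, assuming billing on dataset_tokens * epochs."""
    if dataset_tokens < 0 or epochs < 1:
        raise ValueError("dataset_tokens must be >= 0 and epochs must be >= 1")
    return dataset_tokens * epochs * usd_per_million_tokens / 1_000_000


# Hypothetical rate for illustration only; check the current price list.
HYPOTHETICAL_RATE_USD_PER_MILLION = 0.50

cost = estimate_finetune_cost_usd(20_000_000, 3, HYPOTHETICAL_RATE_USD_PER_MILLION)
print(f"~US${cost:.2f}")
```

Under these assumptions, a 20-million-token dataset trained for 3 epochs bills 60 million tokens, so doubling either the dataset or the epoch count doubles the cost.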

How it compares to alternatives

Aspect	Fireworks AI	Together AI	Groq	Replicate
Model catalog	Curated open-weight	Broader open-weight	Curated (Llama-focused)	Vast (community)
Speed	Fast (FireOptimizer)	Fast	Fastest (LPU)	Variable
Price	Cheap	Cheap	Premium for speed	Per-second
Fine-tuning	Strong	Strong	Limited	Limited
Function calling	Strong standardisation	Yes	Yes	Yes
Best for	Production with fine-tuning	Production breadth	Speed-critical	Niche / experimental

For production open-weight workloads with fine-tuning needs, Fireworks is a strong default.

Privacy / data handling

No training on customer data — committed
HIPAA-eligible deployments for healthcare workloads
US data centres
Chinese open-weight models (DeepSeek, Qwen) run on US infrastructure, so your data does not go to China (same as Together AI)

Recent changes

2026: FireOptimizer 2 + Llama 5 family
2025: HIPAA tier launched
2024: Major catalog expansion

Gotchas

For frontier closed models, Fireworks can’t help; go to Anthropic / OpenAI / Google directly
For the widest model catalog, Hugging Face and Replicate are broader
For the fastest inference, Groq beats Fireworks
Fine-tuning costs scale with token count, so budget carefully
