AI Hardware Overview — The Chips Behind Artificial Intelligence

Status: 🟩 COMPLETE 🟦 LIVING Tags: AI-hardware, GPU, TPU, NPU, NVIDIA, AMD, Apple-Silicon, Groq, AI-chips

Why hardware matters for AI

AI models — particularly large language models and image generators — require enormous amounts of mathematical calculation. The hardware that performs these calculations determines:

How fast AI runs (response speed)
How expensive AI is to operate (costs passed to users)
Who can run AI (only companies with massive server farms, or everyone on their laptop)
The environmental footprint (energy use)

Understanding the hardware landscape helps you make sense of why some AI is expensive and slow, why “local AI” is becoming possible, and why NVIDIA is currently one of the world’s most valuable companies.

The main types of AI hardware

GPUs (Graphics Processing Units) — the dominant AI chip

GPUs were originally designed for rendering video game graphics. Graphics requires calculating millions of pixels simultaneously — the GPU does this with thousands of tiny parallel processing cores (vs a CPU’s handful of powerful cores).

This “massively parallel” architecture turns out to be exactly what AI training and inference need: both apply the same mathematical operation (matrix multiplication) to millions of numbers at once.
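To make the parallelism concrete, here is a minimal pure-Python sketch of matrix multiplication. The function names are illustrative and not taken from any library. Each output cell depends only on one row of the first matrix and one column of the second, so all cells could in principle be computed at the same time. That independence is the property GPU hardware exploits; real frameworks use optimised GPU kernels rather than Python loops.

```python
def output_cell(a, b, row, col):
    """Compute one cell of A @ B: the dot product of a row of A and a column of B."""
    return sum(a[row][k] * b[k][col] for k in range(len(b)))


def matmul(a, b):
    """Multiply two matrices given as lists of rows.

    No cell depends on any other cell, so the cells could be computed in any
    order, or all at once. A GPU hands cells (or tiles of cells) to its many
    parallel cores; this sketch simply visits them one by one.
    """
    rows, cols = len(a), len(b[0])
    return [[output_cell(a, b, r, c) for c in range(cols)] for r in range(rows)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))  # [[19, 22], [43, 50]]
```

A large model repeats this operation on matrices with thousands of rows and columns, which is why hardware with thousands of simple cores outperforms a few fast ones.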

NVIDIA dominates this market. The NVIDIA H100 and H200 GPUs are the standard for AI training; the new NVIDIA Blackwell B200 (2024–2025) is the current state of the art. These cards cost $25,000–$40,000 each and are often backordered for months.

Why NVIDIA? It is not just the hardware. NVIDIA’s CUDA software platform (first announced in 2006, long before the AI boom) created an enormous software ecosystem. Every major AI framework (PyTorch, TensorFlow) is built and tuned first for CUDA. Moving to AMD or Intel GPUs for AI often means porting or re-tuning software, a significant barrier that works in NVIDIA’s favour.

AMD GPUs: AMD’s Instinct MI300X is a legitimate competitor to NVIDIA H100 for some workloads. AMD has been narrowing the gap. ROCm (AMD’s CUDA equivalent) is improving. But NVIDIA maintains a significant software and ecosystem lead.

TPUs (Tensor Processing Units) — Google’s custom AI chip

Google built its own custom chip specifically for AI workloads, optimised for the mathematical operations AI models need. Key points about TPUs:

Used to train Google’s own models (Gemini was trained on TPUs)
Available to developers through Google Cloud (TPU VMs)
More energy-efficient than GPUs on equivalent workloads for certain models

v5e and v5p TPUs (2023–2024): A recent generation, since followed by Trillium (v6e). TPUs are not available for purchase, only via Google Cloud.

NPUs (Neural Processing Units) — AI chips in your devices

An NPU is a small, power-efficient chip embedded in consumer devices (phones, laptops) specifically to run AI inference locally — on the device, without sending data to a server.

Apple Neural Engine: In every Apple Silicon chip (M1, M2, M3, M4, A15, A16, A17, A18). Powers Face ID, Siri, on-device translation, Pixelmator Pro’s ML features. Very efficient; highly optimised for Apple’s specific models.
Qualcomm Hexagon NPU: In Snapdragon chips; powers Android on-device AI. Qualcomm Snapdragon 8 Gen 3 is a leader.
Intel NPU: In Intel Core Ultra processors (Meteor Lake onwards). Later generations such as Lunar Lake meet the NPU performance Microsoft requires for Copilot+ PC features.
AMD XDNA NPU: In AMD Ryzen AI processors. Part of the Windows AI PC ecosystem.

NPUs enable: on-device AI (no internet required), privacy (data stays on device), low latency (no network round trip), low power consumption.

Groq LPU (Language Processing Unit)

Groq built a different architecture called an LPU, designed specifically for inference (running trained models) rather than training. Instead of relying on external memory as GPUs do, the LPU keeps model data in fast on-chip memory and executes on a deterministic, compiler-scheduled pipeline, which gives it very high and predictable memory bandwidth.

The result: extremely fast text generation — Groq’s LPU generates tokens at 500–800+ tokens per second (vs typical 100–200 for GPU-based inference). This translates to near-instant response from AI chatbots.
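A quick back-of-the-envelope calculation shows what those token rates mean for a user waiting on a reply. The 400-token reply length is an assumption chosen for illustration (roughly a few paragraphs), and the sketch ignores network delay and prompt processing time.

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate a reply, ignoring network and prompt-processing delays."""
    return num_tokens / tokens_per_second


reply_tokens = 400  # assumed reply length, for illustration only

scenarios = [
    ("GPU-based, 100 tok/s", 100),
    ("GPU-based, 200 tok/s", 200),
    ("LPU, 500 tok/s", 500),
    ("LPU, 800 tok/s", 800),
]
for label, speed in scenarios:
    print(f"{label}: {generation_seconds(reply_tokens, speed):.1f} s")
```

Under these assumptions the same reply takes 2 to 4 seconds at typical GPU speeds and well under a second at the quoted LPU speeds.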

Groq offers a cloud API service using their LPUs. As of 2026, they support Llama, Mixtral, and Gemma models.

Cerebras WSE (Wafer Scale Engine)

See cerebras for full detail. The WSE is a single dinner-plate-sized chip, the world’s largest, with 4 trillion transistors. Its enormous on-chip memory is claimed to make it even faster than Groq for very large models.

Custom AI chips from major tech companies

Company	Chip	Use
Google	TPU v5, Axion (Arm-based CPU)	Gemini training and inference
Apple	Neural Engine	On-device; M-series; A-series
Amazon	Trainium (training), Inferentia (inference)	AWS Bedrock; internal Amazon AI
Microsoft	Maia (in development)	Azure AI; training
Meta	MTIA (Meta Training and Inference Accelerator)	Internal inference and training, mainly recommendation models
Tesla	Dojo (training supercomputer chip)	Autonomous driving

The pattern: every major AI company eventually builds its own chip to reduce NVIDIA dependence and cut costs.

The NVIDIA stranglehold

NVIDIA’s current position in AI hardware is often compared to Microsoft’s dominance of PC operating systems in the 1990s. Key facts:

Market share: NVIDIA holds roughly 70–95% of the AI training chip market (estimates vary by source)
Valuation: NVIDIA briefly became the world’s most valuable public company in 2024 (surpassing Apple and Microsoft)
H100 shortage: In 2022–2023, AI companies were paying $30,000–$40,000 per chip on the secondary market; cloud providers had waitlists
Blackwell (B200): Released 2024–2025; claimed to be up to about 5× faster than H100 on some AI workloads; demand is again outstripping supply

The CUDA moat: NVIDIA’s software platform (CUDA, cuDNN, NCCL) has been publicly available since 2007 and is the de facto standard for AI. Nearly all AI research software assumes CUDA. Switching chips requires significant re-engineering.

AMD’s challenge: AMD’s ROCm platform is a growing alternative, but still significantly behind CUDA in maturity and ease-of-use. AMD is competitive on hardware; the software ecosystem is the gap.

Consumer AI hardware (for running AI locally)

With quantised small models (Llama 3.2 3B, Phi-3 Mini), you can now run local AI on consumer hardware:

Hardware	RAM	What you can run
Mac with M1/M2/M3/M4	16GB unified	7B–13B parameter models at decent speed
Mac with M1/M2/M3/M4	32GB+	30B models; reasonable speed
PC with 24GB GPU (RTX 4090)	24GB VRAM	30B models at good speed
PC with 2× H100s	160GB VRAM (2 × 80GB)	70B models
Standard laptop (no GPU)	16GB RAM	3B–7B models; slow

Apple Silicon Macs are notably good for local AI because their “unified memory” (shared between CPU and GPU) lets most of the system RAM hold model weights. On a typical PC, the weights must fit in the GPU’s separate VRAM to run at full speed.
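The sizes in the table follow from simple arithmetic: weight memory is roughly the parameter count multiplied by the bits stored per weight. The sketch below is a rough estimate under stated assumptions. The function names and the 0.8 headroom factor are illustrative, not measured figures, and real usage adds a KV cache and runtime overhead on top of the weights.

```python
def weight_gigabytes(params_billions, bits_per_weight):
    """Approximate size of the model weights in GB (1 GB = 10**9 bytes).

    Covers weights only. The KV cache and runtime overhead need extra
    memory that is not modelled here.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions, bits_per_weight, memory_gb, headroom=0.8):
    """True if the weights fit within a fraction of the available memory.

    headroom=0.8 is an assumed safety margin for illustration.
    """
    return weight_gigabytes(params_billions, bits_per_weight) <= memory_gb * headroom


for size in (7, 13, 30, 70):
    row = ", ".join(
        f"{bits}-bit: {weight_gigabytes(size, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{size}B -> {row}")

print(fits(13, 4, 16))  # True: 6.5 GB of weights against 12.8 GB usable
print(fits(70, 4, 24))  # False: 35 GB of weights against 19.2 GB usable
```

By this estimate a 30B model at 4-bit needs about 15 GB for weights, which is why it appears in the 24GB and 32GB rows rather than the 16GB ones.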

Ollama and LM Studio are the easiest tools for running local models. See open-weights-vs-closed for more on local AI.
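As a rough illustration of what running a local model looks like in practice, the sketch below sends a prompt to an Ollama server on the same machine through its local HTTP API. It assumes Ollama is installed and running on its default port, and that the model tag `llama3.2:3b` has already been downloaded; the helper names `build_request` and `ask` are hypothetical and chosen for this example.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumed unchanged).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3.2:3b"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.2:3b"):
    """Send the prompt and return the generated text.

    Requires the Ollama server to be running; nothing leaves the machine.
    """
    with urllib.request.urlopen(build_request(prompt, model)) as reply:
        return json.loads(reply.read())["response"]


# Example call (needs a running Ollama server):
# print(ask("In one sentence, what is an NPU?"))
```

Because the request goes to localhost, the prompt and the reply stay on the device, which is the privacy benefit of local AI.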

AI chips in Australian context

Australia has no domestic semiconductor manufacturing industry — no chip fabs
Australia relies entirely on imported AI hardware (primarily from Taiwan via TSMC, which manufactures most advanced chips for NVIDIA, Apple, AMD, etc.)
Australian AI infrastructure largely runs on hardware operated by US cloud providers
TSMC’s dominance in Taiwan creates a geopolitical risk scenario that is actively discussed in Australian government circles
The AUKUS technology cooperation agreement includes semiconductor/AI hardware cooperation

Gotchas

“GPU” is not the same as “graphics card.” Consumer graphics cards (RTX 4090, Radeon RX 7900 XTX) can run AI but aren’t the same as data centre AI cards (H100, MI300X). The data centre versions have ECC memory, faster interconnects, and are built for 24/7 operation.
VRAM is the bottleneck, not compute. For running AI models, the GPU’s VRAM (video memory) usually limits what you can run more than processing speed does. A 13B parameter model needs about 26GB at 16-bit precision, so it fits on a 24GB GPU only when quantised to 8-bit (about 13GB) or lower.
Quantisation enables smaller hardware. Reducing model precision (for example, from 16-bit floats to 4-bit integers) allows much larger models to run in less VRAM, usually with only a small quality loss. This is what makes local AI practical on consumer hardware.
Apple Silicon efficiency is remarkable. A MacBook with a 36GB M3 Pro can run 13B parameter models faster than many PCs whose dedicated GPUs have too little VRAM to hold the model, because unified memory lets the Mac’s GPU access the whole model at once.
AI chip supply remains constrained. As of mid-2026, demand for AI compute still significantly outpaces supply. Cloud providers often have waitlists for the newest GPU generations.
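The quantisation point above can be illustrated with a toy example. This is a simplified sketch of symmetric 4-bit quantisation with a single shared scale factor. Real quantisation formats are more elaborate (they typically use separate scales for small groups of weights), and the function names and sample weights here are made up for illustration.

```python
def quantise_4bit(weights):
    """Map floats to integers in [-7, 7] using one shared scale factor."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantise(codes, scale):
    """Recover approximate floats from the 4-bit integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.33, 0.12, -0.02]  # made-up sample weights
codes, scale = quantise_4bit(weights)
restored = dequantise(codes, scale)

print(codes)  # [7, -3, 1, 0]
for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

Each weight now needs 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale factor per weight.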

Sources

NVIDIA H100 and B200 technical specifications (2023–2025)
Google Cloud TPU documentation and technical papers
Qualcomm Snapdragon 8 Gen 3 NPU specifications
Apple Neural Engine documentation (developer.apple.com)
TSMC manufacturing process nodes (2023–2025)
Goldman Sachs AI chip market analysis (2024)
Morgan Stanley — “The AI Infrastructure Supercycle” (2024)
CHIPS Act and semiconductor supply chain analysis — Brookings Institution (2023)
Groq LPU technical whitepaper (groq.com)
