AI Hardware Overview — The Chips Behind Artificial Intelligence
Status: 🟩 COMPLETE 🟦 LIVING Tags: AI-hardware, GPU, TPU, NPU, NVIDIA, AMD, Apple-Silicon, Groq, AI-chips
Why hardware matters for AI
AI models — particularly large language models and image generators — require enormous amounts of mathematical calculation. The hardware that performs these calculations determines:
- How fast AI runs (response speed)
- How expensive AI is to operate (costs passed to users)
- Who can run AI (only companies with massive server farms, or everyone on their laptop)
- The environmental footprint (energy use)
Understanding the hardware landscape helps you make sense of why some AI is expensive and slow, why “local AI” is becoming possible, and why NVIDIA is currently one of the world’s most valuable companies.
The main types of AI hardware
GPUs (Graphics Processing Units) — the dominant AI chip
GPUs were originally designed for rendering video game graphics. Graphics requires calculating millions of pixels simultaneously — the GPU does this with thousands of tiny parallel processing cores (vs a CPU’s handful of powerful cores).
This “massively parallel” architecture turns out to be exactly what AI training and inference need: both consist largely of one mathematical operation (matrix multiplication) applied to millions of numbers at once.
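To make the parallelism concrete, here is a minimal pure-Python sketch of matrix multiplication. The function names `matmul` and `count_multiply_adds` are illustrative, not taken from any library. The point is that every output cell is computed independently of the others, which is why thousands of GPU cores can each take one cell at the same time.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Cell (i, j) depends only on row i of a and column j of b,
            # so all n * m cells could be computed simultaneously.
            out[i][j] = sum(a[i][p] * b[p][j] for p in range(k))
    return out


def count_multiply_adds(n, k, m):
    """Number of multiply-add steps in an (n x k) by (k x m) product."""
    return n * k * m


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
print(count_multiply_adds(4096, 4096, 4096))       # 68719476736
```

A CPU works through these cells a few at a time; a GPU assigns them to thousands of cores at once. The 4096 × 4096 size is only an illustrative figure, not a measurement of any particular model.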
NVIDIA dominates this market. The NVIDIA H100 and H200 GPUs are the standard for AI training; the newer NVIDIA Blackwell B200 (2024–2025) is the generation that succeeds them. These cards have sold for up to around US$40,000 each and are often backordered for months.
Why NVIDIA? Not just hardware: NVIDIA’s CUDA software platform (introduced in 2006, long before the AI boom) created an enormous software ecosystem. Every major AI framework (PyTorch, TensorFlow) was built around CUDA first. Moving AI workloads to AMD or Intel GPUs usually means porting and re-tuning CUDA-specific code, a significant barrier that NVIDIA built intentionally.
AMD GPUs: AMD’s Instinct MI300X is a legitimate competitor to NVIDIA H100 for some workloads. AMD has been narrowing the gap. ROCm (AMD’s CUDA equivalent) is improving. But NVIDIA maintains a significant software and ecosystem lead.
TPUs (Tensor Processing Units) — Google’s custom AI chip
Google built its own custom chip specifically for AI workloads, optimised for the matrix operations that AI models rely on. TPUs are:
- Used to train Google’s own models (Gemini was trained on TPUs)
- Available to developers through Google Cloud (TPU VMs)
- More energy-efficient than GPUs for certain models and workloads
v5e and v5p TPUs (announced 2023): the fifth generation; newer generations (Trillium, Ironwood) have since been announced. TPUs are not available for purchase, only via Google Cloud.
NPUs (Neural Processing Units) — AI chips in your devices
An NPU is a small, power-efficient chip embedded in consumer devices (phones, laptops) specifically to run AI inference locally — on the device, without sending data to a server.
- Apple Neural Engine: In every Apple Silicon chip (M1, M2, M3, M4, A15, A16, A17, A18). Powers Face ID, Siri, on-device translation, Pixelmator Pro’s ML features. Very efficient; highly optimised for Apple’s specific models.
- Qualcomm Hexagon NPU: In Snapdragon chips; powers Android on-device AI. Qualcomm Snapdragon 8 Gen 3 is a leader.
- Intel NPU: In Intel Core Ultra processors (Meteor Lake onwards). Powers Microsoft’s Copilot+ PC features.
- AMD XDNA NPU: In AMD Ryzen AI processors. Part of the Windows AI PC ecosystem.
NPUs enable: on-device AI (no internet required), privacy (data stays on device), low latency (no network round trip), low power consumption.
Groq LPU (Language Processing Unit)
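Groq built a completely different architecture called an LPU, designed specifically for inference (running trained models) rather than training. Instead of relying on GPU-style external memory and run-time scheduling, the LPU keeps data in fast on-chip memory and follows an execution schedule fixed in advance by the compiler, which gives it very high and predictable memory bandwidth.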
The result: extremely fast text generation — Groq’s LPU generates tokens at 500–800+ tokens per second (vs typical 100–200 for GPU-based inference). This translates to near-instant response from AI chatbots.
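The effect of those rates on waiting time is simple arithmetic. A minimal sketch, assuming a 400-token reply (an illustrative length, not a figure from Groq) and ignoring network latency and time to first token:

```python
def generation_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady rate, ignoring start-up latency."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


REPLY_TOKENS = 400  # assumed reply length, for illustration only

for label, rate in [
    ("GPU-based, 100 tok/s", 100),
    ("GPU-based, 200 tok/s", 200),
    ("LPU, 500 tok/s", 500),
    ("LPU, 800 tok/s", 800),
]:
    print(f"{label}: {generation_seconds(REPLY_TOKENS, rate):.1f} s")
# GPU-based, 100 tok/s: 4.0 s
# GPU-based, 200 tok/s: 2.0 s
# LPU, 500 tok/s: 0.8 s
# LPU, 800 tok/s: 0.5 s
```

At these rates the same reply arrives in under a second instead of several seconds, which is roughly the difference between watching text stream in and seeing it appear at once.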
Groq offers a cloud API service using their LPUs. As of 2026, they support Llama, Mixtral, and Gemma models.
Cerebras WSE (Wafer Scale Engine)
See cerebras for full detail. The WSE is a dinner-plate-sized single chip — the world’s largest chip — with 4 trillion transistors. Even faster than Groq for very large models due to enormous on-chip memory.
Custom AI chips from major tech companies
| Company | Chip | Use |
|---|---|---|
| Google | TPU v5, Axion | Gemini training and inference |
| Apple | Neural Engine | On-device; M-series; A-series |
| Amazon | Trainium (training), Inferentia (inference) | AWS Bedrock; internal Amazon AI |
| Microsoft | Maia 100 (announced 2023) | Azure AI; training and inference |
| Meta | MTIA (Meta Training and Inference Accelerator) | Internal ranking and recommendation models, mainly inference |
| Tesla | Dojo (training supercomputer built on the D1 chip) | Autonomous driving model training |
The pattern: every major AI company eventually builds its own chip to reduce NVIDIA dependence and cut costs.
The NVIDIA stranglehold
NVIDIA’s current position in AI hardware is often compared to Microsoft’s dominance of PC operating systems in the 1990s. Key facts:
- Market share: NVIDIA holds an estimated 70–95% of the AI training chip market (estimates vary by source and by how the market is defined)
- Valuation: NVIDIA briefly became the world’s most valuable public company in 2024 (surpassing Apple and Microsoft)
- H100 shortage: In 2022–2023, AI companies were paying around US$40,000 per chip on the secondary market; cloud providers had waitlists
- Blackwell (B200): Released 2024–2025; up to about 5× the H100’s AI performance on NVIDIA’s own figures, depending on workload and numeric precision; demand is again outstripping supply
The CUDA moat: NVIDIA’s software platform (CUDA, cuDNN, NCCL) has been publicly available since 2007 and became the default for AI research. Nearly all AI research software assumes CUDA. Switching chips requires significant re-engineering.
AMD’s challenge: AMD’s ROCm platform is a growing alternative, but still significantly behind CUDA in maturity and ease-of-use. AMD is competitive on hardware; the software ecosystem is the gap.
Consumer AI hardware (for running AI locally)
With quantised small models (Llama 3.2 3B, Phi-3 Mini), you can now run local AI on consumer hardware:
| Hardware | RAM | What you can run |
|---|---|---|
| Mac with M1/M2/M3/M4 | 16GB unified | 7B–13B parameter models at decent speed |
| Mac with M1/M2/M3/M4 | 32GB+ | 30B models; reasonable speed |
| PC with RTX 4090 | 24GB VRAM | 30B models (quantised) at good speed |
| Workstation or server with 2× H100 | 160GB VRAM (2 × 80GB) | 70B models |
| Standard laptop (no GPU) | 16GB RAM | 3B–7B models; slow |
Apple Silicon Macs are notably good for local AI because their “unified memory” (shared between CPU and GPU) lets most of the system RAM hold model weights, whereas on a PC the weights must fit in the GPU’s separate VRAM to run at full speed.
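The table above follows from a rule of thumb: the weights take roughly (number of parameters) × (bits per weight) ÷ 8 bytes. The sketch below applies that rule. The function names and the 20% allowance for working memory are assumptions made for illustration; real overhead varies with context length and the software used.

```python
def weight_gigabytes(params_billions, bits_per_weight):
    """Approximate size of the model weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions, bits_per_weight, memory_gb, overhead=1.2):
    """Rough check: weights plus an assumed 20% allowance must fit in memory."""
    return weight_gigabytes(params_billions, bits_per_weight) * overhead <= memory_gb


for params, bits in [(7, 4), (13, 16), (30, 4), (70, 4), (70, 16)]:
    size = weight_gigabytes(params, bits)
    print(f"{params}B at {bits}-bit: about {size:.1f} GB of weights")

print(fits_in_memory(30, 4, 24))   # True: about 15 GB of weights
print(fits_in_memory(13, 16, 24))  # False: about 26 GB of weights
```

This is why a 30B model is only practical on a 24GB card once it has been quantised to around 4 bits per weight.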
Ollama and LM Studio are the easiest tools for running local models. See open-weights-vs-closed for more on local AI.
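As a rough sketch of what using a local model looks like in practice, the code below prepares a request for Ollama’s local HTTP endpoint (`/api/generate` on port 11434, its default). It assumes Ollama is installed and running, and that a model has already been downloaded; the model tag `llama3.2:3b` and the helper names `build_generate_request` and `ask` are illustrative choices, not part of this article’s sources.

```python
import json
import urllib.request

# Ollama's default local address; nothing here leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt, model="llama3.2:3b"):
    """Build (but do not send) a non-streaming generation request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="llama3.2:3b", timeout=120):
    """Send the prompt to a locally running Ollama server and return its reply."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

With the server running, `ask("In one sentence, what is an NPU?")` would return the model’s reply as a string. If no server is listening on that port, the call raises a connection error.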
AI chips in Australian context
- Australia has no domestic semiconductor manufacturing industry — no chip fabs
- Australia relies entirely on imported AI hardware (primarily from Taiwan via TSMC, which manufactures most advanced chips for NVIDIA, Apple, AMD, etc.)
- Australian AI infrastructure runs on US cloud hardware
- TSMC’s dominance in Taiwan creates a geopolitical risk scenario that is actively discussed in Australian government circles
- The AUKUS technology cooperation agreement includes semiconductor/AI hardware cooperation
Gotchas
- “GPU” is not the same as “graphics card.” Consumer graphics cards (RTX 4090, Radeon RX 7900 XTX) can run AI but aren’t the same as data centre AI cards (H100, MI300X). The data centre versions have ECC memory, faster interconnects, and are built for 24/7 operation.
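- VRAM is the bottleneck, not compute. For running AI models, the GPU’s VRAM (video memory) usually limits what you can run more than processing speed does. A 13B parameter model needs about 26GB for its weights at 16-bit precision (52GB at 32-bit), so it does not fit on a 24GB GPU unquantised; at 8-bit (about 13GB) or 4-bit (about 6.5GB) it fits comfortably.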
- Quantisation enables smaller hardware. Reducing model precision (from 16-bit or 32-bit floats to 4-bit integers) lets much larger models fit in the same VRAM, usually with only a small loss in quality. This is what makes local AI practical on consumer hardware.
- Apple Silicon efficiency is remarkable. A MacBook with a 36GB M3 Pro can run 13B parameter models faster than many PCs whose dedicated GPUs have too little VRAM to hold the model, because unified memory lets the GPU address most of the system RAM directly.
- AI chip supply remains constrained. As of mid-2026, demand for AI compute still significantly outpaces supply. Cloud providers often have waitlists for the newest GPU generations.
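The quantisation gotcha above can be illustrated with a toy example. This is a minimal sketch of symmetric 4-bit quantisation with a single shared scale factor; the function names are illustrative, and real model formats use more refined schemes (for example, separate scales for small blocks of weights).

```python
def quantise_int4(weights):
    """Map floats to integer codes in [-7, 7] plus one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round each weight to the nearest representable step, then clamp.
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantise_int4(weights)
restored = dequantise(codes, scale)

print(codes)  # [7, -3, 1, 0]
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"largest rounding error: {worst_error:.3f}")
```

Each code fits in 4 bits instead of 16 or 32, so the stored weights shrink by 4× to 8×. The price is a small rounding error per weight, which here is at most half of one step of the scale.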
See also
- nvidia-ai — NVIDIA as an AI company (full entry)
- cerebras — the alternative chip architecture
- groq — LPU for fast inference
- open-weights-vs-closed — running local models on your own hardware
- ai-energy-footprint — energy implications of AI hardware
- coreweave — GPU cloud built on NVIDIA hardware
Sources
- NVIDIA H100 and B200 technical specifications (2023–2025)
- Google Cloud TPU documentation and technical papers
- Qualcomm Snapdragon 8 Gen 3 NPU specifications
- Apple Neural Engine documentation (developer.apple.com)
- TSMC manufacturing process nodes (2023–2025)
- Goldman Sachs AI chip market analysis (2024)
- Morgan Stanley — “The AI Infrastructure Supercycle” (2024)
- CHIPS Act and semiconductor supply chain analysis — Brookings Institution (2023)
- Groq LPU technical whitepaper (groq.com)