AI Speech-to-Text — How Machines Transcribe and Understand Audio
Status: 🟩 COMPLETE 🟦 LIVING Section: 10 — AI and LLMs Tags: speech-to-text, STT, transcription, whisper, deepgram, assemblyai, captions, dictation
What it is
AI speech-to-text (also called STT, transcription, or automatic speech recognition / ASR) is the ability for a computer to listen to audio — a voice recording, a phone call, a meeting, a video — and convert what’s spoken into written text, automatically.
Modern AI speech-to-text:
- Achieves human-level accuracy on clear audio in major languages
- Can identify who said what in a multi-person conversation (speaker diarisation)
- Can add punctuation, paragraphs, and formatting automatically
- Can understand accents, background noise, and overlapping speech (to varying degrees)
- Can work in real-time (sub-second latency) or on pre-recorded files
This powers: Zoom’s live captions, iPhone dictation, podcast transcription, YouTube auto-captions, customer service call analysis, meeting notes, and much more.
How it works (plain English)
Audio is a continuous wave of sound — fundamentally different from text. AI speech-to-text works in stages:
- Feature extraction: The audio is split into short, overlapping time slices (usually 10–25 milliseconds each). Each slice is analysed for its frequency pattern, that is, which mix of pitches it contains. Stacking the slices side by side produces a spectrogram: a heatmap showing which sound frequencies are active at each moment in time.
- Pattern matching with neural networks: A transformer-based neural network (the same type of architecture as GPT; see how-llms-work) processes these frequency patterns and maps them to words and phrases. It has been trained on hundreds of thousands of hours of human speech (Whisper’s original training set was about 680,000 hours), so it knows what “hello” sounds like vs “yellow,” even from different speakers with different accents.
- Language model assistance: A language model component (a separate model in older pipelines; built into the decoder in end-to-end models such as Whisper) helps the transcription make sense in context. Given “The new fiscal year started in April” vs “the new physical ear started in April,” it knows which interpretation is more likely from the surrounding words.
- Output formatting: Punctuation, speaker labels, and timestamps are added, either by rule-based systems or by additional AI models.
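The feature-extraction stage can be sketched in a few lines of plain Python. This is an illustration only: it assumes 25 ms frames starting every 10 ms, uses a deliberately slow textbook DFT, and runs on a synthetic 440 Hz tone rather than real speech. Production systems typically use optimised FFT libraries and log-mel spectrograms; the function names here are made up for the example.

```python
import cmath
import math


def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Split audio into overlapping frames (25 ms long, one starting every 10 ms)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


def magnitude_spectrum(frame):
    """Naive DFT of one Hann-windowed frame; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, sample_rate):
    """One magnitude spectrum per frame: rows are time, columns are frequency."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, sample_rate)]


# Synthetic input: 0.1 s of a 440 Hz sine wave sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)]
spec = spectrogram(tone, SAMPLE_RATE)

# Each frequency bin is sample_rate / frame_length = 8000 / 200 = 40 Hz wide.
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), "frames; loudest frequency is about", loudest_bin * 40, "Hz")
```

Each row of `spec` is one column of the heatmap described above; the neural network in the next stage takes a grid like this as its input instead of the raw waveform.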
The major speech-to-text tools (mid-2026)
Open / developer
| Tool | Country | Best for | Free? |
|---|---|---|---|
| Whisper (OpenAI, open-source) | 🇺🇸 | Run locally; excellent quality; 99 languages; free to use | Yes (open-source) |
| Faster Whisper (community) | 🇺🇸 | Whisper reimplemented on CTranslate2; up to 4× faster at the same accuracy; runs on CPU or GPU | Yes (open-source) |
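Running Whisper locally takes only a few lines with the `openai-whisper` Python package. The sketch below assumes that package and ffmpeg are installed; `whisper.load_model` and `model.transcribe` are the package's own calls, while `format_segments`, `transcribe_locally`, and the file name in the usage note are hypothetical names chosen for this example.

```python
def format_segments(segments):
    """Turn Whisper-style segments into '[start-end] text' lines."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_locally(audio_path, model_size="base"):
    """Transcribe a file on this machine, so no audio leaves the device."""
    # Imported lazily so the formatting helper works without the package.
    import whisper  # pip install openai-whisper (also needs ffmpeg)

    model = whisper.load_model(model_size)  # weights download on first use
    result = model.transcribe(audio_path)   # dict with "text", "segments", "language"
    return result["language"], format_segments(result["segments"])


# Demonstrate the formatter on hand-written segments shaped like Whisper's output.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.0, "text": " Today we talk about captions."},
]
for line in format_segments(sample_segments):
    print(line)
```

With a real recording, a call such as `transcribe_locally("interview.mp3")` would return the detected language and the timestamped lines. Larger model sizes are generally more accurate but slower.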
API / developer services
| Tool | Country | Best for | Free tier? |
|---|---|---|---|
| Deepgram Nova-3 | 🇺🇸 | Fastest + most accurate API; real-time STT | Yes (~45 hrs/month free) |
| AssemblyAI | 🇺🇸 | Transcription + AI summaries + sentiment; podcast/video pipeline | Yes (limited) |
| Rev AI | 🇺🇸 | Reliable; human review option | Pay-per-minute |
| AWS Transcribe | 🇺🇸 | AWS-native; medical variant available | Free tier (60 min/month) |
| Google Cloud Speech-to-Text | 🇺🇸 | 125 languages; strong for global deployments | Free tier (60 min/month) |
| Azure Speech | 🇺🇸 | Microsoft ecosystem; real-time + batch | Free tier (5 hrs/month) |
Consumer / ready-to-use apps
| Tool | Country | Best for |
|---|---|---|
| Otter.ai | 🇺🇸 | Meeting transcription; Google Meet / Zoom integration |
| Fireflies.ai | 🇺🇸 | Meeting notes, action items, CRM sync |
| Grain | 🇺🇸 | Sales call transcription and coaching |
| Whisper (via apps) | 🇺🇸 | Many apps wrap Whisper for easy use |
Built into devices/software
- iPhone/iPad dictation — Apple’s own STT model, runs on-device (private), English/multilingual
- Android dictation — Google’s STT, strong quality
- Microsoft Word dictation — Azure Speech under the hood
- Zoom live captions — built-in, free
- YouTube auto-captions — Google’s STT; remarkable quality now
Key concepts
Real-time vs batch: Real-time STT transcribes as you speak (latency in milliseconds — used in live captioning, voice assistants). Batch STT processes a pre-recorded file and is typically more accurate because it can see the full context.
Speaker diarisation (or diarization): Identifying who said what. “Speaker A said… Speaker B replied…” This is used in meeting transcription tools. It’s imperfect with more than 4–5 speakers or heavy crosstalk.
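One simple way to combine a transcript with diarisation output is to give each transcript segment the speaker whose turn overlaps it most. The sketch below assumes the two outputs arrive as lists of dictionaries with `start` and `end` times in seconds; those shapes and the function names are illustrative assumptions, not the format of any particular tool.

```python
def overlap_seconds(a, b):
    """Length of time (in seconds) that two {'start', 'end'} spans share."""
    return max(0.0, min(a["end"], b["end"]) - max(a["start"], b["start"]))


def label_segments(segments, turns):
    """Attach to each transcript segment the speaker whose turn overlaps it most."""
    labelled = []
    for seg in segments:
        best_turn = max(turns, key=lambda turn: overlap_seconds(seg, turn), default=None)
        if best_turn is None or overlap_seconds(seg, best_turn) == 0.0:
            speaker = "unknown"  # nobody was detected speaking during this segment
        else:
            speaker = best_turn["speaker"]
        labelled.append({**seg, "speaker": speaker})
    return labelled


# Hypothetical diarisation output: who spoke when.
turns = [
    {"speaker": "A", "start": 0.0, "end": 4.0},
    {"speaker": "B", "start": 4.0, "end": 9.0},
]
# Hypothetical transcript segments with timestamps.
segments = [
    {"start": 0.5, "end": 3.5, "text": "How did the launch go?"},
    {"start": 3.8, "end": 8.0, "text": "Better than expected."},
    {"start": 9.5, "end": 10.0, "text": "Great."},
]

labelled = label_segments(segments, turns)
for seg in labelled:
    print(f"Speaker {seg['speaker']}: {seg['text']}")
```

The second segment starts slightly before speaker B's turn begins, which is typical of real data. The overlap rule still assigns it to B, but the same rule guesses wrong when two people genuinely talk at once.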
Word error rate (WER): The standard accuracy metric: the number of word-level mistakes (substituted, missing, or extra words) divided by the number of words in a human reference transcript. Top models now achieve 3–5% WER on clear English audio, roughly the level of human transcribers. Noisy environments or accented speech push this higher.
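WER is conventionally computed as (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions needed to turn the reference into the model's output, and N is the number of words in the reference. A minimal sketch using word-level edit distance follows. It lowercases and splits on whitespace only; real evaluations usually also normalise punctuation and number formats first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far into hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the new fiscal year started in April",
    "the new physical ear started in April",
)
print(f"WER = {wer:.1%}")  # WER = 28.6% (2 wrong words out of 7)
```

Because insertions count as errors, WER can exceed 100% when the output contains many extra words.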
Hallucination in STT: Like language models, STT systems can “fill in” words they didn’t actually hear, especially in silence, noise, or unclear speech. Whisper is known to occasionally hallucinate text in silent sections. Always verify important transcripts.
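Whisper reports a few per-segment statistics that can help triage suspect passages before a human review. The sketch below assumes segments shaped like the `openai-whisper` output, with `no_speech_prob`, `avg_logprob`, and `compression_ratio` fields. The thresholds mirror that package's default `transcribe` settings, but the flagging function itself and the sample data are hypothetical, and the rule is a heuristic rather than a reliable detector.

```python
def flag_suspect_segments(
    segments,
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    compression_ratio_threshold=2.4,
):
    """Return (start, text, reasons) for segments that deserve a manual check."""
    suspects = []
    for seg in segments:
        reasons = []
        # The model thinks nobody is speaking AND is unsure of its own words.
        if seg["no_speech_prob"] > no_speech_threshold and seg["avg_logprob"] < logprob_threshold:
            reasons.append("likely silence")
        # Highly compressible text usually means the same words looping.
        if seg["compression_ratio"] > compression_ratio_threshold:
            reasons.append("repetitive text")
        if reasons:
            suspects.append((seg["start"], seg["text"].strip(), reasons))
    return suspects


# Hand-written example segments (not real model output).
segments = [
    {"start": 0.0, "text": " Welcome back to the show.",
     "no_speech_prob": 0.02, "avg_logprob": -0.21, "compression_ratio": 1.3},
    {"start": 42.0, "text": " Thanks for watching!",
     "no_speech_prob": 0.91, "avg_logprob": -1.40, "compression_ratio": 0.9},
    {"start": 60.0, "text": " so so so so so so so so",
     "no_speech_prob": 0.10, "avg_logprob": -0.50, "compression_ratio": 3.1},
]

suspects = flag_suspect_segments(segments)
for start, text, reasons in suspects:
    print(f"{start:7.1f}s  {text!r}  ({', '.join(reasons)})")
```

A flagged segment is not necessarily wrong, and an unflagged one is not necessarily right; the list only tells a reviewer where to listen first.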
Punctuation restoration: Classic STT output is an unpunctuated stream of words, so a post-processing model adds full stops, commas, question marks, and paragraph breaks based on speech patterns. Newer end-to-end models such as Whisper emit punctuation directly. Either way, quality varies; some tools do it much better than others.
Custom vocabulary / hotwords: Telling the model to expect unusual words (brand names, medical terms, technical jargon) that might otherwise be transcribed wrong. This is usually done by supplying a word list with the request rather than by retraining the model. Many API providers support this.
Language detection: Automatically detecting what language is being spoken. Whisper does this natively.
Code-switching: When a speaker switches between languages mid-sentence (common in multilingual communities). A significant challenge for STT. Some specialised models handle this.
Accuracy: what actually affects it
| Factor | Effect |
|---|---|
| Clear audio, close microphone | Highest accuracy (3–5% WER) |
| Noisy environment | Accuracy drops; strong models can partially compensate |
| Strong regional accent | Varies by model; top models are improving |
| Multiple speakers talking at once | Accuracy drops significantly |
| Technical vocabulary (medical, legal, engineering) | Poor without domain fine-tuning |
| Emotional speech (crying, shouting) | Often less accurate |
| Children’s voices | Notoriously difficult; most models trained on adult speech |
| Non-native speaker with accent | Variable; getting better with diverse training data |
What speech-to-text is used for (real examples)
- Meeting notes: Otter, Fireflies, Grain attend your Zoom/Teams meeting and produce a full transcript + summary automatically.
- Podcast production: Transcribe episodes for show notes, SEO, and accessibility. Descript uses the transcript to edit audio by editing text.
- Video captions: YouTube auto-captions; accessibility compliance; reaching global audiences.
- Voice dictation: Write documents, emails, and messages by speaking — especially useful for accessibility or when typing is slow.
- Customer service analytics: Transcribe every support call; AI finds patterns, compliance issues, and coaching moments.
- Legal and medical documentation: Dictated notes → structured records. Specialised models trained on legal/medical vocabulary.
- Voice search and commands: Siri, Google Assistant, Alexa transcribe your voice to understand intent.
- Language learning: Pronunciation feedback by comparing your speech transcription to the expected text.
- Journalism: Transcribe interviews automatically. Saves hours of manual work.
Gotchas
- Hallucination in silences: Whisper (especially older versions) sometimes generates text during silent pauses. Always review transcripts of recordings with long silences.
- Punctuation quality varies widely. AssemblyAI’s punctuation is better than many others; raw Whisper output sometimes needs a cleanup pass.
- Speaker diarisation is not magic. If two people have similar voices, or speakers are far from the microphone, diarisation gets confused. Treat speaker labels as “approximately right.”
- Accents and dialects: Australian, South African, and Indian accents are generally supported in top tools but may have higher error rates than American English. Test on your actual speakers.
- Privacy for sensitive content: Sending audio of confidential meetings to a third-party API is a privacy risk. Consider on-device (Whisper local) for sensitive data.
- Audio quality matters more than model quality. A clean 128 kbps recording through a cheap microphone will transcribe far better than a noisy 320 kbps file. Fix the recording before fixing the model.
- Numbers and symbols: “Four hundred and twenty” vs “420” — formats often inconsistent. May need a post-processing step for standardised formatting.
- Different accents of the same language: Australian English vs US English vs UK English vs Indian English all “count” as English but are meaningfully different for STT models. Check which accent your target audience uses.
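For the numbers-and-symbols gotcha, a small post-processing step can standardise spelled-out numbers. The sketch below is a deliberately limited assumption-laden example: it handles cardinal numbers such as "four hundred and twenty" but not ordinals, decimals, or colloquial forms like "four twenty", and the function name is made up for illustration.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(phrase):
    """Convert e.g. 'four hundred and twenty' to 420; ValueError on unknown words."""
    total, current, seen_number = 0, 0, False
    for word in re.findall(r"[a-z]+", phrase.lower()):
        if word == "and":
            continue  # "four hundred and twenty" == "four hundred twenty"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
        seen_number = True
    if not seen_number:
        raise ValueError("no number words found")
    return total + current


for phrase in ["four hundred and twenty", "twenty-one", "three thousand five hundred"]:
    print(phrase, "->", words_to_number(phrase))
```

Whichever direction is chosen (words to digits or digits to words), applying one rule to every transcript keeps downstream search and analytics consistent.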
Pricing (mid-2026)
Most APIs charge per minute or per hour of audio:
| Service | Approximate price |
|---|---|
| Deepgram Nova-3 | ~$0.26/hour |
| AssemblyAI | ~$0.39/hour |
| Google Cloud STT | up to ~$0.016/minute, depending on feature |
| AWS Transcribe | ~$0.024/minute |
| Whisper (local) | Free (your own hardware cost) |
| Whisper (via OpenAI API) | ~$0.006/minute |
A 1-hour podcast episode costs at most roughly $1.50 to transcribe via API, and as little as about $0.26 at the cheapest rate listed above.
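The per-episode arithmetic can be checked with a short script. The rates are the approximate figures from the table above, so the results are estimates only; the Google Cloud figure is the single rate listed there, and actual pricing depends on the feature used.

```python
# Approximate mid-2026 rates from the pricing table, in US dollars.
PER_HOUR = {
    "Deepgram Nova-3": 0.26,
    "AssemblyAI": 0.39,
}
PER_MINUTE = {
    "Google Cloud STT": 0.016,  # listed rate; varies by feature
    "AWS Transcribe": 0.024,
    "Whisper (via OpenAI API)": 0.006,
}


def cost_usd(service, minutes):
    """Estimated cost of transcribing `minutes` of audio, rounded to cents."""
    if service in PER_HOUR:
        return round(PER_HOUR[service] * minutes / 60, 2)
    return round(PER_MINUTE[service] * minutes, 2)


# Cost of one 60-minute podcast episode on each service.
for service in [*PER_HOUR, *PER_MINUTE]:
    print(f"{service}: ${cost_usd(service, 60):.2f}")
```

At these rates a weekly one-hour show costs between roughly $14 and $75 a year to transcribe, which is why hardware and engineering time, not API fees, usually decide whether running Whisper locally is worthwhile.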
See also
- Whisper — OpenAI’s open-source transcription model
- voice-synthesis — the reverse: text → speech
- real-time-voice-ai — full two-way spoken AI conversation
- ai-translation — translate the transcribed text into other languages
- multimodal-vision-audio — AI that processes audio as input
Sources
- OpenAI Whisper paper and documentation (2022–2026)
- Deepgram Nova-3 announcement and benchmarks (2024–2026)
- AssemblyAI Universal-2 documentation (2024–2026)
- Google Cloud Speech-to-Text API documentation
- NIST speech recognition benchmarks
- Koenecke et al., “Racial disparities in automated speech recognition” (PNAS 2020) — accuracy differences across accents
- Rev.com industry transcription accuracy reports (2024)