AI Speech-to-Text — How Machines Transcribe and Understand Audio

Status: 🟩 COMPLETE 🟦 LIVING Section: 10 — AI and LLMs Tags: speech-to-text, STT, transcription, whisper, deepgram, assemblyai, captions, dictation


What it is

AI speech-to-text (also called STT, transcription, or automatic speech recognition / ASR) is the ability for a computer to listen to audio — a voice recording, a phone call, a meeting, a video — and convert what’s spoken into written text, automatically.

Modern AI speech-to-text:

  • Approaches human-level accuracy (roughly 3–5% word error rate) on clear audio in major languages
  • Can identify who said what in a multi-person conversation (speaker diarisation)
  • Can add punctuation, paragraphs, and formatting automatically
  • Can understand accents, background noise, and overlapping speech (to varying degrees)
  • Can work in real-time (sub-second latency) or on pre-recorded files

This powers: Zoom’s live captions, iPhone dictation, podcast transcription, YouTube auto-captions, customer service call analysis, meeting notes, and much more.


How it works (plain English)

Audio is a continuous wave of sound — fundamentally different from text. AI speech-to-text works in stages:

  1. Feature extraction: The audio is broken into tiny, overlapping time slices (typically about 25 milliseconds long, taken every 10 milliseconds). Each slice is analysed for its frequency pattern: essentially, what mix of pitches it contains. Stacked side by side, these slices form a map of the audio called a spectrogram (think of it as a heatmap showing which sound frequencies are active at each moment in time).

  2. Pattern matching with neural networks: A transformer-based neural network (the same family of architecture as GPT; see how-llms-work) processes these frequency patterns and maps them to words and phrases. It has been trained on a very large amount of human speech (Whisper, for example, on about 680,000 hours of audio), so it knows what “hello” sounds like vs “yellow,” even from different speakers with different accents.

  3. Language model assistance: A language model helps the transcription make sense in context. In older systems this is a separate component; in end-to-end models such as Whisper it is built into the decoder. Given “The new fiscal year started in April” vs “the new physical ear started in April,” the language model knows which interpretation is more likely from the surrounding words.

  4. Output formatting: Punctuation, speaker labels, and timestamps are added, either by rule-based systems or by additional AI models.
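
Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product does it: the 8 kHz sample rate, the function names, and the naive DFT are choices made to keep the example dependency-free. Real systems typically use an optimised FFT and convert the result to a mel scale.

```python
import cmath
import math

SAMPLE_RATE = 8000          # samples per second (assumed for this sketch)
FRAME_MS, HOP_MS = 25, 10   # 25 ms slices, advanced 10 ms at a time


def frames(samples, rate=SAMPLE_RATE):
    """Yield overlapping slices of the signal."""
    size = rate * FRAME_MS // 1000
    hop = rate * HOP_MS // 1000
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def magnitude_spectrum(frame):
    """Naive DFT of one Hann-windowed slice; returns one magnitude per frequency bin."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(samples, rate=SAMPLE_RATE):
    """One row per time slice, one column per frequency bin."""
    return [magnitude_spectrum(f) for f in frames(samples, rate)]


def dominant_frequency(spectrum, rate=SAMPLE_RATE):
    """Frequency (Hz) of the strongest bin in one spectrogram row."""
    n = (len(spectrum) - 1) * 2
    peak = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak * rate / n


# 0.1 seconds of a pure 440 Hz tone as test input.
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE // 10)]
spec = spectrogram(tone)
```

Each row of `spec` is one time slice; calling `dominant_frequency(spec[0])` on the test tone recovers the pitch that was put in. The neural network in step 2 consumes a (much larger) grid of this kind rather than the raw waveform.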


The major speech-to-text tools (mid-2026)

Open / developer

Tool | Country | Best for | Free?
Whisper (OpenAI, open-source) | 🇺🇸 | Run locally; excellent quality; 99 languages; free to use | Yes (open-source)
Faster Whisper (community) | 🇺🇸 | Whisper but 4× faster; GPU-optimised | Yes (open-source)

API / developer services

Tool | Country | Best for | Free tier?
Deepgram Nova-3 | 🇺🇸 | Fastest + most accurate API; real-time STT | Yes (~45 hrs/month free)
AssemblyAI | 🇺🇸 | Transcription + AI summaries + sentiment; podcast/video pipeline | Yes (limited)
Rev AI | 🇺🇸 | Reliable; human review option | Pay-per-minute
AWS Transcribe | 🇺🇸 | AWS-native; medical variant available | Free tier (60 min/month)
Google Cloud Speech-to-Text | 🇺🇸 | 125 languages; strong for global deployments | Free tier (60 min/month)
Azure Speech | 🇺🇸 | Microsoft ecosystem; real-time + batch | Free tier (5 hrs/month)

Consumer / ready-to-use apps

Tool | Country | Best for
Otter.ai | 🇺🇸 | Meeting transcription; Google Meet / Zoom integration
Fireflies.ai | 🇺🇸 | Meeting notes, action items, CRM sync
Grain | 🇺🇸 | Sales call transcription and coaching
Whisper (via apps) | 🇺🇸 | Many apps wrap Whisper for easy use

Built into devices/software

  • iPhone/iPad dictation — Apple’s own STT model, runs on-device (private), English/multilingual
  • Android dictation — Google’s STT, strong quality
  • Microsoft Word dictation — Azure Speech under the hood
  • Zoom live captions — built-in, free
  • YouTube auto-captions — Google’s STT; remarkable quality now

Key concepts

Real-time vs batch: Real-time STT transcribes as you speak, with sub-second latency (typically a few hundred milliseconds); it is used in live captioning and voice assistants. Batch STT processes a pre-recorded file and is typically more accurate because it can see the full context.

Speaker diarisation (or diarization): Identifying who said what. “Speaker A said… Speaker B replied…” This is used in meeting transcription tools. It’s imperfect with more than 4–5 speakers or heavy crosstalk.
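
One common way to combine a transcript with diarisation output is to give each word the speaker whose turn overlaps it most in time. The sketch below assumes hypothetical input shapes: words as `(start, end, text)` tuples and speaker turns as `(start, end, speaker)` tuples, with times in seconds. Real tools each use their own formats.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most."""
    labelled = []
    for w_start, w_end, text in words:
        def overlap(turn):
            t_start, t_end, _ = turn
            return max(0.0, min(w_end, t_end) - max(w_start, t_start))

        best = max(turns, key=overlap)
        # A word that overlaps no turn at all is left unattributed.
        speaker = best[2] if overlap(best) > 0 else "unknown"
        labelled.append((speaker, text))
    return labelled


def group_turns(labelled):
    """Merge consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labelled:
        if lines and lines[-1][0] == speaker:
            lines[-1][1].append(text)
        else:
            lines.append((speaker, [text]))
    return [f"{speaker}: {' '.join(parts)}" for speaker, parts in lines]


words = [(0.0, 0.4, "Hello"), (0.5, 0.9, "there"),
         (1.2, 1.6, "Hi"), (1.7, 2.1, "back")]
turns = [(0.0, 1.0, "Speaker A"), (1.0, 2.5, "Speaker B")]
transcript = group_turns(assign_speakers(words, turns))
```

With crosstalk, a word can overlap two turns almost equally, which is one reason speaker labels degrade when people talk over each other.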

Word error rate (WER): The standard accuracy metric. It counts the word substitutions, deletions, and insertions needed to turn the AI’s transcript into a human reference transcript, divided by the number of words in the reference. Top models now achieve 3–5% WER on clear English audio (roughly human-level). Noisy environments or accented speech push this higher.
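
WER can be computed with a word-level edit distance: WER = (substitutions + deletions + insertions) / words in the reference. The function below is a minimal sketch; it assumes simple lowercasing and whitespace splitting, whereas benchmark tooling usually applies more careful text normalisation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "the new fiscal year started in april",
    "the new physical ear started in april",
)
```

Here two of the seven reference words are substituted, so `wer` is 2/7, or about 29%. Because insertions count as errors, WER can exceed 100% on very poor transcripts.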

Hallucination in STT: Like language models, STT systems can “fill in” words they didn’t actually hear, especially in silence, noise, or unclear speech. Whisper is known to occasionally hallucinate text in silent sections. Always verify important transcripts.
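
A simple way to find candidates for review is to flag transcript segments whose audio is nearly silent. The sketch below is an assumption-laden illustration, not a feature of any named tool: it expects audio as floats in the range -1 to 1, segments as hypothetical `(start, end, text)` tuples, and uses an arbitrary loudness threshold that would need tuning on real recordings.

```python
import math


def rms(samples):
    """Root-mean-square loudness of a chunk of audio."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def suspicious_segments(samples, rate, segments, silence_rms=0.01):
    """Return segments that contain text although the audio is near-silent."""
    flagged = []
    for start, end, text in segments:
        chunk = samples[int(start * rate):int(end * rate)]
        if text.strip() and rms(chunk) < silence_rms:
            flagged.append((start, end, text))
    return flagged


rate = 8000
speech = [0.5 * math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]
silence = [0.0] * rate
segments = [(0.0, 1.0, "hello everyone"), (1.0, 2.0, "thanks for watching")]
flagged = suspicious_segments(speech + silence, rate, segments)
```

In this toy input the second segment sits entirely on silence, so it is the one flagged for a human to check.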

Punctuation restoration: Raw output from traditional STT systems has no punctuation. AI models (or a post-processing step) add full stops, commas, question marks, and paragraph breaks based on speech patterns; end-to-end models such as Whisper produce punctuation directly. Quality varies; some tools do it much better than others.

Custom vocabulary / hotwords: Biasing (or fine-tuning) the model so it correctly recognises unusual words (brand names, medical terms, technical jargon) that might otherwise be transcribed wrong. Many API providers support this.
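
When a provider offers no vocabulary feature, a rough substitute is to correct near-misses after transcription. The sketch below is that swapped-in technique (fuzzy matching with Python’s standard `difflib`), not how providers implement hotwords internally. The function name, the word list, and the 0.8 cutoff are assumptions for illustration.

```python
import difflib


def apply_vocabulary(transcript, vocabulary, cutoff=0.8):
    """Replace tokens that closely resemble a known term with that term."""
    canonical = {term.lower(): term for term in vocabulary}
    fixed = []
    for token in transcript.split():
        match = difflib.get_close_matches(
            token.lower(), list(canonical), n=1, cutoff=cutoff
        )
        fixed.append(canonical[match[0]] if match else token)
    return " ".join(fixed)


corrected = apply_vocabulary(
    "we tested deepgrum for diarization",
    ["Deepgram", "diarisation"],
)
```

This only catches single-word near-misses; a term split across two words, or a cutoff set too low, will produce misses or false corrections, so it is worth testing on real transcripts.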

Language detection: Automatically detecting what language is being spoken. Whisper does this natively.
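
With the open-source `openai-whisper` Python package, the detected language comes back alongside the text. The sketch below assumes that package is installed (along with ffmpeg) and that `path` points to a real audio file; the helper names are hypothetical, and the model name `"base"` is just one of the published sizes.

```python
def summarise_result(result):
    """Pull the language, full text, and timed segments out of a Whisper result dict."""
    segments = [
        (seg["start"], seg["end"], seg["text"].strip())
        for seg in result.get("segments", [])
    ]
    return result["language"], result["text"].strip(), segments


def transcribe_file(path, model_name="base"):
    """Transcribe one file locally; the language is detected automatically."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    return summarise_result(model.transcribe(path))
```

A call such as `language, text, segments = transcribe_file("interview.mp3")` would return a language code like `"en"`, the transcript, and per-segment timestamps (the file name here is a placeholder).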

Code-switching: When a speaker switches between languages mid-sentence (common in multilingual communities). A significant challenge for STT. Some specialised models handle this.


Accuracy: what actually affects it

Factor | Effect
Clear audio, close microphone | Highest accuracy (3–5% WER)
Noisy environment | Accuracy drops; strong models can partially compensate
Strong regional accent | Varies by model; top models are improving
Multiple speakers talking at once | Accuracy drops significantly
Technical vocabulary (medical, legal, engineering) | Poor without domain fine-tuning
Emotional speech (crying, shouting) | Often less accurate
Children’s voices | Notoriously difficult; most models trained on adult speech
Non-native speaker with accent | Variable; getting better with diverse training data

What speech-to-text is used for (real examples)

  • Meeting notes: Otter, Fireflies, Grain attend your Zoom/Teams meeting and produce a full transcript + summary automatically.
  • Podcast production: Transcribe episodes for show notes, SEO, and accessibility. Descript uses the transcript to edit audio by editing text.
  • Video captions: YouTube auto-captions; accessibility compliance; reaching global audiences.
  • Voice dictation: Write documents, emails, and messages by speaking — especially useful for accessibility or when typing is slow.
  • Customer service analytics: Transcribe every support call; AI finds patterns, compliance issues, and coaching moments.
  • Legal and medical documentation: Dictated notes → structured records. Specialised models trained on legal/medical vocabulary.
  • Voice search and commands: Siri, Google Assistant, Alexa transcribe your voice to understand intent.
  • Language learning: Pronunciation feedback by comparing your speech transcription to the expected text.
  • Journalism: Transcribe interviews automatically. Saves hours of manual work.

Gotchas

  • Hallucination in silences: Whisper (especially older versions) sometimes generates text during silent pauses. Always review transcripts of recordings with long silences.
  • Punctuation quality varies widely. AssemblyAI’s punctuation is better than many others; raw Whisper output sometimes needs a cleanup pass.
  • Speaker diarisation is not magic. If two people have similar voices, or speakers are far from the microphone, diarisation gets confused. Treat speaker labels as “approximately right.”
  • Accents and dialects: Australian, South African, and Indian accents are generally supported in top tools but may have higher error rates than American English. Test on your actual speakers.
  • Privacy for sensitive content: Sending audio of confidential meetings to a third-party API is a privacy risk. Consider on-device (Whisper local) for sensitive data.
  • Audio quality matters more than model quality. A clean 128 kbps recording through a cheap microphone will transcribe far better than a noisy 320 kbps file. Fix the recording before fixing the model.
  • Numbers and symbols: “Four hundred and twenty” vs “420” — formats often inconsistent. May need a post-processing step for standardised formatting.
  • Different accents of the same language: Australian English vs US English vs UK English vs Indian English all “count” as English but are meaningfully different for STT models. Check which accent your target audience uses.
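
For the numbers gotcha above, the post-processing step can be as small as a spelled-out-number parser. This is a minimal sketch with a hypothetical helper name; it assumes plain English cardinals and does not handle ordinals, decimals, currencies, or digit-by-digit readouts such as phone numbers.

```python
UNITS = {
    word: value for value, word in enumerate(
        "zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
    )
}
TENS = {
    word: 10 * value for value, word in enumerate(
        "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2
    )
}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(phrase):
    """Convert a phrase such as 'four hundred and twenty' to an integer."""
    total = current = 0
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100   # bare "hundred" means 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current
```

For example, `words_to_number("four hundred and twenty")` gives `420`, so a transcript can be normalised to one consistent format regardless of how the STT engine wrote the number.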

Pricing (mid-2026)

Most APIs charge per minute or per hour of audio:

Service | Approximate price
Deepgram Nova-3 | ~$0.26/hour
AssemblyAI | ~$0.39/hour
Google Cloud STT | up to ~$0.016/minute, depending on feature
AWS Transcribe | ~$0.024/minute
Whisper (local) | Free (your own hardware cost)
Whisper (via OpenAI API) | ~$0.006/minute

A 1-hour podcast episode costs roughly $0.26 to $1.50 to transcribe via API, depending on the service.
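
The arithmetic is simple enough to script when comparing services. The sketch below hard-codes the approximate prices listed above, converted to a per-minute rate; treat them as placeholders and check each provider’s current price list before budgeting.

```python
# Approximate prices from the table above, in US dollars per audio minute.
PRICE_PER_MINUTE = {
    "Deepgram Nova-3": 0.26 / 60,       # listed per hour
    "AssemblyAI": 0.39 / 60,            # listed per hour
    "Google Cloud STT": 0.016,          # upper end; varies by feature
    "AWS Transcribe": 0.024,
    "Whisper (via OpenAI API)": 0.006,
}


def transcription_cost(minutes, service):
    """Estimated cost in dollars for the given audio duration."""
    return round(minutes * PRICE_PER_MINUTE[service], 2)


def compare_costs(minutes):
    """Cost per service, cheapest first."""
    costs = {name: transcription_cost(minutes, name) for name in PRICE_PER_MINUTE}
    return sorted(costs.items(), key=lambda item: item[1])


podcast = compare_costs(60)
```

For a 60-minute episode this ranks Deepgram cheapest at about $0.26 and AWS Transcribe most expensive at about $1.44.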


Sources

  • OpenAI Whisper paper and documentation (2022–2026)
  • Deepgram Nova-3 announcement and benchmarks (2024–2026)
  • AssemblyAI Universal-2 documentation (2024–2026)
  • Google Cloud Speech-to-Text API documentation
  • NIST speech recognition benchmarks
  • Koenecke et al., “Racial disparities in automated speech recognition” (PNAS 2020) — accuracy differences across accents
  • Rev.com industry transcription accuracy reports (2024)