Text encodings & UTF-8
Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Why “café” sometimes shows up as “cafÃ©” — and the one rule (use UTF-8 everywhere) that makes the problem go away.
In plain English
Computers store everything as numbers. Text isn’t an exception — every character (A, é, 🎉) is represented as a number under the hood. The encoding is the agreement between programs about which number means which character.
If you write text in one encoding and another program reads it expecting a different encoding, you get mojibake — garbled output where the right bytes are interpreted as the wrong characters. The classic symptom: text containing accents or non-Latin characters comes out as nonsense.
“café” → saved correctly → opened with wrong encoding → “cafÃ©”
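The chain above can be reproduced in a few lines. This is a minimal sketch that assumes the text was saved as UTF-8 and then opened as Windows-1252 (Python's `cp1252` codec); other mismatched pairs produce different garbage.

```python
original = "café"

# Save correctly: UTF-8 stores "é" as the two bytes C3 A9.
stored = original.encode("utf-8")

# Open with the wrong encoding: Windows-1252 turns each byte into its own character.
garbled = stored.decode("cp1252")

print(stored.hex(" "))  # 63 61 66 c3 a9
print(garbled)          # cafÃ©
```

The bytes never changed; only the interpretation did, which is why mojibake of this kind is usually reversible.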
For ~30 years this was a real problem. Different OSes, programs, and languages used different encodings (ASCII, Latin-1, Windows-1252, Shift-JIS, Big5, etc.), and converting between them was a constant chore.
UTF-8 solved this. It’s an encoding that can represent every character in every writing system (over 150,000 of them, including emoji), is backwards-compatible with ASCII for English text, and is the de facto standard for the modern web and for almost all file formats.
The practical rule for 2026: use UTF-8 for everything. Every file. Every database column. Every HTTP response. Every config file. You’ll occasionally still hit legacy text in another encoding — but UTF-8 is the right choice for any new work.
Why it matters
You won’t think about encodings often. But when something goes wrong, the symptoms look like total chaos. Knowing the model lets you diagnose:
- “Why does my CSV have weird characters?”
- “Why does my user’s name show up wrong in the database?”
- “Why does Claude Code’s output have `??` instead of curly quotes?”
- “Why does my .env file say my Slack token has a `U+200B` (zero-width space) in it?”
All encoding issues. All fixable once you know what to look at.
The hierarchy: characters → code points → bytes
Characters
What humans see: A, é, 中, 🎉.
Code points
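Numbers assigned by Unicode, the standard that defines every character. In the simple case each character has one code point; some things that look like a single character (a flag emoji, or a letter followed by a combining accent) are sequences of several code points.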
- A → U+0041 (decimal 65)
- é → U+00E9 (decimal 233)
- 中 → U+4E2D (decimal 20013)
- 🎉 → U+1F389 (decimal 127881)
Unicode currently has ~155,000 assigned code points covering virtually every writing system, plus emoji.
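To check a code point yourself, Python's built-in `ord()` and `chr()` convert between a character and its number. A small sketch using the four example characters from this section:

```python
# Map each example character to its Unicode code point.
code_points = {ch: ord(ch) for ch in ["A", "é", "中", "🎉"]}

for ch, cp in code_points.items():
    print(f"{ch} -> U+{cp:04X} (decimal {cp})")

# chr() goes the other way: code point to character.
party = chr(0x1F389)
```

Note that nothing here involves bytes yet; code points are abstract numbers until an encoding is chosen.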
Bytes
Code points have to be stored as bytes (0–255). An encoding is the algorithm that turns code points into bytes (and vice versa).
The same code points can be encoded different ways. UTF-8 is one encoding; UTF-16 is another; Latin-1 covers only some code points.
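As an illustration, the sketch below encodes the same code point (U+00E9) three ways, then shows that Latin-1 has no byte for a code point outside its range. The variable names are only for this example.

```python
text = "é"  # a single code point, U+00E9

as_utf8 = text.encode("utf-8")       # c3 a9
as_utf16 = text.encode("utf-16-le")  # e9 00
as_latin1 = text.encode("latin-1")   # e9
print(as_utf8.hex(" "), "|", as_utf16.hex(" "), "|", as_latin1.hex(" "))

# Latin-1 only covers U+0000 to U+00FF, so U+4E2D cannot be encoded at all.
try:
    "中".encode("latin-1")
    latin1_can_encode_cjk = True
except UnicodeEncodeError:
    latin1_can_encode_cjk = False
```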
UTF-8 — how it works
UTF-8 is a clever variable-length encoding:
| Code point range | Bytes used |
|---|---|
| U+0000 – U+007F (ASCII) | 1 byte |
| U+0080 – U+07FF | 2 bytes |
| U+0800 – U+FFFF | 3 bytes |
| U+10000 – U+10FFFF (emoji, rare scripts) | 4 bytes |
ASCII characters (A–Z, digits, basic punctuation) take 1 byte each. Most European accented letters take 2. Most Asian scripts take 3. Most emoji take 4.
This means English text in UTF-8 is identical to ASCII — backwards-compatible. A program written in 1990 expecting ASCII can still read most English UTF-8 text without issue.
The variable length is also what saves space: if your file is mostly English, you mostly pay 1 byte per character. Only the special characters cost more.
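The table can be turned into code. The sketch below encodes one code point by hand and compares the result with Python's built-in encoder. `utf8_encode_code_point` is a hypothetical helper written for this page; it skips validation (surrogates, values above U+10FFFF) that a real encoder performs.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand, following the ranges in the table."""
    if cp <= 0x7F:       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in ["A", "é", "中", "🎉"]:
    by_hand = utf8_encode_code_point(ord(ch))
    assert by_hand == ch.encode("utf-8")  # matches the built-in encoder
    print(ch, len(by_hand), by_hand.hex(" "))
```

The leading bits of the first byte tell a decoder how many bytes follow, which is what lets 1-byte and 4-byte characters share one stream.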
ASCII — the granddaddy
ASCII (American Standard Code for Information Interchange) is the original character encoding, dating from the 1960s. It defines 128 characters: digits 0–9, English letters A–Z (uppercase and lowercase), basic punctuation, and control characters.
ASCII uses 1 byte (well, 7 bits) per character. It’s tiny. It only handles English.
UTF-8 is intentionally designed so that valid ASCII is also valid UTF-8. This is why the transition has been so smooth — every English text file ever made already works in UTF-8.
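A quick way to convince yourself of the compatibility claim, as a sketch with made-up sample strings:

```python
english = "Hello, world!"

# For pure-ASCII text the two encodings produce identical bytes.
same_bytes = english.encode("ascii") == english.encode("utf-8")

# Bytes written by an ASCII-only program decode as UTF-8 unchanged.
legacy_bytes = b"plain ASCII text"
decoded = legacy_bytes.decode("utf-8")

# Every byte of a non-ASCII character is 0x80 or above, so it can never
# be mistaken for an ASCII character.
high_bytes_only = all(b >= 0x80 for b in "é中🎉".encode("utf-8"))

print(same_bytes, decoded, high_bytes_only)
```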
Encodings you’ll occasionally encounter
Beyond UTF-8 and ASCII:
| Encoding | Story |
|---|---|
| Latin-1 (ISO-8859-1) | Single-byte encoding for Western European languages. Common in old web pages and emails. |
| Windows-1252 | Microsoft’s Latin-1 variant. Still common in older Windows-generated text files. |
| UTF-16 | Two-byte (sometimes 4) encoding. Used internally by Windows, Java, JavaScript strings. Wasteful for ASCII; rarely chosen for files. |
| Shift-JIS / EUC-JP / EUC-KR | Asian-language encodings, mostly legacy now. |
| Big5 | Traditional Chinese, mostly legacy. |
When you see weird characters, the file is usually in Windows-1252, Latin-1, or UTF-16 and being read as UTF-8 (or vice versa).
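One common heuristic follows from that observation: try strict UTF-8 first, and only fall back to a legacy encoding if that fails. The sketch below is an assumption-laden illustration, not a real detector: `decode_with_fallback` is a hypothetical helper, and the choice of Windows-1252 as the fallback is a guess that suits Western European text only.

```python
def decode_with_fallback(data: bytes) -> tuple[str, str]:
    """Return (text, encoding_used) using a simple three-step guess."""
    # A UTF-16 byte order mark is a strong signal.
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return data.decode("utf-16"), "utf-16"
    try:
        # Legacy-encoded text with accents is rarely valid UTF-8 by accident.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; a few byte values are undefined in cp1252,
        # so replace them rather than raise.
        return data.decode("cp1252", errors="replace"), "cp1252"


print(decode_with_fallback(b"caf\xc3\xa9"))  # ('café', 'utf-8')
print(decode_with_fallback(b"caf\xe9"))      # ('café', 'cp1252')
```

For files where the guess matters, a dedicated detection library or simply asking whoever produced the file is more reliable.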
The BOM (Byte Order Mark)
Some UTF files start with a few “marker” bytes (the BOM) that say “this is UTF-X, byte order Y.” UTF-8 doesn’t need a BOM (it has no byte-order ambiguity), but Windows applications often add one anyway.
The UTF-8 BOM is the bytes EF BB BF at the start of the file. Most tools handle it transparently. But some don’t:
- Bash scripts: a BOM at the top breaks the shebang line — script won’t run
- JSON files: a BOM may make JSON parsers fail
- CSV files: a BOM may show up as a stray character (often displayed as `ï»¿`) in the first cell
Best practice: save UTF-8 files without the BOM unless you have a specific reason to need it.
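In PowerShell 6 or later (the utf8NoBOM value is not available in Windows PowerShell 5.1, where -Encoding utf8 always writes a BOM):
(Get-Content file.txt) | Set-Content file.txt -Encoding utf8NoBOM
VS Code shows “UTF-8 with BOM” vs “UTF-8” in the status bar at the bottom. You can convert via Command Palette → “Change File Encoding.”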
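If you read such files from code, the BOM can be handled at decode time. A minimal Python sketch with a made-up CSV header; `utf-8-sig` is Python's name for "UTF-8, and drop a leading BOM if there is one":

```python
import codecs

with_bom = codecs.BOM_UTF8 + "name,city\n".encode("utf-8")  # EF BB BF + text

# Plain "utf-8" keeps the BOM as an invisible U+FEFF at the start of the text.
naive = with_bom.decode("utf-8")

# "utf-8-sig" strips a leading BOM if present and is harmless if it is absent.
clean = with_bom.decode("utf-8-sig")

print(repr(naive[0]))  # '\ufeff'
print(repr(clean))     # 'name,city\n'
```

Reading with `utf-8-sig` and writing with plain `utf-8` is a simple way to accept BOM-prefixed input without producing it.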
A concrete example: round-trip success and failure
Success:
1. You write "café" in VS Code (UTF-8 by default)
2. Saved bytes: 63 61 66 C3 A9 (5 bytes — "caf" is 3 bytes, "é" is 2 in UTF-8)
3. Git stores the bytes exactly
4. You push to GitHub; another developer pulls
5. Their VS Code reads the file as UTF-8
6. Displays "café" correctly
Failure:
1. You write "café" in an old Windows program (Windows-1252 by default)
2. Saved bytes: 63 61 66 E9 (4 bytes — "é" is 1 byte in Win-1252)
3. Email it to someone with a modern system that defaults to UTF-8
4. Their email client reads 63 61 66 E9 as UTF-8
5. Sees the bytes 63 61 66 (still "caf"), then E9 — which isn't valid UTF-8
6. Displays "caf�" or "caf?" or other garbage depending on fallback behavior
This is the core of every encoding bug: producer and consumer disagree about what encoding the bytes are in.
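The failure case can be reproduced directly. This sketch assumes the sender used Windows-1252 and the receiver decodes as UTF-8, first strictly and then with Python's `errors="replace"` fallback:

```python
sent = "café".encode("cp1252")  # 63 61 66 e9

# A strict UTF-8 decoder rejects the lone E9 byte.
try:
    sent.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    failed_at = exc.start  # byte offset of the invalid sequence

# A lenient decoder substitutes U+FFFD, the replacement character.
lenient = sent.decode("utf-8", errors="replace")

print(strict_ok, failed_at, lenient)  # False 3 caf�
```

Once the replacement character has been written back to disk, the original byte is gone, so fix the decoding step rather than the output.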
How to fix encoding bugs
When you encounter mojibake:
Step 1: Identify what encoding the file actually is
Open in VS Code; the status bar shows the detected encoding. Or use a tool:
# On Linux (on macOS, use file -I instead)
file -i mystery.txt
Step 2: Open in the correct encoding
In VS Code: Command Palette → “Reopen with Encoding” → pick the right one. The text should display correctly.
Step 3: Re-save as UTF-8
Command Palette → “Save with Encoding” → UTF-8 (no BOM). Now the file is correctly encoded for the modern world.
Step 4: Source-control the fix
Commit the conversion. Future readers won’t have the problem.
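For many files, steps 2 and 3 can be scripted. The sketch below assumes you already know the source encoding from step 1; `convert_to_utf8` is a hypothetical helper, and the temporary file only exists to make the example self-contained.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str) -> None:
    """Re-save a text file as UTF-8 (no BOM), given its real encoding."""
    text = path.read_text(encoding=source_encoding)
    path.write_text(text, encoding="utf-8")  # Python's "utf-8" writes no BOM


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes("café".encode("cp1252"))  # simulate an old Windows file

    convert_to_utf8(legacy, "cp1252")
    converted = legacy.read_bytes()

print(converted.hex(" "))  # 63 61 66 c3 a9
```

Passing the wrong `source_encoding` will silently bake mojibake into the file, so keep a backup or rely on version control until you have checked the result.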
Encodings in different contexts
HTML
<meta charset="utf-8">
Tells the browser the page is UTF-8. Modern Next.js / framework projects include this by default. Without it, the browser guesses, sometimes wrong.
HTTP headers
The server can specify:
Content-Type: text/html; charset=utf-8
This overrides the HTML meta tag.
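On the consuming side, a script has to read that header before decoding the body. A sketch using the standard library's header parser; `charset_from_content_type` is a hypothetical helper, and falling back to UTF-8 when no charset is given is an assumption of this example, not a rule from any spec.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the lowercased charset, or None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=utf-8"))       # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # utf-8
```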
Databases
Postgres / Supabase default to UTF-8. You generally don’t have to think about it.
MySQL has a complicated history with encodings: its legacy utf8 character set is really a 3-byte subset of UTF-8 that can’t store emoji, and utf8mb4 is the real UTF-8. MySQL 8.0 defaults to utf8mb4, but the legacy still bites occasionally.
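The 3-byte limit maps directly onto code point ranges: anything above U+FFFF needs 4 bytes in UTF-8. The sketch below is a hypothetical pre-check written for this page; it does not talk to MySQL, it only tests whether a string would fit in a 3-byte-per-character column.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """True if every character encodes to at most 3 UTF-8 bytes (up to U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_three_byte_utf8("café 中"))   # True
print(fits_in_three_byte_utf8("party 🎉"))  # False: U+1F389 needs 4 bytes
```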
JSON
JSON exchanged between systems is required to be UTF-8 (RFC 8259; older versions of the spec also allowed UTF-16 and UTF-32, but UTF-8 is what everyone uses in practice). If you generate JSON, use UTF-8.
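In Python's `json` module this looks roughly as follows. The payload is made up; the last part shows how a BOM left in already-decoded text is rejected by the parser.

```python
import json

payload = {"name": "café", "emoji": "🎉"}

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
encoded = json.dumps(payload, ensure_ascii=False).encode("utf-8")

# json.loads accepts UTF-8 bytes directly.
decoded = json.loads(encoded)

# Text that still starts with U+FEFF (a decoded BOM) fails to parse.
try:
    json.loads("\ufeff" + json.dumps(payload))
    bom_rejected = False
except json.JSONDecodeError:
    bom_rejected = True

print(decoded == payload, bom_rejected)  # True True
```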
Source code
Almost all programming languages assume source files are UTF-8. Modern editors save UTF-8 by default. The few that don’t (legacy Windows tools) cause bugs.
Environment variables
.env files are plain text. UTF-8 is standard. Don’t put unusual characters in env var values unless necessary.
Common gotchas
- The BOM breaks shebang lines. A `.sh` script starting with `EF BB BF #!/bin/bash` won’t run on Linux: the file no longer begins with `#!`, so the shebang isn’t recognized. Save shell scripts as UTF-8 without BOM.
- “It works in my editor but breaks in production.” Usually your editor detects the encoding and reopens files correctly even if they’re stored wrong, while the production tool reads the bytes directly. Save the file correctly.
- Default encoding varies by OS. Windows Notepad in older versions defaulted to UTF-16 or Windows-1252. Modern Notepad defaults to UTF-8 (since around 2019). Older files in your archive may not be UTF-8.
- PowerShell 5.1 reads BOM-less files using the system’s legacy code page (Windows-1252 on most Western systems) by default. Use `-Encoding utf8` explicitly with `Out-File` and `Get-Content` if cross-platform compatibility matters, and note that in 5.1 writing with `-Encoding utf8` adds a BOM.
- String length isn’t byte length. `"café".length` in JavaScript is 4 (characters), but the UTF-8 byte length is 5. For database column sizing, file size, or byte-level operations, count bytes.
- JavaScript strings are UTF-16 internally. A `string.length` of 1 doesn’t mean 1 byte or 1 character; it means 1 UTF-16 code unit. For emoji that need 2 code units, `"🎉".length === 2`. Use `[...str].length` to count code points, which is closer to what a reader sees as characters.
- Surrogate pairs in JavaScript. Some code points (emoji, rare characters) use 2 UTF-16 units. Slicing them naïvely produces garbage.
- CSV exports from Excel. Excel may save CSVs in Windows-1252 or with a BOM. If your CSV import is broken, suspect encoding before suspecting the code.
- Non-Latin URLs. URLs with non-ASCII characters get percent-encoded (`%E2%9C%93` for ✓). Browsers handle this transparently; sometimes scripts don’t.
- Encoding declaration mismatch. A file says `<meta charset="iso-8859-1">` at the top but is actually UTF-8, so the browser interprets it wrong. Make sure declarations match reality.
- Zero-width characters are real characters. Strings can contain invisible characters (U+200B zero-width space, U+200C zero-width non-joiner). They cause weirdness in matching, especially in things like API keys pasted from rich-text editors. If a string “looks right” but doesn’t match, check for hidden characters.
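Two of these gotchas, the different meanings of "length" and hidden zero-width characters, are easy to check from code. The sketch below uses Python rather than JavaScript, so the UTF-16 count is computed by encoding. Both helpers are hypothetical, and the token string is made up.

```python
import unicodedata


def lengths(text: str) -> dict[str, int]:
    """Count the same string three ways."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; this matches JavaScript's .length.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


def hidden_characters(text: str) -> list[tuple[int, str]]:
    """List (index, name) for invisible format characters (category Cf)."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


print(lengths("café"))  # {'code_points': 4, 'utf8_bytes': 5, 'utf16_units': 4}
print(lengths("🎉"))    # {'code_points': 1, 'utf8_bytes': 4, 'utf16_units': 2}
print(hidden_characters("xoxb-123\u200b"))  # [(8, 'ZERO WIDTH SPACE')]
```

Category `Cf` also covers the BOM character (U+FEFF) and the zero-width joiner used inside some emoji, so treat a hit as something to inspect rather than something to strip blindly.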
See also
- Files and folders 🟩 — text encoding affects file contents
- HTML 🟩 — `<meta charset="utf-8">` is part of every page
- JavaScript 🟩 — string handling and surrogate pairs
- What is a computer? 🟩 — bytes vs characters
- command line 🟩 — shell encoding pitfalls
- Glossary: Unicode, UTF-8