Text encodings & UTF-8

Status: 🟩 COMPLETE
Last updated: 2026-06-19
Plain-English tagline: Why “café” sometimes shows up as “café”, and the one rule (use UTF-8 everywhere) that makes the problem go away.


In plain English

Computers store everything as numbers. Text isn’t an exception — every character (A, é, 🎉) is represented as a number under the hood. The encoding is the agreement between programs about which number means which character.

If you write text in one encoding and another program reads it expecting a different encoding, you get mojibake — garbled output where the right bytes are interpreted as the wrong characters. The classic symptom: text containing accents or non-Latin characters comes out as nonsense.

“café” → saved correctly → opened with wrong encoding → “café”
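The same effect can be reproduced in a few lines of Python. This is a minimal sketch, assuming the "wrong" encoding is Windows-1252 (Python's `cp1252` codec); the variable names are only illustrative.

```python
# "café" with the é written as an escape, so the example does not depend
# on how this snippet itself happens to be encoded.
original = "caf\u00e9"

# Writer: encode the text as UTF-8 bytes.
stored = original.encode("utf-8")

# Reader: wrongly assume the bytes are Windows-1252.
garbled = stored.decode("cp1252")

print(garbled)  # the two UTF-8 bytes of é show up as two separate characters

# In this case the damage is reversible, because no bytes were lost.
repaired = garbled.encode("cp1252").decode("utf-8")
```

Undoing the wrong decode only works while every byte survives the trip; once a tool has replaced bytes with "?" or "�", the original text is gone.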

For ~30 years this was a real problem. Different OSes, programs, and languages used different encodings (ASCII, Latin-1, Windows-1252, Shift-JIS, Big5, etc.), and converting between them was a constant chore.

UTF-8 solved this. It’s an encoding that can represent every character in every writing system (over 150,000 of them, including emoji), is backwards-compatible with ASCII for English text, and is the de facto standard for the modern web and for almost all file formats.

The practical rule for 2026: use UTF-8 for everything. Every file. Every database column. Every HTTP response. Every config file. You’ll occasionally still hit legacy text in another encoding — but UTF-8 is the right choice for any new work.


Why it matters

You won’t think about encodings often. But when something goes wrong, the symptoms look like total chaos. Knowing the model lets you diagnose:

  • “Why does my CSV have weird characters?”
  • “Why does my user’s name show up wrong in the database?”
  • “Why does Claude Code’s output have ?? instead of curly quotes?”
  • “Why does my .env file say my Slack token has an invisible zero-width space (U+200B) in it?”

All encoding issues. All fixable once you know what to look at.


The hierarchy: characters → code points → bytes

Characters

What humans see: A, é, 中, 🎉.

Code points

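Numbers assigned by Unicode, the standard that defines every character. Each character in the standard has one code point. (Some things that look like a single character on screen, such as certain emoji or letters with combining accents, are sequences of several code points.)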

  • A → U+0041 (decimal 65)
  • é → U+00E9 (decimal 233)
  • 中 → U+4E2D (decimal 20013)
  • 🎉 → U+1F389 (decimal 127881)

Unicode currently has ~155,000 assigned code points covering virtually every writing system, plus emoji.
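To check the mapping yourself, Python's built-in `ord()` and `chr()` convert between a character and its code point. A small sketch using the four examples above (characters are written as escapes so the snippet is safe in any encoding):

```python
# ord() returns a character's code point; chr() goes the other way.
samples = {
    "A": 0x0041,
    "\u00e9": 0x00E9,       # é
    "\u4e2d": 0x4E2D,       # 中
    "\U0001F389": 0x1F389,  # 🎉
}

for char, expected in samples.items():
    code_point = ord(char)
    print(f"U+{code_point:04X} (decimal {code_point})")
    assert code_point == expected
    assert chr(code_point) == char
```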

Bytes

Code points have to be stored as bytes (0–255). An encoding is the algorithm that turns code points into bytes (and vice versa).

The same code points can be encoded different ways. UTF-8 is one encoding; UTF-16 is another; Latin-1 covers only some code points.
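A quick way to see this is to encode one code point several ways. The sketch below uses Python's built-in codecs; `utf-16-le` is chosen (as an assumption, to keep the output BOM-free) over plain `utf-16`.

```python
e_acute = "\u00e9"  # é, code point U+00E9

utf8_bytes = e_acute.encode("utf-8")       # 2 bytes: C3 A9
utf16_bytes = e_acute.encode("utf-16-le")  # 2 bytes: E9 00
latin1_bytes = e_acute.encode("latin-1")   # 1 byte:  E9

for name, data in [("utf-8", utf8_bytes),
                   ("utf-16-le", utf16_bytes),
                   ("latin-1", latin1_bytes)]:
    print(name, data.hex(" "))

# Latin-1 only covers U+0000 to U+00FF, so 中 (U+4E2D) cannot be encoded.
try:
    "\u4e2d".encode("latin-1")
    latin1_can_encode_cjk = True
except UnicodeEncodeError:
    latin1_can_encode_cjk = False
```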


UTF-8 — how it works

UTF-8 is a clever variable-length encoding:

Code point range | Bytes used
U+0000 – U+007F (ASCII) | 1 byte
U+0080 – U+07FF | 2 bytes
U+0800 – U+FFFF | 3 bytes
U+10000 – U+10FFFF (emoji, rare scripts) | 4 bytes

ASCII characters (A–Z, digits, basic punctuation) take 1 byte each. Accented Latin letters, Greek, Cyrillic, Hebrew, and Arabic take 2. Most other scripts, including Chinese, Japanese, and Korean, take 3. Most emoji take 4.

This means English text in UTF-8 is identical to ASCII — backwards-compatible. A program written in 1990 expecting ASCII can still read most English UTF-8 text without issue.

The variable length is also what saves space: if your file is mostly English, you mostly pay 1 byte per character. Only the special characters cost more.
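For the curious, here is roughly what the encoder does with each range in the table. This is an illustrative sketch, not something to use in real code (`str.encode("utf-8")` already does this); the function name `utf8_encode` is made up, and it skips the rule that surrogate code points (U+D800 to U+DFFF) must be rejected. The leading bits of the first byte tell a reader how many bytes follow, and every continuation byte starts with the bits 10.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative; surrogates not rejected)."""
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point is outside the Unicode range")


# Compare against Python's own encoder for one code point per range.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F389):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```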


ASCII — the granddaddy

ASCII (American Standard Code for Information Interchange) is the original character encoding, dating from the 1960s. It defines 128 characters: digits 0–9, English letters A–Z (uppercase and lowercase), basic punctuation, and control characters.

ASCII uses 1 byte (well, 7 bits) per character. It’s tiny. It only handles English.

UTF-8 is intentionally designed so that valid ASCII is also valid UTF-8. This is why the transition has been so smooth: every pure-ASCII text file ever made is already valid UTF-8, byte for byte.
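The compatibility claim is easy to verify. A small sketch (the sample strings are arbitrary):

```python
text = "Hello, world! 123"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure-ASCII text the two encodings produce identical bytes,
# and every byte is below 128.
same_bytes = ascii_bytes == utf8_bytes
all_below_128 = all(byte < 0x80 for byte in utf8_bytes)

# A non-ASCII character is where the two part ways: ASCII has no é.
try:
    "caf\u00e9".encode("ascii")
    ascii_handles_accent = True
except UnicodeEncodeError:
    ascii_handles_accent = False
```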


Encodings you’ll occasionally encounter

Beyond UTF-8 and ASCII:

Encoding | Story
Latin-1 (ISO-8859-1) | Single-byte encoding for Western European languages. Common in old web pages and emails.
Windows-1252 | Microsoft’s Latin-1 variant. Still common in older Windows-generated text files.
UTF-16 | Two-byte (sometimes 4) encoding. Used internally by Windows, Java, JavaScript strings. Wasteful for ASCII; rarely chosen for files.
Shift-JIS / EUC-JP / EUC-KR | Asian-language encodings, mostly legacy now.
Big5 | Traditional Chinese, mostly legacy.

When you see weird characters, the file is usually in Windows-1252, Latin-1, or UTF-16 and being read as UTF-8 (or vice versa).


The BOM (Byte Order Mark)

Some UTF files start with a few “marker” bytes (the BOM) that say “this is UTF-X, byte order Y.” UTF-8 doesn’t need a BOM (it has no byte-order ambiguity), but Windows applications often add one anyway.

The UTF-8 BOM is the bytes EF BB BF at the start of the file. Most tools handle it transparently. But some don’t:

  • Bash scripts: a BOM at the top breaks the shebang line — script won’t run
  • JSON files: a BOM may make JSON parsers fail
  • CSV files: a BOM may show up as stray characters (such as ï»¿) at the start of the first cell

Best practice: save UTF-8 files without the BOM unless you have a specific reason to need it.

In PowerShell 6 or later (Windows PowerShell 5.1 does not support the utf8NoBOM value):

(Get-Content file.txt) | Set-Content file.txt -Encoding utf8NoBOM

VS Code shows “UTF-8 with BOM” vs “UTF-8” in the status bar at the bottom. You can convert via Command Palette → “Change File Encoding.”
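If you are reading files from code rather than from an editor, the BOM is easy to handle explicitly. The sketch below assumes Python, whose `utf-8-sig` codec strips a leading BOM when one is present and otherwise behaves like `utf-8`; the JSON payload is a made-up example, and the parser behaviour shown is that of CPython's `json` module.

```python
import codecs
import json

payload = codecs.BOM_UTF8 + b'{"ok": true}'  # EF BB BF followed by JSON

plain = payload.decode("utf-8")      # keeps the BOM as the character U+FEFF
clean = payload.decode("utf-8-sig")  # strips a leading BOM if there is one

# A JSON string that still starts with U+FEFF is rejected by json.loads.
try:
    json.loads(plain)
    bom_accepted = True
except json.JSONDecodeError:
    bom_accepted = False

parsed = json.loads(clean)
```

Reading with `utf-8-sig` and writing with `utf-8` is a simple way to accept files with or without a BOM while never producing one.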


A concrete example: round-trip success and failure

Success:

1. You write "café" in VS Code (UTF-8 by default)
2. Saved bytes: 63 61 66 C3 A9 (5 bytes — "caf" is 3 bytes, "é" is 2 in UTF-8)
3. Git stores the bytes exactly
4. You push to GitHub; another developer pulls
5. Their VS Code reads the file as UTF-8
6. Displays "café" correctly

Failure:

1. You write "café" in an old Windows program (Windows-1252 by default)
2. Saved bytes: 63 61 66 E9 (4 bytes — "é" is 1 byte in Win-1252)
3. Email it to someone with a modern system that defaults to UTF-8
4. Their email client reads 63 61 66 E9 as UTF-8
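5. Sees the bytes 63 61 66 (still "caf"), then E9, which in UTF-8 announces a 3-byte sequence that never arrives, so the input is invalid
6. Displays "caf�" (the replacement character) or drops the letter, depending on fallback behavior. The reverse mismatch, UTF-8 bytes read as Windows-1252, is what produces "café"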

This is the core of every encoding bug: producer and consumer disagree about what encoding the bytes are in.
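Both scenarios can be replayed in Python. A minimal sketch, assuming `cp1252` stands in for the old Windows program and a strict UTF-8 decoder stands in for the modern reader:

```python
text = "caf\u00e9"  # café

# Success: written as UTF-8, read as UTF-8.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Failure: written as Windows-1252, read as UTF-8.
cp1252_bytes = text.encode("cp1252")
try:
    cp1252_bytes.decode("utf-8")  # strict decoding refuses the lone E9
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient reader substitutes U+FFFD (the replacement character) instead.
lenient = cp1252_bytes.decode("utf-8", errors="replace")

print(utf8_bytes.hex(" "))
print(cp1252_bytes.hex(" "))
```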


How to fix encoding bugs

When you encounter mojibake:

Step 1: Identify what encoding the file actually is

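Open in VS Code; the status bar shows the encoding VS Code used to open the file. By default that is UTF-8 unless the file has a BOM, so treat it as a starting point rather than proof. Or use a tool: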

# On Linux
file -i mystery.txt
# On macOS (the flag is a capital I)
file -I mystery.txt
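If you need to make the same guess from code, a rough heuristic goes a long way: check for a BOM, then see whether the bytes are valid UTF-8, and only then fall back to a legacy encoding. The sketch below is exactly that, a heuristic. The function name `guess_encoding` is hypothetical, the Windows-1252 fallback is an assumption that suits Western European text, and UTF-32 is ignored.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Best-effort guess at the encoding of `data` (heuristic, not proof)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        # Random legacy bytes are very unlikely to be valid UTF-8 by accident.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything else is Windows-1252. This can be wrong.
        return "cp1252"


print(guess_encoding(b"caf\xc3\xa9"))  # valid UTF-8
print(guess_encoding(b"caf\xe9"))      # invalid as UTF-8, so the fallback applies
```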

Step 2: Open in the correct encoding

In VS Code: Command Palette → “Reopen with Encoding” → pick the right one. The text should display correctly.

Step 3: Re-save as UTF-8

Command Palette → “Save with Encoding” → UTF-8 (no BOM). Now the file is correctly encoded for the modern world.
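Steps 2 and 3 can also be scripted when there are many files to convert. A minimal sketch, assuming you already know the source encoding; `convert_to_utf8` and the file name are hypothetical. It works on raw bytes so that line endings are left untouched.

```python
import tempfile
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str) -> None:
    """Rewrite `path` in place as UTF-8 without a BOM."""
    raw = path.read_bytes()
    # Raises UnicodeDecodeError if the bytes are invalid in source_encoding.
    # A wrong guess does not always raise, so check the result by eye too.
    text = raw.decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file containing "café" saved as Windows-1252.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes(b"caf\xe9\r\n")
    convert_to_utf8(legacy, "cp1252")
    converted = legacy.read_bytes()
```

Run a conversion like this on a copy or on files under version control, so a wrong guess about the source encoding can be undone.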

Step 4: Source-control the fix

Commit the conversion. Future readers won’t have the problem.


Encodings in different contexts

HTML

<meta charset="utf-8">

Tells the browser the page is UTF-8. Modern Next.js / framework projects include this by default. Without it, the browser guesses, sometimes wrong.

HTTP headers

The server can specify:

Content-Type: text/html; charset=utf-8

This overrides the HTML meta tag.
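On the receiving side, a client should decode the body using the charset named in that header. A sketch using only Python's standard library; the helper name `charset_from_content_type` is hypothetical, and falling back to UTF-8 when no charset is given is an assumption, not a rule from any spec.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


body = "caf\u00e9".encode("utf-8")  # bytes as they arrive over the wire
charset = charset_from_content_type("text/html; charset=utf-8")
text = body.decode(charset)
```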

Databases

Postgres / Supabase default to UTF-8. You generally don’t have to think about it.

MySQL has a complicated history with encodings: its original utf8 character set (now also called utf8mb3) stores at most 3 bytes per character, so it can’t hold emoji; utf8mb4 is the real UTF-8. MySQL 8.0 made utf8mb4 the default, but the legacy still bites occasionally.
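If you are stuck with a 3-byte column, you can at least detect text that will not fit before the database rejects or mangles it. A small sketch; the function name `fits_three_byte_utf8` is hypothetical.

```python
def fits_three_byte_utf8(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_three_byte_utf8("caf\u00e9 \u4e2d"))    # accents and CJK fit
print(fits_three_byte_utf8("party \U0001F389"))    # 🎉 needs 4 bytes
```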

JSON

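The current JSON spec (RFC 8259) requires UTF-8 for JSON exchanged between systems; older versions of the spec also allowed UTF-16 and UTF-32, but UTF-8 is what everyone uses in practice. If you generate JSON, use UTF-8.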
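In Python, for example, there are two safe ways to produce JSON. A sketch with a made-up record: by default `json.dumps` escapes everything outside ASCII, and with `ensure_ascii=False` it keeps the characters as they are, in which case you must encode the result as UTF-8 yourself.

```python
import json

record = {"name": "caf\u00e9", "mood": "\U0001F389"}

# Default: non-ASCII characters become \uXXXX escapes, so the output is pure ASCII.
escaped = json.dumps(record)

# Alternative: keep é and 🎉 as-is, then encode the text as UTF-8 bytes.
readable = json.dumps(record, ensure_ascii=False)
wire_bytes = readable.encode("utf-8")

# Both forms decode back to the same data.
decoded = json.loads(wire_bytes.decode("utf-8"))
```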

Source code

Almost all programming languages assume source files are UTF-8. Modern editors save UTF-8 by default. The few that don’t (legacy Windows tools) cause bugs.

Environment variables

.env files are plain text. UTF-8 is standard. Don’t put unusual characters in env var values unless necessary.


Common gotchas

  • The BOM breaks shebang lines. A .sh script starting with EF BB BF #!/bin/bash won’t run on Linux — the shell doesn’t recognize the BOM. Save shell scripts as UTF-8 without BOM.

  • “It works in my editor but breaks in production.” Usually: your editor reopens files with the right encoding even if they’re stored wrong; the production tool reads bytes directly. Save the file correctly.

  • Default encoding varies by OS. Older versions of Windows Notepad defaulted to the system “ANSI” code page (Windows-1252 on Western European and US systems) and offered UTF-16 under the name “Unicode”. Modern Notepad defaults to UTF-8 (since around 2019). Older files in your archive may not be UTF-8.

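  • PowerShell 5.1 reads BOM-less files using the system ANSI code page (Windows-1252 on many systems) by default, and Out-File writes UTF-16. Use -Encoding utf8 explicitly with Out-File and Get-Content if cross-platform compatibility matters (note that in 5.1 this writes a BOM).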

  • String length isn’t byte length. "café".length in JavaScript is 4, but the same string takes 5 bytes in UTF-8. For database column sizing, file size, or byte-level operations, count bytes.

  • JavaScript strings are UTF-16 internally. A string.length of 1 doesn’t mean 1 byte or 1 character; it means 1 UTF-16 code unit. For emoji that need 2 code units, "🎉".length === 2. Use [...str].length to count code points, which is closer to a character count (though some emoji and accented letters are built from several code points).

  • Surrogate pairs in JavaScript. Some code points (emoji, rare characters) use 2 UTF-16 units. Slicing them naïvely produces garbage.

  • CSV exports from Excel. Excel may save CSVs in Windows-1252 or with a BOM. If your CSV import is broken, suspect encoding before suspecting the code.

  • Non-Latin URLs. URLs with non-ASCII characters get percent-encoded (%E2%9C%93 for ✓). Browsers handle this transparently; sometimes scripts don’t.

  • Encoding declaration mismatch. A file says <meta charset="iso-8859-1"> at the top but is actually UTF-8 — browser interprets it wrong. Make sure declarations match reality.

  • Zero-width characters are real characters. Strings can contain invisible characters (U+200B zero-width space, U+200C zero-width non-joiner). They cause weirdness in matching, especially in things like API keys pasted from rich-text editors. If a string “looks right” but doesn’t match, check for hidden characters.
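Several of the gotchas above come down to counting or inspecting characters, and they are easy to check from code. The sketch below uses Python rather than JavaScript: `utf16_units` mimics what JavaScript's .length reports, and `find_invisible` flags characters in Unicode's "format" category (Cf), which includes the zero-width space. Both helper names and the sample token are made up for illustration.

```python
import unicodedata
from urllib.parse import quote


def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


def find_invisible(text: str) -> list[tuple[int, str]]:
    """Return (index, name) for each format-category (Cf) character."""
    return [(i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]


cafe = "caf\u00e9"     # café
party = "\U0001F389"   # 🎉

# Three different "lengths" for the same strings.
print(len(cafe), len(cafe.encode("utf-8")))  # code points vs UTF-8 bytes
print(len(party), utf16_units(party), len(party.encode("utf-8")))

# A pasted value with a hidden zero-width space after "token".
hidden = find_invisible("token\u200b123")

# Percent-encoding a non-ASCII character for use in a URL.
encoded_check = quote("\u2713")  # ✓
```

Note that Python's `len()` counts code points, so it behaves like `[...str].length` in JavaScript rather than like `.length`.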


See also

Sources