Text encoding basics

When text appears garbled, the cause is usually an encoding mismatch. This page outlines UTF-8 vs. UTF-16, the BOM, and common pitfalls (such as CSV encodings).

Key concepts 🌐

A character set maps characters to code points; an encoding turns code points into bytes.
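As a minimal sketch of that distinction, the Python snippet below takes one character, reads its code point, and encodes it three ways. The variable names are illustrative only; the point is that the code point stays the same while the bytes depend on the encoding chosen.

```python
# One character, one code point, several possible byte sequences.
ch = "\u00e9"  # é (LATIN SMALL LETTER E WITH ACUTE)

code_point = ord(ch)                    # the abstract number: 0xE9
utf8_bytes = ch.encode("utf-8")         # two bytes in UTF-8
utf16_bytes = ch.encode("utf-16-le")    # one 2-byte unit, little-endian, no BOM
latin1_bytes = ch.encode("latin-1")     # a single byte in Latin-1

print(hex(code_point), utf8_bytes, utf16_bytes, latin1_bytes)
```

Decoding any of these byte sequences with the wrong encoding name is what produces garbled text.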

UTF-8 vs. UTF-16

Both encode Unicode; UTF-8 is variable-length, ASCII-compatible, and the Web default.
UTF-16 uses 2-byte units (4 bytes for surrogate pairs); it is larger than UTF-8 for ASCII-heavy text, but can be smaller for text that is mostly CJK.
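To make the size comparison concrete, here is a small sketch that counts encoded bytes for a few sample strings. The helper name `encoded_sizes` is hypothetical, and `utf-16-le` is used so that no BOM is included in the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of `text` in UTF-8 and in BOM-less UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }

ascii_sizes = encoded_sizes("hello")     # 1 byte per char in UTF-8, 2 in UTF-16
cjk_sizes = encoded_sizes("日本語")       # 3 bytes per char in UTF-8, 2 in UTF-16
emoji_sizes = encoded_sizes("😀")        # 4 bytes in both (surrogate pair in UTF-16)

print(ascii_sizes, cjk_sizes, emoji_sizes)
```

Which encoding is more compact therefore depends on the text, not just on the encoding.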

BOM (Byte Order Mark)

UTF-8 BOM is 0xEF 0xBB 0xBF; optional but some tools expect it.
UTF-16 relies on a BOM to signal endianness (BE/LE) unless the byte order is declared explicitly, e.g., as UTF-16BE or UTF-16LE.
BOM can break some scripts/configs; omit unless you need it.
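The following sketch shows one way to deal with BOMs in Python. `strip_utf8_bom` is a hypothetical helper; `codecs.BOM_UTF8` and the `utf-8-sig` and `utf-16` codecs are part of the standard library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise return data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

with_bom = codecs.BOM_UTF8 + "key=value".encode("utf-8")

# Plain "utf-8" keeps the BOM as a U+FEFF character, which can confuse parsers.
kept = with_bom.decode("utf-8")
# "utf-8-sig" removes a leading BOM and also accepts BOM-less input.
removed = with_bom.decode("utf-8-sig")

# The generic "utf-16" codec reads the BOM to pick the byte order, then drops it.
utf16_be = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
utf16_le = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")
decoded_be = utf16_be.decode("utf-16")
decoded_le = utf16_le.decode("utf-16")
```

Reading with `utf-8-sig` is a reasonable default when the input may or may not carry a BOM.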

Common mojibake patterns 🔄

Garbled text appears when the reader assumes the wrong encoding.

Typical causes

File saved as UTF-8 but opened as Shift_JIS/CP932 (common with some Excel defaults).
Tools that expect a BOM misdetect BOM-less UTF-8, while tools that do not expect one treat the BOM bytes as data.
HTTP responses declaring the wrong charset (e.g., the header says Shift_JIS but the body is UTF-8).
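These mismatches are easy to reproduce. The sketch below decodes UTF-8 bytes with the wrong codec and shows two outcomes: silent garbling, or a decode error. Latin-1 is used for the first case purely as a simple illustration (it is an assumption, not one of the encodings named above); `errors="replace"` is needed for CP932 because some UTF-8 byte sequences are not valid CP932.

```python
# Case 1: UTF-8 bytes read as Latin-1 decode "successfully" but look wrong.
original = "\u00e9"  # é
mojibake = original.encode("utf-8").decode("latin-1")
# This kind of damage is reversible if nothing else touched the text.
repaired = mojibake.encode("latin-1").decode("utf-8")

# Case 2: UTF-8 bytes read as CP932 (Shift_JIS variant); invalid sequences
# are replaced with U+FFFD so the call does not raise.
garbled = "日本語".encode("utf-8").decode("cp932", errors="replace")

# Case 3: CP932 bytes read as strict UTF-8 fail loudly instead of garbling.
try:
    "日本語".encode("cp932").decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(mojibake), repr(garbled), strict_failed)
```

A loud failure is often preferable, since silent garbling can be written back to storage and become permanent.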

Fixes

Standardize on UTF-8 (no BOM) end-to-end where possible.
If you must use Shift_JIS variants, convert explicitly and tell consumers.
In HTTP, set the Content-Type header with the correct charset (e.g., Content-Type: text/html; charset=utf-8); in HTML, include <meta charset="utf-8">.
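A rough sketch of the explicit-conversion and HTTP points follows. Both helper names (`transcode`, `content_type_header`) are hypothetical, and the source encoding has to be known in advance; nothing here detects it.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes from a known source encoding to the target encoding.

    Decoding is strict, so bytes that are not valid in `source` raise
    UnicodeDecodeError rather than being silently garbled.
    """
    return data.decode(source).encode(target)

def content_type_header(mime_type: str, charset: str = "utf-8") -> tuple:
    """Build a (name, value) header pair that states the charset explicitly."""
    return ("Content-Type", f"{mime_type}; charset={charset}")

legacy_bytes = "日本語".encode("cp932")        # stand-in for a Shift_JIS-family file
utf8_bytes = transcode(legacy_bytes, "cp932")  # explicit, one-way conversion
header = content_type_header("text/html")

print(utf8_bytes.decode("utf-8"), header)
```

Converting once at the boundary and declaring the result keeps the rest of the pipeline UTF-8 only.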

CSV pitfalls 📄

CSV defaults vary by tool, so encodings often mismatch.

Encoding gotchas

Some localized Excel versions default to CP932/Shift_JIS; UTF-8 CSV may garble.
Modern Excel detects BOM-ed UTF-8; without BOM it may guess wrong.
Unix tools and data platforms expect UTF-8 (often BOM-less).

Recommended handling

Use UTF-8 (no BOM) internally.
For Excel users, provide BOM-ed UTF-8 or instructions to import with UTF-8.
Document the charset in specs, since CSV files cannot self-describe charset.
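One possible way to follow these recommendations in Python is sketched below: write BOM-less UTF-8 for internal use, add a BOM only for the Excel export, and read with a codec that tolerates both. The function names and the `for_excel` flag are assumptions made for this example.

```python
import csv
import io

def write_csv_bytes(rows, for_excel: bool = False) -> bytes:
    """Serialize rows to CSV bytes; prepend a UTF-8 BOM only for Excel exports."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buffer).writerows(rows)
    encoding = "utf-8-sig" if for_excel else "utf-8"
    return buffer.getvalue().encode(encoding)

def read_csv_bytes(data: bytes) -> list:
    """Parse CSV bytes as UTF-8, accepting input with or without a BOM."""
    text = data.decode("utf-8-sig")
    return list(csv.reader(io.StringIO(text, newline="")))

rows = [["name", "city"], ["太郎", "東京"]]
internal = write_csv_bytes(rows)                 # UTF-8, no BOM
excel = write_csv_bytes(rows, for_excel=True)    # same content with a BOM

print(read_csv_bytes(internal) == read_csv_bytes(excel))
```

Because the file itself still carries no charset label, the chosen encoding should be stated wherever the export is documented.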
