Text encoding basics
When text comes out garbled, the cause is usually an encoding mismatch. This page outlines UTF-8 vs. UTF-16, the BOM, and common pitfalls (such as CSV encodings).
Key concepts
A character set maps characters to code points; an encoding turns code points into bytes.
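As a small illustration of the distinction, the sketch below takes one character, prints its code point, and then serializes that same code point with three encodings. The choice of character and of Latin-1 and UTF-16LE as comparison encodings is ours, made only for illustration.

```python
# One character, one code point, several possible byte sequences.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)                    # the code point as an integer
utf8_bytes = ch.encode("utf-8")         # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")     # one byte in Latin-1
utf16le_bytes = ch.encode("utf-16-le")  # one 2-byte unit in UTF-16LE

print(f"U+{code_point:04X}")  # U+00E9
print(utf8_bytes)             # b'\xc3\xa9'
print(latin1_bytes)           # b'\xe9'
print(utf16le_bytes)          # b'\xe9\x00'
```

The code point stays the same in every case; only the bytes differ, which is why a reader has to know the encoding to get the character back.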
UTF-8 vs. UTF-16
- Both encode Unicode; UTF-8 is variable-length, ASCII-compatible, and the Web default.
- UTF-16 uses 2-byte code units (one unit for most characters, a 4-byte surrogate pair for the rest); it is larger than UTF-8 for ASCII-heavy text but can be smaller for CJK text.
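To make the size comparison concrete, here is a minimal sketch that measures a few sample strings in both encodings. The sample strings and the helper name `encoded_sizes` are illustrative assumptions; UTF-16LE is used so that no BOM is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "ascii": "hello",
    "japanese": "\u65e5\u672c\u8a9e",  # three kanji, all in the Basic Multilingual Plane
    "emoji": "\U0001F600",             # outside the BMP, needs a surrogate pair in UTF-16
}

for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: utf-8={utf8_len} utf-16={utf16_len}")
# ascii: utf-8=5 utf-16=10
# japanese: utf-8=9 utf-16=6
# emoji: utf-8=4 utf-16=4
```

ASCII text doubles in size under UTF-16, while the kanji sample is actually smaller, so which encoding is larger depends on the text.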
BOM (Byte Order Mark)
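- The UTF-8 BOM is the byte sequence EF BB BF; it is optional, but some tools expect it.
- UTF-16 uses a BOM (FE FF for big-endian, FF FE for little-endian) to signal byte order, unless the encoding is declared explicitly as UTF-16BE or UTF-16LE.
- A UTF-8 BOM can break scripts and config files (for example, a shebang line no longer starts the file); omit it unless a consumer requires it.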
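One way to handle input that may or may not carry a BOM is to check the leading bytes before decoding. The sketch below is an assumption-laden example, not a full detector: `decode_with_bom` is a hypothetical helper, it only recognizes the UTF-8 and UTF-16 BOMs (UTF-32 is deliberately out of scope), and it falls back to a caller-supplied default.

```python
import codecs

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode bytes, honoring a UTF-8 or UTF-16 BOM if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three BOM bytes so U+FEFF does not end up in the text.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic "utf-16" codec reads the BOM to pick the byte order.
        return data.decode("utf-16")
    return data.decode(default)

with_bom = codecs.BOM_UTF8 + b"abc"
print(repr(with_bom.decode("utf-8")))     # '\ufeffabc' (BOM kept as a character)
print(repr(decode_with_bom(with_bom)))    # 'abc'
print(repr(decode_with_bom(b"\xfe\xff\x00a")))  # 'a' (big-endian UTF-16)
```

Decoding with plain `utf-8` keeps the BOM as an invisible U+FEFF at the start of the string, which is the usual source of "BOM bytes treated as data" problems.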
Common mojibake patterns
Garbled text appears when the reader assumes the wrong encoding.
Typical causes
- File saved as UTF-8 but opened as Shift_JIS/CP932 (common with some Excel defaults).
- Tools that expect a BOM misread BOM-less UTF-8, while tools that do not expect one treat the BOM bytes as data.
- HTTP responses that declare the wrong charset (e.g., the header says Shift_JIS but the body is UTF-8).
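The UTF-8 vs. CP932 mismatch can be reproduced in a few lines. This is a sketch with an arbitrary sample string; `errors="replace"` is used in the first case only so that the demonstration does not stop at the first undecodable byte.

```python
text = "\u65e5\u672c\u8a9e"  # a short Japanese sample string

# Case 1: UTF-8 bytes read as CP932 decode into the wrong characters.
utf8_bytes = text.encode("utf-8")
garbled = utf8_bytes.decode("cp932", errors="replace")
print(ascii(garbled))  # escaped so it prints on any console

# Case 2: CP932 bytes read as strict UTF-8 fail outright.
cp932_bytes = text.encode("cp932")
try:
    cp932_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)  # True

# Decoding with the encoding that was actually used restores the text.
print(cp932_bytes.decode("cp932") == text)  # True
```

The two directions fail differently: one silently produces wrong characters, the other raises an error, so a missing exception is no proof that the encoding was right.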
Fixes
- Standardize on UTF-8 (no BOM) end-to-end where possible.
- If you must use Shift_JIS variants, convert explicitly and tell consumers.
- In HTTP, set the Content-Type header with the correct charset (e.g., Content-Type: text/html; charset=utf-8); in HTML, include <meta charset="utf-8">.
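Explicit conversion and explicit labeling can both be sketched briefly. The helpers `transcode` and `content_type` below are hypothetical names introduced for illustration; they assume the source encoding is already known, since it cannot be reliably guessed from the bytes alone.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from a known source encoding to the target encoding."""
    # Strict decoding: fail loudly rather than pass on garbled text.
    return data.decode(source).encode(target)

def content_type(mime: str, charset: str = "utf-8") -> str:
    """Build a Content-Type header value that names the charset explicitly."""
    return f"{mime}; charset={charset}"

legacy = "\u65e5\u672c\u8a9e".encode("cp932")  # bytes from a Shift_JIS-family source
converted = transcode(legacy, "cp932")
print(converted == "\u65e5\u672c\u8a9e".encode("utf-8"))  # True
print(content_type("text/html"))  # text/html; charset=utf-8
```

Converting once at the boundary and labeling the result keeps every later stage on UTF-8, so no downstream reader has to guess.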
CSV pitfalls
CSV defaults vary by tool, so encodings often mismatch.
Encoding gotchas
- Some localized Excel versions default to CP932/Shift_JIS; UTF-8 CSV may garble.
- Modern Excel recognizes UTF-8 that starts with a BOM; without a BOM it may guess the wrong encoding.
- Unix tools and data platforms expect UTF-8 (often BOM-less).
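On the reading side, a BOM in a CSV file typically shows up as a corrupted first header name. The sketch below uses made-up sample data and a hypothetical `read_rows` helper to show the effect, along with Python's `utf-8-sig` codec, which accepts UTF-8 input with or without a BOM.

```python
import csv
import io

def read_rows(data: bytes, encoding: str) -> list[dict]:
    """Decode CSV bytes with the given encoding and parse them into dict rows."""
    text = data.decode(encoding)
    return list(csv.DictReader(io.StringIO(text, newline="")))

# Sample CSV bytes that start with a UTF-8 BOM (EF BB BF).
raw = "\ufeffname,city\r\nTanaka,Tokyo\r\n".encode("utf-8")

naive = read_rows(raw, "utf-8")         # BOM survives as part of the first header
tolerant = read_rows(raw, "utf-8-sig")  # BOM is stripped if present

print("name" in naive[0])     # False: the key is actually '\ufeffname'
print(tolerant[0]["name"])    # Tanaka
print(read_rows(raw[3:], "utf-8-sig") == tolerant)  # True: BOM-less input also works
```

Reading with `utf-8-sig` is a convenient default for CSV files of unknown origin, as long as they are known to be UTF-8 in the first place.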
Recommended handling
- Use UTF-8 (no BOM) internally.
- For Excel users, provide UTF-8 with a BOM, or instructions for importing the file as UTF-8.
- Document the charset in specs, since a CSV file has no way to declare its own encoding.
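The two export targets can share one code path. The following is a minimal sketch, assuming rows are already in memory; `rows_to_csv_bytes` and its `for_excel` flag are hypothetical names, and the sample rows are invented for the example.

```python
import csv
import io

def rows_to_csv_bytes(rows, for_excel: bool = False) -> bytes:
    """Serialize rows to CSV bytes: BOM-less UTF-8 by default, BOM-prefixed for Excel."""
    buf = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buf).writerows(rows)
    # "utf-8-sig" writes the EF BB BF prefix; plain "utf-8" does not.
    encoding = "utf-8-sig" if for_excel else "utf-8"
    return buf.getvalue().encode(encoding)

rows = [["name", "city"], ["\u7530\u4e2d", "\u6771\u4eac"]]

internal = rows_to_csv_bytes(rows)
for_excel = rows_to_csv_bytes(rows, for_excel=True)

print(internal.startswith(b"\xef\xbb\xbf"))   # False
print(for_excel.startswith(b"\xef\xbb\xbf"))  # True
print(for_excel[3:] == internal)              # True: only the BOM differs
```

Keeping the BOM decision in a single flag at the export boundary means internal pipelines never see it, while Excel users still get a file that opens correctly.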