ゆるテックノート

Text encoding basics

When text garbles, it's usually an encoding mismatch. This page outlines UTF-8 vs. UTF-16, BOM, and common pitfalls (like CSV encodings).

Key concepts

A character set maps characters to code points; an encoding turns code points into bytes.
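To make the two steps concrete, here is a minimal Python sketch. The helper name describe is made up for illustration; it only uses the built-in ord() and str.encode().

```python
def describe(ch):
    """Return (code point label, UTF-8 bytes, UTF-16 big-endian bytes) for one character."""
    code_point = f"U+{ord(ch):04X}"  # character -> code point (the character set's job)
    # code point -> bytes (the encoding's job); two encodings give different bytes
    return code_point, ch.encode("utf-8"), ch.encode("utf-16-be")

# Escapes are used so the example does not depend on how this file itself is encoded:
# "A", HIRAGANA LETTER A, and an emoji outside the Basic Multilingual Plane.
for ch in ["A", "\u3042", "\U0001F600"]:
    label, utf8, utf16 = describe(ch)
    print(label, utf8.hex(" "), "|", utf16.hex(" "))
```

The same code point yields different byte sequences depending on the encoding, which is why the reader has to know which encoding the writer used.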

UTF-8 vs. UTF-16

  • Both encode Unicode; UTF-8 is variable-length, ASCII-compatible, and the Web default.
  • UTF-16 uses 2-byte units (4 bytes for surrogate pairs); it is usually larger than UTF-8 for ASCII-heavy text, though CJK text can be smaller (2 bytes vs. 3 per character).
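The size difference is easy to check. This is a small sketch; encoded_sizes is a hypothetical helper, and "utf-16-le" is used so that no BOM is counted.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for a string."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = [
    "hello",                           # ASCII: 1 byte each in UTF-8, 2 in UTF-16
    "\u3053\u3093\u306b\u3061\u306f",  # five hiragana: 3 bytes each in UTF-8, 2 in UTF-16
    "\U0001F600",                      # emoji: 4 bytes in both (surrogate pair in UTF-16)
]
for text in samples:
    print(ascii(text), encoded_sizes(text))
```

Which encoding is smaller depends on the text, so ASCII-heavy data such as markup and logs favors UTF-8.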

BOM (Byte Order Mark)

  • UTF-8 BOM is 0xEF 0xBB 0xBF; optional but some tools expect it.
  • UTF-16 uses a BOM to signal endianness (FE FF for BE, FF FE for LE) unless the byte order is declared explicitly (UTF-16BE/UTF-16LE).
  • BOM can break some scripts/configs; omit unless you need it.
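A rough way to check for a BOM and to drop it when decoding, using Python's codecs constants. The function name sniff_bom is illustrative, and UTF-32 is deliberately left out, so treat this as a sketch rather than a complete detector.

```python
import codecs

def sniff_bom(data):
    """Return a Python codec name implied by a leading BOM, or None if there is none.

    Only UTF-8 and UTF-16 are handled. A UTF-32 LE BOM (FF FE 00 00) also starts
    with FF FE, so it would be misreported as UTF-16 here.
    """
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "utf-8-sig"  # this codec decodes UTF-8 and drops a leading BOM
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):  # FE FF / FF FE
        return "utf-16"  # this codec reads the byte order from the BOM, then drops it
    return None

with_bom = codecs.BOM_UTF8 + b"key=value"
print(sniff_bom(with_bom))
print(ascii(with_bom.decode("utf-8")))      # BOM survives as U+FEFF at the start
print(ascii(with_bom.decode("utf-8-sig")))  # BOM removed
```

The leftover U+FEFF in the plain "utf-8" decode is the invisible character that tends to break scripts and config parsers.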

Common mojibake patterns 🔄

Garbled text appears when the reader assumes the wrong encoding.

Typical causes

  • File saved as UTF-8 but opened as Shift_JIS/CP932 (common with some Excel defaults).
  • Tools that expect BOM mis-handle BOM-less UTF-8, or treat BOM bytes as data.
  • HTTP responses declaring the wrong charset (e.g., says Shift_JIS but sends UTF-8).
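The first cause can be reproduced in a few lines. This is a sketch with a hypothetical helper, misread; errors="replace" is used so that undecodable bytes become U+FFFD instead of raising, which is roughly what a lenient viewer does.

```python
def misread(text, written_as, read_as):
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")

original = "\u6587\u5b57\u5316\u3051"  # the Japanese word "mojibake"

# UTF-8 bytes read as CP932: no error is raised, but the text is garbled.
garbled = misread(original, "utf-8", "cp932")
print(ascii(garbled))

# The opposite direction usually fails loudly under strict decoding,
# because CP932 lead bytes are rarely valid UTF-8.
try:
    original.encode("cp932").decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 decode failed:", exc.reason)
```

Strict decoding is the safer default in pipelines: an exception is easier to notice than silently garbled text.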

Fixes

  • Standardize on UTF-8 (no BOM) end-to-end where possible.
  • If you must use Shift_JIS variants, convert explicitly and tell consumers.
  • In HTTP, set Content-Type with the correct charset; in HTML, include <meta charset="utf-8">.
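As a sketch of the last two fixes: explicit conversion, plus a sanity check that a body actually decodes under the charset its Content-Type declares. The function names are made up, and falling back to UTF-8 when no charset is declared is an assumption of this example, not a rule. The header parsing relies on the standard library's email.message.Message.

```python
from email.message import Message

def convert(data, src_encoding="cp932", dst_encoding="utf-8"):
    """Re-encode bytes explicitly; strict decoding raises instead of silently garbling."""
    return data.decode(src_encoding).encode(dst_encoding)

def declared_charset(content_type):
    """Extract the charset parameter from a Content-Type header value (lowercased)."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

def body_matches(content_type, body):
    """Return True if the body decodes under the declared charset."""
    charset = declared_charset(content_type) or "utf-8"  # assumed fallback
    try:
        body.decode(charset)
        return True
    except (UnicodeDecodeError, LookupError):  # bad bytes or unknown charset name
        return False

sjis_body = "\u6771\u4eac".encode("cp932")  # "Tokyo" in CP932
print(declared_charset("text/html; charset=Shift_JIS"))
print(body_matches("text/csv; charset=utf-8", sjis_body))           # header and body disagree
print(body_matches("text/csv; charset=utf-8", convert(sjis_body)))  # after explicit conversion
```

A successful decode does not prove the declaration is right, since some byte sequences are valid in more than one encoding, so this only catches the loud mismatches.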

CSV pitfalls 📄

CSV defaults vary by tool, so encodings often mismatch.

Encoding gotchas

  • Some localized Excel versions default to CP932/Shift_JIS; UTF-8 CSV may garble.
  • Modern Excel detects BOM-ed UTF-8; without BOM it may guess wrong.
  • Unix tools and data platforms expect UTF-8 (often BOM-less).
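When a CSV arrives with no stated encoding, one common heuristic is to try UTF-8 first and fall back to a legacy code page. A sketch under that assumption follows; decode_csv_bytes and its cp932 default are illustrative choices, not a standard.

```python
def decode_csv_bytes(data, legacy_encoding="cp932"):
    """Return (text, encoding_used) for raw CSV bytes.

    "utf-8-sig" accepts UTF-8 with or without a BOM. If strict UTF-8 decoding
    fails, assume the legacy encoding. This is a guess, not detection.
    """
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(legacy_encoding), legacy_encoding

header = "\u540d\u524d,\u90fd\u5e02\r\n"  # "name,city" in Japanese
for raw in (header.encode("utf-8"), header.encode("utf-8-sig"), header.encode("cp932")):
    text, used = decode_csv_bytes(raw)
    print(used, text == header)
```

Trying UTF-8 first works because legacy multi-byte text rarely happens to be valid UTF-8, while the reverse check is much weaker.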

Recommended handling

  • Use UTF-8 (no BOM) internally.
  • For Excel users, provide BOM-ed UTF-8 or instructions to import with UTF-8.
  • Document the charset in specs, since CSV files cannot self-describe charset.
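The handling above might look like this in Python: BOM-less UTF-8 by default, with a BOM added only for the Excel-facing export. The csv_bytes helper and its for_excel flag are hypothetical names for this sketch.

```python
import csv
import io

def csv_bytes(rows, for_excel=False):
    """Serialize rows to CSV bytes: plain UTF-8 by default, UTF-8 with BOM for Excel."""
    buf = io.StringIO(newline="")  # let the csv module control line endings (\r\n)
    csv.writer(buf).writerows(rows)
    encoding = "utf-8-sig" if for_excel else "utf-8"  # "utf-8-sig" prepends EF BB BF
    return buf.getvalue().encode(encoding)

rows = [["name", "city"], ["Sato", "\u6771\u4eac"]]
internal = csv_bytes(rows)
export = csv_bytes(rows, for_excel=True)
print(internal[:3].hex(" "))  # first bytes of the data itself
print(export[:3].hex(" "))    # ef bb bf

# Reading with "utf-8-sig" accepts either variant.
for data in (internal, export):
    print(list(csv.reader(io.StringIO(data.decode("utf-8-sig"), newline=""))) == rows)
```

Keeping the BOM decision at the export boundary means internal tools never see it, while Excel users still get a file that opens correctly.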