ゆるテックノート

Practical Hash Tips

This page focuses on using hashes safely in real projects: collision risk, when to use HMAC or signatures, password hashing, and avoiding common pitfalls in encoding and parsing.

🧭 What This Covers

Overview

  • Collision risk intuition and cautions when truncating hashes
  • Checksum (CRC) vs cryptographic hash and where each fits
  • HMAC vs digital signatures vs password hashing (KDF)
  • Encoding/normalization pitfalls that change the hash
  • Streaming large files and content-addressable examples

⚠️ Collisions and Truncation

Shorter hashes collide sooner; choose bit length by threat model.

Bit-length intuition

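| Bits | Birthday-bound scale (items for ~50% collision chance) | Typical note |
| --- | --- | --- |
| 64 | ~5e9 | Temporary IDs or short hashes; not for long-term security. |
| 128 | ~2e19 | MD5 has practical collisions found by cryptanalysis, far below this bound; avoid for security. |
| 160 | ~1e24 | SHA-1 has practical collisions found by cryptanalysis, below this bound; avoid for security. |
| 256 | ~4e38 | SHA-256+ is the standard choice for security. |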
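
The scale column can be reproduced with the usual birthday approximation, p ≈ 1 − exp(−n(n−1) / 2^(b+1)). The following is a minimal sketch that assumes an ideal hash with uniformly random outputs; the function names are illustrative and not from any library.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items ideal hashes collide."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    exponent = -(n_items * (n_items - 1)) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)


def items_for_probability(bits: int, p: float = 0.5) -> float:
    """Approximate item count at which the collision chance reaches p."""
    return math.sqrt(2 ** (bits + 1) * math.log(1 / (1 - p)))


if __name__ == "__main__":
    for bits in (64, 128, 160, 256):
        print(bits, f"{items_for_probability(bits):.1e}")
```

These numbers describe accidental collisions only. Broken algorithms such as MD5 and SHA-1 collide far sooner under a deliberate attack.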

When truncating

  • ✂️ Truncating to N bits makes collisions feasible after roughly 2^(N/2) hashed inputs (birthday bound).
  • ✂️ If you shorten a hash for URLs or UI, keep enough bits for your expected item count, or store the full hash and use the short form only for display.
  • ✂️ “Safe enough” depends on attacker cost; for public IDs, prefer keeping the full SHA-256 length.
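
To make the truncation risk concrete, here is a small sketch that shortens SHA-256 output and then brute-forces a collision on the shortened value. Both `short_id` and `find_truncated_collision` are hypothetical helper names, and the 96-bit default is only an example, not a recommendation.

```python
import hashlib


def short_id(data: bytes, bits: int = 96) -> str:
    """Return the first `bits` bits of SHA-256(data) as lowercase hex."""
    if bits % 8 != 0 or not 8 <= bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    return hashlib.sha256(data).digest()[: bits // 8].hex()


def find_truncated_collision(bits: int) -> tuple[bytes, bytes, int]:
    """Search for two distinct inputs with the same truncated hash.

    Returns (first_input, second_input, attempts). Only practical for
    small `bits`, which is exactly the point.
    """
    seen: dict[str, bytes] = {}
    counter = 0
    while True:
        candidate = str(counter).encode()
        key = short_id(candidate, bits)
        if key in seen:
            return seen[key], candidate, counter + 1
        seen[key] = candidate
        counter += 1
```

Calling `find_truncated_collision(32)` typically needs on the order of 2^16 attempts, which matches the 2^(N/2) rule of thumb.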

🧮 Checksum vs Cryptographic Hash

CRC/Adler detect random errors; they do not resist intentional tampering.

Choose appropriately

  • 🔍 Transfer integrity (non-adversarial) → CRC32/Adler32 can suffice.
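  • 🔍 Tamper detection → cryptographic hashes (SHA-256+), compared against a reference value obtained over a trusted channel; to authenticate the sender, use HMAC or a signature.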
  • 🔍 Publish file hashes with SHA-256 or stronger for downloads.
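
The difference can be shown in a few lines. CRC32 is affine over XOR, so an attacker can predict how a change to the data changes the checksum; SHA-256 has no such structure. The sketch below uses only the standard library, and the helper names are illustrative.

```python
import hashlib
import hmac
import zlib


def xor_bytes(*chunks: bytes) -> bytes:
    """XOR equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def crc32_is_affine(a: bytes, b: bytes, c: bytes) -> bool:
    """Check that CRC32(a^b^c) equals CRC32(a)^CRC32(b)^CRC32(c).

    Holds for any three inputs of equal length, which is why a CRC
    cannot resist intentional tampering.
    """
    combined = zlib.crc32(xor_bytes(a, b, c))
    return combined == (zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c))


def matches_published_sha256(data: bytes, published_hex: str) -> bool:
    """Compare data against a published SHA-256 hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    expected = published_hex.strip().lower()
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(actual.encode(), expected.encode())
```

A published hash only helps if it reaches the user over a channel the attacker cannot also modify.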

🧰 HMAC / Signatures / Password Hashing

Who holds the secret and what you protect determines the tool.

Roles

  • 🧠 HMAC: shared secret; detects tampering but is forgeable if the secret leaks.
  • 🧠 Digital signature: public verification; private key protection and canonicalization matter (XML/JSON).
  • 🧠 Password hashing (KDF): PBKDF2/bcrypt/scrypt/Argon2 with a per-password salt and stretching; scrypt and Argon2 add memory hardness.
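
As a sketch of the HMAC role, the following signs and verifies a JSON payload with a shared key. The canonicalization here (sorted keys, no whitespace) is a simplified assumption for illustration, not a standardized canonical form, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_json(payload: dict) -> bytes:
    """Serialize so that equal payloads always produce equal bytes."""
    text = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def sign(payload: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 tag (hex) over the canonical payload."""
    return hmac.new(key, canonical_json(payload), hashlib.sha256).hexdigest()


def verify(payload: dict, key: bytes, tag_hex: str) -> bool:
    """Recompute the tag and compare in constant time."""
    expected = sign(payload, key)
    return hmac.compare_digest(expected.encode(), tag_hex.encode())
```

Anyone who can verify with the shared key can also produce valid tags, which is the main reason to move to digital signatures when verifiers should not be able to sign.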

Migration tips

  • 🔧 Migrate gradually: when a user logs in and the password verifies against the old hash, rehash it with the new scheme and replace the stored value.
  • 🔧 A pepper (app-held secret) adds resilience but needs rotation planning.
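
Rehash-on-login might look like the sketch below. It uses PBKDF2 from the standard library and upgrades only the iteration count; the record format `pbkdf2_sha256$iterations$salt$hash` and the value of `CURRENT_ITERATIONS` are assumptions made for this example.

```python
import hashlib
import hmac
import os

CURRENT_ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password: str, iterations: int = CURRENT_ITERATIONS) -> str:
    """Create a new record with a fresh random salt."""
    salt = os.urandom(16)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${derived.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and iteration count."""
    scheme, iterations, salt_hex, hash_hex = stored.split("$")
    if scheme != "pbkdf2_sha256":
        return False
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(derived, bytes.fromhex(hash_hex))


def needs_rehash(stored: str, iterations: int = CURRENT_ITERATIONS) -> bool:
    """True when the record uses an older scheme or a lower work factor."""
    scheme, stored_iterations, _, _ = stored.split("$")
    return scheme != "pbkdf2_sha256" or int(stored_iterations) < iterations


def login(password: str, stored: str, iterations: int = CURRENT_ITERATIONS) -> tuple[bool, str]:
    """Return (ok, record_to_store); the record is upgraded on success."""
    if not verify_password(password, stored):
        return False, stored
    if needs_rehash(stored, iterations):
        return True, hash_password(password, iterations)
    return True, stored
```

The plaintext password is available only at login time, which is why the upgrade happens there rather than in a batch job.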

🧾 Encoding and Normalization Pitfalls

Different encodings or input normalization can change the hash entirely.

Common gotchas

  • 📌 Hex case, `0x` prefixes, and separators produce different strings for the same digest.
  • 📌 Base64 vs Base64URL, padding, and line breaks on/off (email tools often wrap lines).
  • 📌 Input newline (LF/CRLF) or Unicode normalization (NFC/NFD) differences yield different hashes.
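
One way to sidestep the string-format gotchas is to parse every incoming hash back to raw bytes and compare bytes, never strings. The sketch below assumes SHA-256 digests; the tolerated separators and the function names are choices made for illustration.

```python
import base64
import hashlib
import re


def encodings(data: bytes) -> dict[str, str]:
    """Show one SHA-256 digest in three common text encodings."""
    digest = hashlib.sha256(data).digest()
    return {
        "hex": digest.hex(),
        "base64": base64.b64encode(digest).decode("ascii"),
        "base64url": base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii"),
    }


def parse_hex_digest(text: str) -> bytes:
    """Accept hex with optional 0x prefix, any case, and ':', '-' or space separators."""
    cleaned = re.sub(r"[\s:\-]", "", text.strip())
    if cleaned[:2].lower() == "0x":
        cleaned = cleaned[2:]
    return bytes.fromhex(cleaned)


def parse_base64_digest(text: str) -> bytes:
    """Accept Base64 or Base64URL, with or without padding or line breaks."""
    cleaned = "".join(text.split()).replace("-", "+").replace("_", "/")
    cleaned += "=" * (-len(cleaned) % 4)  # restore stripped padding
    return base64.b64decode(cleaned, validate=True)
```

After parsing, compare the resulting bytes with `hmac.compare_digest` when the comparison is security relevant.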

Mitigations

  • 🛠️ Normalize input before hashing and document the policy.
  • 🛠️ Align on the encoding (hex/BASE64URL) when exchanging hashes externally.
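
A normalization policy can be as small as the sketch below. The specific policy (NFC, LF line endings, UTF-8) is an assumption for this example; what matters is that both sides apply the same one.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> bytes:
    """Example policy: Unicode NFC, LF line endings, UTF-8 bytes."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


def text_sha256(text: str) -> str:
    """Hash text only after applying the normalization policy."""
    return hashlib.sha256(normalize_text(text)).hexdigest()
```

For example, "café" typed as a precomposed character with LF and as "e" plus a combining accent with CRLF hash differently as raw bytes, but identically through `text_sha256`.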

Streaming Hashes and Commands

Hash large files in chunks to avoid excessive memory use.

Practical points

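  • 🚚 `hash_init` / `hash_update` / `hash_final` or `hash_file` (PHP), and `openssl dgst -sha256 file` on the command line, process data incrementally rather than loading it all into memory.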
  • 🚚 For many files, keep a manifest (filename + hash) for bulk verification.
  • 🚚 If you need partial verification of large objects, consider hashing fixed-size ranges (chunks) separately so one range can be checked without rereading everything.
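
The same ideas in Python, as a minimal sketch: a chunked file hash plus a manifest writer and verifier. The manifest layout (hex digest, two spaces, relative path) is an assumption chosen to resemble common checksum tools, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so memory use stays constant."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> None:
    """Record '<hash>  <relative path>' for every file under root."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file() and p != manifest):
        lines.append(f"{sha256_file(path)}  {path.relative_to(root).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(root: Path, manifest: Path) -> list[str]:
    """Return relative paths that are missing or whose hash differs."""
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line:
            continue
        expected, name = line.split("  ", 1)
        target = root / name
        if not target.is_file() or sha256_file(target) != expected:
            failures.append(name)
    return failures
```

The manifest itself needs protection (for example a signature or an HMAC), otherwise an attacker who can change files can change the manifest too.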

📦 Content-Addressable Examples

Addressing by content improves reproducibility and caching.

Examples

  • 🧱 Git object IDs (SHA-1 → SHA-256 migration)
  • 🧱 IPFS/CAS content IDs (Base58/Base32 hash encodings)
  • 🧱 Docker layer digests (SHA-256)
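
The pattern behind these systems fits in a few lines. The sketch below computes a Git-style SHA-1 blob ID (Git hashes a `blob <size>` header, a NUL byte, and then the content) and shows a toy in-memory store addressed by SHA-256. `ContentStore` is a hypothetical class for illustration, not how Git, IPFS, or Docker are implemented.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """SHA-1 object ID that Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ContentStore:
    """Toy in-memory content-addressable store keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content and return its address."""
        key = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(key, content)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        """Fetch content and re-verify it against its address."""
        content = self._blobs[key]
        if hashlib.sha256(content).hexdigest() != key:
            raise ValueError("stored content does not match its address")
        return content
```

Because the address is derived from the content, identical data deduplicates automatically and any corruption is detectable on read.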

FAQ

Q. Is MD5 or SHA-1 acceptable?

  • A. Not where collision resistance matters. Restrict them to legacy or non-security contexts and plan a replacement.

Q. How many bits are safe?

  • A. SHA-256 (256 bits) is the common baseline. For a larger long-term security margin, consider the SHA-512 family (SHA-384/SHA-512).