Practical Hash Tips
This page focuses on using hashes safely in real projects: collision risk, when to use HMAC or signatures, password hashing, and avoiding common pitfalls in encoding and parsing.
🧭 What This Covers
Overview
- ✅ Collision risk intuition and cautions when truncating hashes
- ✅ Checksum (CRC) vs cryptographic hash and where each fits
- ✅ HMAC vs digital signatures vs password hashing (KDF)
- ✅ Encoding/normalization pitfalls that change the hash
- ✅ Streaming large files and content-addressable examples
⚠️ Collisions and Truncation
Shorter hashes collide sooner; choose bit length by threat model.
Bit-length intuition
| Bits | Items for ~50% collision chance (birthday bound) | Typical note |
|---|---|---|
| 64 | ~5e9 | Temporary IDs or short hashes; not for long-term safety. |
| 128 | ~2e19 | MD5's output size, but MD5 has practical collision attacks far below this bound; avoid for safety. |
| 160 | ~1e24 | SHA-1's output size, but SHA-1 also has practical collision attacks; avoid for safety. |
| 256 | ~4e38 | SHA-256 or stronger is the standard choice for safety. |
When truncating
- ✂️ Truncating to N bits makes collisions likely once roughly 2^(N/2) items have been hashed (birthday bound).
- ✂️ If you shorten a hash for URLs or UI, keep enough bits for your item count, or store the full hash alongside the short form.
- ✂️ “Safe enough” depends on attacker cost; for public IDs, prefer full-length SHA-256 digests over truncated ones.
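To put numbers on the birthday bound, here is a minimal sketch using the standard approximation for uniformly random hash values. The function names are illustrative, and the result says nothing about deliberate attacks on a broken algorithm such as MD5.

```python
import math


def items_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate item count at which a collision occurs with probability p."""
    # Birthday approximation: n ~ sqrt(2 * ln(1 / (1 - p))) * 2^(bits / 2)
    return math.sqrt(2.0 * math.log(1.0 / (1.0 - p))) * 2.0 ** (bits / 2)


def collision_probability(bits: int, n_items: float) -> float:
    """Approximate chance of at least one collision among n_items random hashes."""
    # p ~ 1 - exp(-n^2 / 2^(bits + 1)); expm1 keeps precision when p is tiny.
    return -math.expm1(-(n_items ** 2) / 2.0 ** (bits + 1))


# Compare a 32-bit truncation with longer digests.
for bits in (32, 64, 128, 256):
    print(bits, f"{items_for_collision_probability(bits):.2e}")
```

For example, `collision_probability(64, 1e6)` estimates the risk of a 64-bit short ID colliding somewhere within one million records.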
🧮 Checksum vs Cryptographic Hash
CRC/Adler detect random errors; they do not resist intentional tampering.
Choose appropriately
- 🔍 Transfer integrity (non-adversarial) → CRC32/Adler32 can suffice.
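- 🔍 Tamper detection → cryptographic hashes (SHA-256+), with the reference digest obtained over a trusted channel; to authenticate the sender as well, use HMAC or a signature.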
- 🔍 Publish file hashes with SHA-256 or stronger for downloads.
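The difference between the two kinds of function can be seen side by side in a short sketch. The payload and the helper name `matches_published_sha256` are assumptions for illustration; the published digest is assumed to be a hex string obtained from a trusted source.

```python
import hashlib
import hmac
import zlib

data = b"example payload"

# CRC32: a 32-bit checksum. Good at catching random corruption,
# but easy for an attacker to match deliberately.
crc = zlib.crc32(data)

# SHA-256: a 256-bit cryptographic digest, suitable for published file hashes.
digest = hashlib.sha256(data).hexdigest()


def matches_published_sha256(payload: bytes, published_hex: str) -> bool:
    """Check a payload against a published SHA-256 hex digest."""
    actual = hashlib.sha256(payload).hexdigest()
    # Tolerate surrounding whitespace and uppercase hex in the published value.
    return hmac.compare_digest(actual, published_hex.strip().lower())


print(f"crc32={crc:08x} sha256={digest}")
```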
🧰 HMAC / Signatures / Password Hashing
Who holds the secret and what you protect determines the tool.
Roles
- 🧠 HMAC: shared secret; detects tampering, but anyone who holds the secret can forge tags, so a leaked key breaks it.
- 🧠 Digital signature: public verification; private key protection and canonicalization matter (XML/JSON).
- 🧠 Password hashing (KDF): PBKDF2/bcrypt/scrypt/Argon2 with salt, stretching, and often memory hardness.
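As a sketch of the HMAC role, the following uses Python's standard `hmac` module with SHA-256. The message format and the `sign`/`verify` helper names are made up for illustration; key storage and distribution are out of scope here.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> str:
    """Return a hex HMAC-SHA256 tag for the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag_hex: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(key, message), tag_hex)


# Both parties must hold the same secret key.
key = secrets.token_bytes(32)
tag = sign(key, b"amount=100&to=alice")
print(verify(key, b"amount=100&to=alice", tag))
print(verify(key, b"amount=900&to=alice", tag))
```

Because verification needs the same key as signing, any verifier can also produce valid tags; when third parties must verify without being able to sign, a digital signature is the fitting tool.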
Migration tips
- 🔧 On successful login, while the plaintext password is available, rehash it with the new scheme so accounts migrate gradually.
- 🔧 A pepper (app-held secret) adds resilience but needs rotation planning.
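Rehash-on-login can be sketched with PBKDF2 from the standard library, treating a raised iteration count as the "new scheme". The stored-string layout, the `TARGET_ITERATIONS` value, and the function names are assumptions for this example; a real system might move to Argon2 or scrypt instead.

```python
import hashlib
import hmac
import secrets

# Assumed current policy for this sketch; tune for your hardware.
TARGET_ITERATIONS = 600_000


def hash_password(password: str, iterations: int = TARGET_ITERATIONS) -> str:
    """Hash with a fresh random salt; parameters are stored next to the hash."""
    salt = secrets.token_bytes(16)
    dk = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${dk.hex()}"


def verify_and_upgrade(password: str, stored: str) -> tuple[bool, str]:
    """Verify a login attempt; return (ok, record to keep in the database)."""
    scheme, iterations, salt_hex, hash_hex = stored.split("$")
    if scheme != "pbkdf2_sha256":
        raise ValueError("unknown scheme: " + scheme)
    dk = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    if not hmac.compare_digest(dk, bytes.fromhex(hash_hex)):
        return False, stored
    # Password is correct: upgrade records that use weaker parameters.
    if int(iterations) < TARGET_ITERATIONS:
        stored = hash_password(password)
    return True, stored
```

The caller saves the returned record whenever it differs from the stored one, so each account upgrades the first time its owner logs in.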
🧾 Encoding and Normalization Pitfalls
Different encodings or input normalization can change the hash entirely.
Common gotchas
- 📌 Hex case, `0x` prefixes, and separators create different strings.
- 📌 Base64 vs Base64URL alphabets, padding, and line breaks (email tools often wrap long lines) all change the encoded string.
- 📌 Input newline (LF/CRLF) or Unicode normalization (NFC/NFD) differences yield different hashes.
Mitigations
- 🛠️ Normalize input before hashing and document the policy.
- 🛠️ Align on the encoding (hex/BASE64URL) when exchanging hashes externally.
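One possible normalization policy, sketched below, is NFC plus LF line endings plus UTF-8 before hashing, with digests compared as decoded bytes rather than as strings. The policy itself and both helper names are assumptions; what matters is that every party applies the same documented rules.

```python
import hashlib
import unicodedata


def canonical_text_digest(text: str) -> str:
    """Hash text after applying one fixed normalization policy."""
    text = unicodedata.normalize("NFC", text)              # unify composed/decomposed forms
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings to LF
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def same_digest(hex_a: str, hex_b: str) -> bool:
    """Compare two hex digests, ignoring case, a 0x prefix, and ':' or space separators."""
    def to_bytes(s: str) -> bytes:
        s = s.strip()
        if s[:2].lower() == "0x":
            s = s[2:]
        return bytes.fromhex(s.replace(":", "").replace(" ", ""))
    return to_bytes(hex_a) == to_bytes(hex_b)


composed = "caf\u00e9\r\n"     # precomposed e-acute, CRLF
decomposed = "cafe\u0301\n"    # e followed by combining acute accent, LF
print(canonical_text_digest(composed) == canonical_text_digest(decomposed))
```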
⏩ Streaming Hashes and Commands
Hash large files in chunks to avoid excessive memory use.
Practical points
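- 🚚 `hash_init` / `hash_update` / `hash_final` and `hash_file` (PHP), or `openssl dgst -sha256 file` on the command line, process data incrementally rather than loading it all into memory.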
- 🚚 For many files, keep a manifest (filename + hash) for bulk verification.
- 🚚 If you need to verify part of a large object without reading all of it, consider hashing fixed-size ranges and storing those digests alongside the object.
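The same chunked approach in Python, as a minimal sketch: the 1 MiB chunk size and the manifest shape (a path-to-digest mapping) are arbitrary choices for illustration.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream chunk by chunk so memory use stays bounded."""
    h = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        h.update(chunk)
    return h.hexdigest()


def sha256_of_file(path) -> str:
    """Hash a file on disk without reading it into memory at once."""
    with open(path, "rb") as f:
        return sha256_of_stream(f)


def build_manifest(paths) -> dict[str, str]:
    """Map each path to its digest for later bulk verification."""
    return {str(p): sha256_of_file(p) for p in paths}


# Demo on an in-memory stream; a real file object behaves the same way.
demo = io.BytesIO(b"a" * 3_000_000)
print(sha256_of_stream(demo))
```

The digest does not depend on the chunk size, so a manifest built this way can be checked by any tool that hashes the whole file.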
📦 Content-Addressable Examples
Addressing by content improves reproducibility and caching.
Examples
- 🧱 Git object IDs (SHA-1 → SHA-256 migration)
- 🧱 IPFS/CAS content IDs (Base58/Base32 hash encodings)
- 🧱 Docker layer digests (SHA-256)
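The idea behind these systems can be sketched as a toy in-memory store whose keys are derived from the content itself. The `ContentStore` class and its `sha256:<hex>` key format are illustrative only and do not reproduce the object format of Git, IPFS, or Docker.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-256 of the value."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    @staticmethod
    def address(data: bytes) -> str:
        return "sha256:" + hashlib.sha256(data).hexdigest()

    def put(self, data: bytes) -> str:
        key = self.address(data)
        # Identical content maps to the same key, so duplicates are stored once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # Re-hash on read: corruption shows up as an address mismatch.
        if self.address(data) != key:
            raise ValueError("stored content does not match its address")
        return data


store = ContentStore()
key = store.put(b"layer contents")
print(key, store.get(key) == b"layer contents")
```

Since the address is a function of the bytes, any cache or mirror can serve the content and the reader can still verify it independently.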
❓ FAQ
Q. Is MD5 or SHA-1 acceptable?
- A. Not for collision resistance. Restrict to legacy/non-safety contexts and plan replacement.
Q. How many bits are safe?
- A. SHA-256 (256 bits) is the common baseline. For extra long-term security margin, consider the SHA-512 family.