Choosing data formats
“Just use CSV or JSON” is common advice, but each format has its own sweet spots. Here’s a quick guide for data exchange, storage, and analytics.
When you need tabular data
For rows and columns, pick text or columnar binary based on how you’ll process it.
CSV / TSV
- Human-readable text. CSV separates values with commas; TSV uses tabs, which avoids quoting when values contain commas.
- Escaping and embedded newlines are the usual pain points in CSV; TSV sidesteps commas but still needs care when values contain tabs or newlines.
- Great for ad hoc exchange and spreadsheets; you still need to share the schema separately.
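The escaping pitfalls above are exactly what a real CSV library handles for you. A minimal sketch using Python's standard `csv` module, with made-up sample values, showing that quoted commas and embedded newlines round-trip correctly where a naive `split(",")` would break:

```python
import csv
import io

# Sample rows: the second value contains both a comma and a newline,
# the two classic failure modes of hand-rolled CSV parsing.
rows = [
    ["name", "note"],
    ["Ada", "likes commas, and\nnewlines"],
]

# The csv writer quotes fields that contain delimiters or newlines.
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Reading back recovers the original values, quotes and all.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == rows
```

The same module reads TSV by passing `delimiter="\t"`; the quoting machinery works identically.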
Parquet / Arrow
- Columnar binary with compression/encoding; fast for analytics and carries types.
- Standard in big data stacks, with broad support in Spark, Presto, BigQuery, and similar engines; Arrow also defines a standard in-memory columnar layout.
- Not human-readable, but Parquet is a good fit for long-term storage and analysis because the schema travels with the data.
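To see why columnar layout speeds up analytics, here is a toy pure-Python sketch (not Parquet itself, which adds compression, encodings, and an on-disk format on top of this idea) with invented sample data:

```python
# Toy illustration of columnar layout: pivot row-oriented records
# into one list per column, so a query touches only the columns
# it needs instead of every field of every record.
rows = [
    {"city": "Oslo", "temp_c": 4},
    {"city": "Lima", "temp_c": 19},
    {"city": "Pune", "temp_c": 28},
]

# Row store -> column store.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate now scans a single contiguous list; a columnar file
# would likewise read just the "temp_c" column from disk.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])
assert avg_temp == 17.0
```

In practice you would use a library such as `pyarrow` to read and write real Parquet files rather than rolling this by hand.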
When you need flexible structure
For nested objects and arrays, JSON variants are a good fit.
JSON / JSONL
- Key/value with nesting; common for APIs and configs.
- JSONL (one object per line) works well for streaming and incremental processing.
- Types are loose; pair with a schema (e.g., JSON Schema) to catch breaking changes.
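The one-object-per-line property is what makes JSONL stream-friendly. A minimal sketch with Python's standard `json` module and made-up event records, writing JSONL and reading it back one record at a time:

```python
import json
import io

# Sample events; each becomes one line of JSONL.
events = [
    {"event": "login", "user": "ada"},
    {"event": "click", "user": "ada", "target": "save"},
]

# Write: one compact JSON object per line.
buf = io.StringIO()
for e in events:
    buf.write(json.dumps(e) + "\n")

# Read: decode line by line, so a consumer never needs the
# whole file in memory and can resume after a partial read.
decoded = [json.loads(line) for line in io.StringIO(buf.getvalue())]
assert decoded == events
```

With a real file, the same loop over `open(path)` processes records incrementally, which is why JSONL suits logs and streaming pipelines.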
Quick guidance by use case
Balance readability, size, speed, and tool support.
Recommendations
- Human-friendly editing: TSV/CSV (mind escaping).
- APIs, configs, logs: JSON/JSONL (manage schema separately).
- Analytics and large datasets: columnar binaries like Parquet/Arrow with schema.