ゆるテックノート

Choosing data formats

“Just use CSV or JSON” is common advice, but each format has its sweet spots. Here’s a quick guide covering exchange, storage, and analytics.

When you need tabular data

For rows and columns, pick text or columnar binary based on how you’ll process it.

CSV / TSV

  • Human-readable text. CSV uses commas; TSV uses tabs and is simpler when values contain commas.
  • Escaping and embedded newlines can be tricky. TSV avoids commas but needs care for tabs in data.
  • Great for ad hoc exchange and spreadsheets; you still need to share the schema separately.
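The escaping pitfalls above are exactly what a real CSV library handles for you; a minimal sketch using Python's standard csv module, with made-up sample rows:

```python
import csv
import io

# Values containing commas or embedded newlines are the classic
# failure mode of hand-rolled split(",") parsing.
rows = [
    ["name", "note"],
    ["Alice", "likes, commas"],
    ["Bob", "multi\nline note"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quotes fields with commas/newlines automatically

buf.seek(0)
parsed = list(csv.reader(buf))  # round-trip preserves the tricky values intact
```

Letting the library do the quoting is the whole point: the same data written by naive string joining would split "likes, commas" into two fields on read.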

Parquet / Arrow

  • Columnar binary with compression/encoding; fast for analytics and carries types.
  • Standard in big data stacks (Spark/Presto/BigQuery compatibility). Arrow also defines a language-independent in-memory columnar layout.
  • Not human-friendly, but good for long-term storage and analysis with explicit schema.

When you need flexible structure

For nested objects and arrays, JSON variants are a good fit.

JSON / JSONL

  • Key/value with nesting; common for APIs and configs.
  • JSONL (one object per line) works well for streaming and incremental processing.
  • Types are loose; pair with a schema (e.g., JSON Schema) to catch breaking changes.
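The streaming point above can be sketched with Python's standard json module; the sample records are made up, and io.StringIO stands in for a real log file:

```python
import io
import json

# JSONL: one JSON object per line, so a consumer can process the
# stream record by record instead of parsing one giant document.
records = [
    {"event": "login", "user": "alice"},
    {"event": "purchase", "user": "bob", "items": [1, 2]},
]

buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")  # each record on its own line

buf.seek(0)
parsed = [json.loads(line) for line in buf]  # incremental, line-at-a-time
```

Because each line is independent, appending new records and resuming a partially processed file are both trivial, which is why the format suits logs.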

Quick guidance by use case

Balance readability, size, speed, and tool support.

Recommendations

  • Human-friendly editing: TSV/CSV (mind escaping).
  • APIs, configs, logs: JSON/JSONL (manage schema separately).
  • Analytics and large datasets: columnar binaries like Parquet/Arrow with schema.