data-engineering-storage-formats


Data Storage Formats

Comprehensive guide to modern data serialization formats for analytics and machine learning: Parquet, Apache Arrow, Lance, Zarr, Avro, and ORC. Learn compression tradeoffs, partitioning strategies, and when to use each format.

Quick Comparison

| Format | Type | Best For | Compression | Schema Evolution | Random Access |
|---|---|---|---|---|---|
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) |
| Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) |
| Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ (multi-modal) | |
| Zarr | Chunked arrays | ML, geospatial, N-dim | ✅ (Blosc, gzip) | ✅ (chunks) | ✅ (chunk-level) |
| Avro | Row-based | Streaming, Kafka | ✅ (deflate, snappy) | ✅ (full schema) | ❌ (sequential) |
| ORC | Columnar | Hive, Hadoop | ✅ (ZLIB, Snappy) | Limited | ✅ (stripe-level) |
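
The Type column separates row-based from columnar layouts. As a rough illustration of why that matters for reads, the sketch below packs the same records both ways using only the Python standard library. The sample records and helper names are hypothetical, and real formats add metadata, encoding, and compression on top of this basic idea.

```python
import struct

# Hypothetical sample records (not from the source): (user_id, score)
records = [(1, 0.5), (2, 0.75), (3, 0.25), (4, 1.0)]
n = len(records)

# Row-based layout (Avro-style): the fields of each record sit together.
# "<id" = little-endian int32 followed by float64, 12 bytes per record.
row_layout = b"".join(struct.pack("<id", uid, score) for uid, score in records)

# Columnar layout (Parquet/ORC-style): all values of one field are contiguous.
id_column = struct.pack(f"<{n}i", *(uid for uid, _ in records))
score_column = struct.pack(f"<{n}d", *(score for _, score in records))


def scores_from_rows(buf: bytes) -> list[float]:
    # Every whole record has to be decoded to pull out a single field.
    return [score for _, score in struct.iter_unpack("<id", buf)]


def scores_from_columns(buf: bytes, count: int) -> list[float]:
    # Only the bytes of the requested column are touched.
    return list(struct.unpack(f"<{count}d", buf))


print(scores_from_rows(row_layout), "bytes scanned:", len(row_layout))
print(scores_from_columns(score_column, n), "bytes scanned:", len(score_column))
```

Both helpers return the same values, but the columnar read only scans the score bytes, which is the property analytical engines rely on when a query selects a few columns from a wide table.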

When to Use Which?

Choose Parquet when:

  • You need broad compatibility (Spark, DuckDB, Polars, pandas)
First Seen
Mar 1, 2026