data-engineering-storage-formats
Data Storage Formats
Comprehensive guide to modern data serialization formats for analytics and machine learning: Parquet, Apache Arrow, Lance, Zarr, Avro, and ORC. Learn compression tradeoffs, partitioning strategies, and when to use each format.
Quick Comparison
| Format | Type | Best For | Compression | Schema Evolution | Random Access |
|---|---|---|---|---|---|
| Parquet | Columnar | Analytics, data lakes | ✅ (Snappy, Zstd, LZ4) | ✅ (add/drop) | ✅ (row groups) |
| Arrow/Feather | Columnar | In-memory, IPC, ML | ✅ (LZ4, Zstd) | Limited | ✅ (record batches) |
| Lance | Columnar | ML pipelines, vectors | ✅ (Zstd, LZ4) | ✅ | ✅ (row-level) |
| Zarr | Chunked arrays | ML, geospatial, N-dim | ✅ (Blosc, gzip) | ✅ (chunks) | ✅ (chunk-level) |
| Avro | Row-based | Streaming, Kafka | ✅ (deflate, snappy) | ✅ (full schema) | ❌ (sequential) |
| ORC | Columnar | Hive, Hadoop | ✅ (ZLIB, Snappy) | Limited | ✅ (stripe-level) |
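The "Type" and "Random Access" columns are easier to picture with a toy example. The sketch below uses only the Python standard library and does not reproduce any of the real formats: it stores the same records once row by row (roughly the Avro idea) and once as separately compressed columns inside fixed-size row groups (roughly the Parquet and ORC idea). The sample records, the `row_group_size` value, and the helper names `encode_row_based`, `encode_columnar`, and `read_column` are all hypothetical and exist only for illustration.

```python
import json
import zlib

# Hypothetical sample records, used only for illustration.
records = [
    {"id": i, "country": "DE" if i % 2 else "US", "amount": i * 1.5}
    for i in range(1000)
]


def encode_row_based(rows):
    """Row layout: each record is stored whole, one after another."""
    payload = "\n".join(json.dumps(r) for r in rows).encode("utf-8")
    return zlib.compress(payload)


def encode_columnar(rows, row_group_size=250):
    """Columnar layout: split rows into groups, then compress every
    column of every group as its own independent block."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        chunk = rows[start:start + row_group_size]
        # Pivot the chunk from a list of records into one list per column.
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        groups.append({
            name: zlib.compress(json.dumps(values).encode("utf-8"))
            for name, values in columns.items()
        })
    return groups


def read_column(groups, name, group_index):
    """Decompress one column of one row group; other blocks stay untouched."""
    return json.loads(zlib.decompress(groups[group_index][name]))


row_blob = encode_row_based(records)
col_groups = encode_columnar(records)

# Random access: fetch only "amount" for rows 500-749 (the third row group).
amounts = read_column(col_groups, "amount", group_index=2)
print(len(col_groups), len(amounts), amounts[0])
```

In the row-based blob, reading a single field means decompressing and parsing every record in order, which is why the table marks Avro as sequential. In the columnar layout, a reader can skip straight to one column of one row group. Real formats add binary encodings, statistics, and metadata on top of this basic idea.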
When to Use Which?
Choose Parquet when:
- You need broad compatibility (Spark, DuckDB, Polars, pandas)