Data Catalogs

Comprehensive guide to data catalog systems: purpose, Iceberg catalog implementations (Hive Metastore, AWS Glue, Tabular), using DuckDB as a lightweight multi-source catalog, and comparisons of open-source catalog tools (Amundsen, DataHub, OpenMetadata). Learn selection criteria, setup patterns, and best practices for data discovery, governance, and unified querying.

Why Catalogs Matter

Data catalogs are centralized metadata repositories that enable:

Data discovery: Find datasets by name, schema, owner, or tags
Governance: Access control, data lineage, PII tagging
Schema management: Track table schemas, partitions, evolution over time
Table format abstraction: Iceberg/Delta/Hudi tables registered in a catalog can be queried by multiple engines (Spark, Trino, Flink, DuckDB) without each engine needing to know the underlying storage URIs
Multi-engine consistency: The same table name resolves to the same table in Spark, DuckDB, and Trino

Without a catalog, you must manage table locations and schemas manually in each engine.
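
The idea can be illustrated with a toy in-memory catalog. This is only a sketch of the concept, not how Hive Metastore, AWS Glue, or any real catalog is implemented; the class names, fields, table names, and storage URIs below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metadata a catalog might keep for one table (hypothetical fields)."""
    name: str
    location: str            # storage URI that query authors never need to see
    schema: dict[str, str]   # column name -> column type
    owner: str
    tags: set[str] = field(default_factory=set)


class ToyCatalog:
    """In-memory stand-in for a catalog service; illustration only."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def register(self, entry: TableEntry) -> None:
        """Record a table's metadata under its name."""
        self._tables[entry.name] = entry

    def resolve(self, name: str) -> str:
        """Table format abstraction: map a table name to its storage location."""
        return self._tables[name].location

    def find_by_tag(self, tag: str) -> list[str]:
        """Data discovery: list the names of tables carrying a given tag."""
        return sorted(n for n, e in self._tables.items() if tag in e.tags)


catalog = ToyCatalog()
catalog.register(TableEntry(
    name="sales.orders",
    location="s3://example-bucket/warehouse/sales/orders",  # made-up URI
    schema={"order_id": "bigint", "email": "varchar"},
    owner="analytics",
    tags={"pii"},
))
catalog.register(TableEntry(
    name="sales.products",
    location="s3://example-bucket/warehouse/sales/products",  # made-up URI
    schema={"product_id": "bigint", "title": "varchar"},
    owner="analytics",
))

# Every engine asks the catalog for the location instead of hard-coding it.
orders_location = catalog.resolve("sales.orders")
pii_tables = catalog.find_by_tag("pii")
```

Each engine that shares the catalog calls the equivalent of `resolve` with the same table name, so a table can be moved by updating one catalog entry rather than every engine's configuration.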
