data-engineering-catalogs

Data Catalogs

A guide to data catalog systems: their purpose, Iceberg catalog implementations (Hive Metastore, AWS Glue, Tabular), DuckDB as a lightweight multi-source catalog, and a comparison of open-source catalog tools (Amundsen, DataHub, OpenMetadata). It covers selection criteria, setup patterns, and best practices for data discovery, governance, and unified querying.


Why Catalogs Matter

Data catalogs are centralized metadata repositories that enable:

  • Data discovery: Find datasets by name, schema, owner, tags
  • Governance: Access control, data lineage, PII tagging
  • Schema management: Track table schemas, partitions, evolution over time
  • Table format abstraction: Iceberg/Delta/Hudi tables registered in a catalog can be queried by multiple engines (Spark, Trino, Flink, DuckDB) without knowing the underlying storage URIs
  • Multi-engine consistency: The same table name resolves to the same table in Spark, DuckDB, and Trino

Without a catalog, you must manage table locations and schemas manually in each engine.
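
To make the name-to-metadata idea concrete, here is a minimal sketch of what a catalog does at its core. It is a toy in-memory stand-in, not the API of Hive Metastore, Glue, or any real catalog; the class names (TableEntry, ToyCatalog), the table name, and the s3://example-bucket path are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """Metadata a catalog might keep for one table (illustrative fields only)."""
    name: str
    location: str            # storage URI; the value used below is hypothetical
    schema: dict[str, str]   # column name -> type
    owner: str
    tags: set[str] = field(default_factory=set)


class ToyCatalog:
    """In-memory stand-in for a catalog service; not a real catalog API."""

    def __init__(self) -> None:
        self._tables: dict[str, TableEntry] = {}

    def register(self, entry: TableEntry) -> None:
        # Store the entry under its table name.
        self._tables[entry.name] = entry

    def resolve(self, name: str) -> TableEntry:
        # An engine asks by name; the catalog supplies location and schema.
        return self._tables[name]

    def find_by_tag(self, tag: str) -> list[str]:
        # Discovery: list table names carrying a given tag, sorted by name.
        return sorted(n for n, e in self._tables.items() if tag in e.tags)


catalog = ToyCatalog()
catalog.register(TableEntry(
    name="analytics.orders",
    location="s3://example-bucket/warehouse/analytics/orders",
    schema={"order_id": "bigint", "customer_email": "string"},
    owner="data-platform",
    tags={"pii"},
))

entry = catalog.resolve("analytics.orders")
print(entry.location)
print(catalog.find_by_tag("pii"))
```

Every engine that shares this lookup sees the same location and schema for analytics.orders, which is the consistency the list above describes. A real catalog adds persistence, access control, and concurrency handling on top of this basic mapping.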


Installs
6
First Seen
Feb 11, 2026