data-catalog

Purpose

This skill manages metadata for data assets, enabling discovery, governance, and lineage tracking in data engineering workflows. It catalogs datasets, schemas, and dependencies to support data-driven projects.

When to Use

Use this skill when you need to track data assets in a project, such as during ETL processes, data governance audits, or when building data pipelines. Apply it in scenarios involving large-scale data repositories, compliance requirements, or collaborative data teams.

Key Capabilities

Register and update metadata for datasets using JSON structures, e.g., {"name": "sales_data", "schema": {"columns": ["id", "date"]}}.
Search assets with full-text or tag-based filters, and run lineage queries such as tracing a dataset back to its origins.
Enforce governance policies, such as access controls, by attaching tags like "sensitive" to assets.
Generate lineage graphs in JSON format, e.g., {"source": "raw_logs", "target": "processed_reports"}.
Integrate with storage systems such as S3 or databases through connectors that read their API key from the $DATA_CATALOG_API_KEY environment variable.
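
The capabilities above can be illustrated with a small in-memory sketch. This is not the skill's actual API: the names Asset, InMemoryCatalog, register, find_by_tag, add_lineage, and upstream are hypothetical and chosen only for illustration. It assumes metadata and lineage edges use the JSON shapes shown in the list.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """Metadata record for one dataset (hypothetical shape)."""
    name: str
    schema: dict
    tags: set = field(default_factory=set)


class InMemoryCatalog:
    """Toy stand-in for a catalog backend; not the skill's real API."""

    def __init__(self):
        self.assets = {}  # asset name -> Asset
        self.edges = []   # lineage edges as {"source": ..., "target": ...}

    def register(self, metadata, tags=()):
        # Registering an existing name overwrites it, which doubles as an update.
        asset = Asset(metadata["name"], metadata.get("schema", {}), set(tags))
        self.assets[asset.name] = asset
        return asset

    def find_by_tag(self, tag):
        """Return the names of all assets carrying `tag`, sorted."""
        return sorted(a.name for a in self.assets.values() if tag in a.tags)

    def add_lineage(self, source, target):
        self.edges.append({"source": source, "target": target})

    def upstream(self, name):
        """Return every asset that `name` derives from, directly or indirectly."""
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for edge in self.edges:
                if edge["target"] == current and edge["source"] not in found:
                    found.add(edge["source"])
                    stack.append(edge["source"])
        return sorted(found)


catalog = InMemoryCatalog()
catalog.register({"name": "raw_logs", "schema": {"columns": ["line"]}})
catalog.register(
    {"name": "sales_data", "schema": {"columns": ["id", "date"]}},
    tags=["sensitive"],
)
catalog.register({"name": "processed_reports", "schema": {"columns": ["id"]}})
catalog.add_lineage("raw_logs", "processed_reports")

sensitive_assets = catalog.find_by_tag("sensitive")
report_origins = catalog.upstream("processed_reports")
```

In this sketch a tag such as "sensitive" is only a label; a real deployment would presumably map tags to access rules in whatever policy system it uses.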

Usage Patterns

To use this skill, first set the API key in the environment, e.g., export DATA_CATALOG_API_KEY=your_key. Then work in this order: initialize the catalog, register assets, query as needed, and apply updates as assets change. In pipelines, embed these steps in scripts so that outputs are registered automatically. Always validate metadata before registering or updating an asset to avoid conflicts.
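
A minimal sketch of the authenticate, validate, and register steps follows. It assumes the key is read from DATA_CATALOG_API_KEY as described above. The helpers load_api_key, validate_metadata, and register_output are hypothetical, as are the specific validation rules, and the registry is a plain dict standing in for the real catalog.

```python
import os


def load_api_key(env=None):
    """Read the key from DATA_CATALOG_API_KEY; fail early if it is missing."""
    env = os.environ if env is None else env
    key = env.get("DATA_CATALOG_API_KEY")
    if not key:
        raise RuntimeError("DATA_CATALOG_API_KEY is not set")
    return key


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    name = metadata.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")
    schema = metadata.get("schema")
    columns = schema.get("columns") if isinstance(schema, dict) else None
    if not isinstance(columns, list) or not columns:
        problems.append("schema.columns must be a non-empty list")
    elif not all(isinstance(c, str) for c in columns):
        problems.append("schema.columns must contain only strings")
    elif len(set(columns)) != len(columns):
        problems.append("schema.columns contains duplicates")
    return problems


def register_output(registry, metadata):
    """Validate first, then store the record in a dict (hypothetical registry)."""
    problems = validate_metadata(metadata)
    if problems:
        raise ValueError("; ".join(problems))
    registry[metadata["name"]] = metadata
    return metadata["name"]
```

A pipeline step could call load_api_key() once at startup and register_output(registry, metadata) after writing each output, so that a malformed record stops the run before it reaches the catalog.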