faker-data-generation
Faker Data Generation Patterns
Overview
When generating synthetic data for Databricks Bronze layer tables, use Faker with configurable data corruption to test Silver layer data quality expectations.
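As a minimal sketch of what "configurable data corruption" could look like, the example below builds clean rows with Faker and then damages a configurable fraction of them. The column names (order_id, customer_name, email, amount), the three corruption kinds, and the helper names corrupt_row and generate_bronze_rows are illustrative assumptions, not part of the skill itself.

```python
import random


def corrupt_row(row, rng, corruption_rate):
    """Return a copy of row, damaged with probability corruption_rate."""
    if rng.random() >= corruption_rate:
        return dict(row)  # leave this row clean
    bad = dict(row)
    # Hypothetical corruption kinds; pick ones your Silver expectations check.
    kind = rng.choice(["null_email", "negative_amount", "blank_name"])
    if kind == "null_email":
        bad["email"] = None
    elif kind == "negative_amount":
        bad["amount"] = -abs(bad["amount"])
    else:
        bad["customer_name"] = ""
    return bad


def generate_bronze_rows(n, corruption_rate=0.1, seed=42):
    """Generate n rows; roughly corruption_rate of them are corrupted."""
    # Imported here so corrupt_row can be used without Faker installed.
    from faker import Faker

    fake = Faker()
    fake.seed_instance(seed)   # reproducible Faker output
    rng = random.Random(seed)  # reproducible corruption decisions
    rows = []
    for _ in range(n):
        clean = {
            "order_id": fake.uuid4(),
            "customer_name": fake.name(),
            "email": fake.email(),
            "amount": round(rng.uniform(1.0, 500.0), 2),
        }
        rows.append(corrupt_row(clean, rng, corruption_rate))
    return rows
```

Calling generate_bronze_rows(1000, corruption_rate=0.05) would yield rows in which about 5% violate one of the assumed rules, which gives Silver layer expectations something concrete to reject or quarantine.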
Upstream: Synthetic Data Generation Workflow
The upstream databricks-synthetic-data-generation skill in AI-Dev-Kit introduces a file-based workflow:
File-Based Execution
- Write Python code to a local file (e.g., scripts/generate_data.py)
- Execute on Databricks using the run_python_file_on_databricks MCP tool
- If execution fails, edit the local file and re-execute
Context Reuse
The first execution auto-selects a running cluster and creates an execution context. Pass the returned cluster_id and context_id to follow-up calls so they skip that setup; reused calls take about 1s instead of about 15s.
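The reuse pattern can be sketched as a small wrapper that remembers the two identifiers after the first call. This is only an illustration: the run_tool callable stands in for however your agent invokes the MCP tool, and the file_path argument and the dict-shaped result with cluster_id and context_id keys are assumptions, not the tool's documented interface.

```python
class ExecutionSession:
    """Remember cluster_id and context_id across tool calls (hypothetical helper)."""

    def __init__(self, run_tool):
        # run_tool is a stand-in callable for the MCP tool invocation.
        self._run_tool = run_tool
        self.cluster_id = None
        self.context_id = None

    def run(self, file_path):
        kwargs = {"file_path": file_path}
        # After the first call, pass the saved ids so the context is reused.
        if self.cluster_id and self.context_id:
            kwargs["cluster_id"] = self.cluster_id
            kwargs["context_id"] = self.context_id
        result = self._run_tool(**kwargs)
        # Assumed result shape: a dict that echoes the ids it used.
        self.cluster_id = result.get("cluster_id", self.cluster_id)
        self.context_id = result.get("context_id", self.context_id)
        return result
```

The first run call sends only the file path; every later call through the same session includes both ids.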
Raw Data Only
By default, generate raw transactional data only, with no pre-aggregated fields such as total_x, sum_x, or avg_x. SDP pipelines compute aggregations downstream.
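One way to enforce this rule in a generator script is a small guard over the column names before writing. The sketch below assumes the aggregate fields follow the total_, sum_, and avg_ prefixes named above; the function name find_aggregate_fields is hypothetical.

```python
# Prefixes taken from the rule above; extend if your naming differs.
AGGREGATE_PREFIXES = ("total_", "sum_", "avg_")


def find_aggregate_fields(column_names):
    """Return the column names that look like pre-computed aggregates."""
    return [name for name in column_names if name.startswith(AGGREGATE_PREFIXES)]


columns = ["order_id", "customer_id", "amount", "total_amount", "avg_price"]
offenders = find_aggregate_fields(columns)
if offenders:
    print(f"Drop these before writing to Bronze: {offenders}")
```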