Faker Data Generation Patterns

Overview

When generating synthetic data for Databricks Bronze layer tables, use Faker with configurable data corruption to test Silver layer data quality expectations.

Upstream: Synthetic Data Generation Workflow

The upstream databricks-synthetic-data-generation skill in AI-Dev-Kit introduces a file-based workflow:

File-Based Execution

1. Write Python code to a local file (e.g., scripts/generate_data.py).
2. Execute it on Databricks using the run_python_file_on_databricks MCP tool.
3. If execution fails, edit the local file and re-execute.

Context Reuse

The first execution auto-selects a running cluster and creates an execution context. Reuse its cluster_id and context_id for follow-up calls, which are faster that way (about 1s versus about 15s).
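The reuse pattern can be sketched as a small wrapper that remembers the identifiers from the first run and passes them to every later one. Everything here is hypothetical: ExecutionSession, fake_runner, and the result keys stand in for whatever the real tool accepts and returns, and the actual MCP tool signature may differ.

```python
from typing import Callable, Optional


class ExecutionSession:
    """Remembers cluster_id and context_id from the first run and reuses them."""

    def __init__(self, runner: Callable[..., dict]):
        self.runner = runner
        self.cluster_id: Optional[str] = None
        self.context_id: Optional[str] = None

    def run(self, file_path: str) -> dict:
        result = self.runner(
            file_path, cluster_id=self.cluster_id, context_id=self.context_id
        )
        # Keep whichever identifiers came back so the next call can reuse them.
        self.cluster_id = result.get("cluster_id", self.cluster_id)
        self.context_id = result.get("context_id", self.context_id)
        return result


def fake_runner(file_path, cluster_id=None, context_id=None):
    """Stand-in for the real tool call; it only mimics the reuse behaviour."""
    return {
        "cluster_id": cluster_id or "cluster-demo",
        "context_id": context_id or "context-demo",
        "created_context": context_id is None,  # True only on the first call
        "success": True,
    }


session = ExecutionSession(fake_runner)
first = session.run("scripts/generate_data.py")   # creates a context
second = session.run("scripts/generate_data.py")  # reuses the stored IDs
```

The same session object also fits the edit-and-re-execute loop: after fixing the local file, calling run again goes to the existing context rather than creating a new one.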

Raw Data Only

By default, generate raw transactional data only, with no pre-aggregated fields such as total_x, sum_x, or avg_x. SDP pipelines compute aggregations downstream.
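To illustrate the rule, here is a small sketch that emits order lines containing only row-level facts, plus a guard that flags any column that looks pre-aggregated. The column names and the helpers make_order_lines and aggregate_columns are assumptions made for this example. It uses the standard library random module to stay dependency-free; in a real script the values would come from Faker.

```python
import random

# Prefixes treated as pre-aggregated, mirroring total_x, sum_x, avg_x.
AGGREGATE_PREFIXES = ("total_", "sum_", "avg_")


def make_order_lines(n, seed=0):
    """Return n raw order lines: one row per item, no derived totals."""
    rng = random.Random(seed)
    return [
        {
            "order_id": rng.randint(1000, 1999),
            "product_id": rng.randint(1, 50),
            "quantity": rng.randint(1, 5),
            "unit_price": round(rng.uniform(1.0, 100.0), 2),
        }
        for _ in range(n)
    ]


def aggregate_columns(rows):
    """Return the sorted names of any columns that look pre-aggregated."""
    return sorted(
        {key for row in rows for key in row if key.startswith(AGGREGATE_PREFIXES)}
    )


rows = make_order_lines(5)
assert not aggregate_columns(rows)  # raw rows carry no total_/sum_/avg_ fields
```

A value such as the order total is left out on purpose: it can be derived downstream from quantity and unit_price.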
