databricks-synthetic-data-gen

Catalog and schema are always user-supplied; never default to any value. If the user hasn't provided them, ask. Before any Unity Catalog (UC) write, create the schema if it doesn't already exist, then write the data.
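A minimal sketch of that rule, assuming the caller passes in whatever the user supplied. The helper names `schema_ddl` and `table_name` are hypothetical (not part of this skill), and the backtick check is an illustrative safety assumption rather than a Unity Catalog requirement. The Spark calls appear only as comments and assume an active `SparkSession` named `spark` and a DataFrame named `df`.

```python
def _require(label, value):
    """Refuse to guess: a missing catalog or schema means 'ask the user'."""
    if value is None or not str(value).strip():
        raise ValueError(f"{label} was not provided; ask the user for it")
    if "`" in value:
        # Assumption for this sketch: reject backticks so quoting stays simple.
        raise ValueError(f"{label} {value!r} must not contain backticks")
    return value.strip()


def schema_ddl(catalog, schema):
    """Build the statement that creates the schema if it is missing."""
    catalog = _require("catalog", catalog)
    schema = _require("schema", schema)
    return f"CREATE SCHEMA IF NOT EXISTS `{catalog}`.`{schema}`"


def table_name(catalog, schema, table):
    """Build a fully qualified, backtick-quoted three-level table name."""
    parts = (_require("catalog", catalog),
             _require("schema", schema),
             _require("table", table))
    return ".".join(f"`{p}`" for p in parts)


# Intended order of operations on Databricks (not executed here):
#   spark.sql(schema_ddl(catalog, schema))
#   df.write.mode("overwrite").saveAsTable(table_name(catalog, schema, "tickets"))
```

Raising on a missing value, instead of falling back to a default, keeps the "ask the user" step explicit in the calling code.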

Databricks Synthetic Data Generation

Generate realistic, story-driven synthetic data for Databricks. The strongly recommended approach is Spark + Faker + Pandas UDFs.

Data Must Tell a Business Story

Synthetic data should demonstrate how Databricks helps solve real business problems.

The pattern: Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.

Key principles:

  • Problem → Impact → Analysis → Solution — Include an incident, anomaly, or issue that causes measurable business impact. The data lets you find the root cause and act on it.
  • Industry-relevant but simple — Use domain terms (e.g., "SLA breach", "churn", "stockout") but keep the schema easy to understand. A few tables, clear relationships.
  • Business metrics with $ impact — Revenue, MRR, cost, conversion rate. Every story needs a dollar sign to show why it matters.
  • Tables explain each other — Ticket spike? Incident table shows the outage. Revenue drop? Churn table shows who left and why. All data connects.
  • Actionable insights — Data should answer: What happened? Who's affected? How much did it cost? How do we prevent it?
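The principles above can be sketched with three small linked tables. This is an illustrative example in plain Python (standard library only) so the story logic is easy to read; it is not the skill's own generator, and at scale the same shape would be produced with Spark + Faker + Pandas UDFs. All dates, volumes, and dollar amounts are hypothetical.

```python
import random
from collections import Counter
from datetime import date, timedelta

random.seed(42)  # fixed seed so the story is reproducible

START = date(2025, 1, 1)                    # hypothetical reporting window
DAYS = 30
INCIDENT_DAY = START + timedelta(days=14)   # hypothetical outage date
MRR_PER_CUSTOMER = 500                      # hypothetical flat MRR in dollars

customers = [f"C{i:03d}" for i in range(200)]

# Problem: one incident row explains what went wrong and when.
incidents = [{"incident_id": "INC-1", "day": INCIDENT_DAY, "type": "outage"}]

# Impact: ticket volume is a low baseline, with a spike on the incident day.
tickets = []
for offset in range(DAYS):
    day = START + timedelta(days=offset)
    volume = 40 if day == INCIDENT_DAY else random.randint(3, 8)
    for customer_id in random.sample(customers, volume):
        tickets.append({"customer_id": customer_id, "day": day})

# Analysis: affected customers are those who filed a ticket during the outage.
affected = sorted({t["customer_id"] for t in tickets if t["day"] == INCIDENT_DAY})

# Churn draws only from affected customers, so the tables explain each other.
churned = [
    {"customer_id": c, "reason": "outage", "lost_mrr": MRR_PER_CUSTOMER}
    for c in random.sample(affected, 10)
]

tickets_per_day = Counter(t["day"] for t in tickets)
lost_mrr = sum(row["lost_mrr"] for row in churned)

print(f"Tickets on incident day: {tickets_per_day[INCIDENT_DAY]}")  # 40
print(f"Lost MRR: ${lost_mrr:,}")                                   # $5,000
```

Because churn is sampled only from customers with an incident-day ticket, a join from churn to tickets to incidents leads back to the outage, which is the root-cause path the story is meant to demonstrate.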

First seen: Mar 5, 2026