ai-generating-data

Generate Synthetic Training Data

Guide the user through generating high-quality synthetic training data with DSPy. This addresses the "I do not have data" problem that blocks most other AI workflows.

When NOT to generate synthetic data

You have enough real data: 200+ labeled examples is usually enough for optimization, and real data is generally preferable to synthetic data.
Exact-match tasks: if your task has a known correct answer (math, lookup, structured extraction from templates), write a script to generate test cases programmatically instead of using an LM.
The LM does not understand your domain: synthetic data inherits the generator LM's biases. For highly specialized domains (medical, legal, niche industry), a few real expert-labeled examples outweigh hundreds of synthetic ones.
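
For the exact-match case above, a script can produce test cases whose answers are correct by construction. The following is a minimal sketch, assuming a toy integer-addition task. The function name `make_addition_cases` and the `question`/`answer` field names are illustrative choices, not part of the skill.

```python
import random


def make_addition_cases(n, seed=0, low=0, high=999):
    """Return n exact-match test cases for integer addition (hypothetical task)."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    cases = []
    for _ in range(n):
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        # The answer is computed, not generated by an LM, so it is always correct.
        cases.append({"question": f"What is {a} + {b}?", "answer": str(a + b)})
    return cases


cases = make_addition_cases(5)
for case in cases:
    print(case["question"], "->", case["answer"])
```

If the rest of the workflow uses DSPy, each dictionary could then be wrapped in a `dspy.Example` with `question` marked as the input field.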

Step 1: Understand the data gap

Ask the user:

What does your AI do? (classification, extraction, Q&A, generation?)
How many real examples do you have? (zero, a handful, or hundreds with gaps?)
What is the gap? (no data at all, missing categories, edge cases, privacy constraints?)
What format are the inputs/outputs? (text in/category out, text in/JSON out, etc.)
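
The answers to the last two questions can be written down as a small typed record before any data is generated. The sketch below assumes a "text in/category out" task. The class name, the label set, and both helper functions are hypothetical and only illustrate one way to record the format and spot missing categories.

```python
from dataclasses import dataclass

# Hypothetical label set for an assumed support-ticket classifier.
ALLOWED_CATEGORIES = {"billing", "technical", "account"}


@dataclass(frozen=True)
class LabeledExample:
    text: str      # input: raw text
    category: str  # output: one label from ALLOWED_CATEGORIES


def validate(example):
    """Return a list of problems; an empty list means the example is well-formed."""
    problems = []
    if not example.text.strip():
        problems.append("empty text")
    if example.category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {example.category!r}")
    return problems


def missing_categories(examples):
    """Return the categories that have no examples yet (the data gap)."""
    return ALLOWED_CATEGORIES - {e.category for e in examples}


good = LabeledExample("I was charged twice this month.", "billing")
bad = LabeledExample("", "refunds")

print(validate(good))
print(validate(bad))
print(sorted(missing_categories([good])))
```

Running the gap check on the handful of real examples the user already has gives a concrete list of categories that synthetic data would need to cover.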

Step 2: Define what an example looks like

Repository: lebsral/dspy-pr…s-skills

First Seen: Feb 8, 2026

Security Audits: Gen Agent Trust Hub (Pass), Socket (Warn), Snyk (Pass)

More from lebsral/dspy-programming-not-prompting-lms-skills

ai-switching-models

ai-stopping-hallucinations

ai-do

ai-reasoning

ai-building-chatbots

ai-improving-accuracy