ai-generating-data

Generate Synthetic Training Data

Guide the user through generating high-quality synthetic training data with DSPy. This addresses the "I do not have data" problem that blocks most other AI workflows.

When NOT to generate synthetic data

  • You have enough real data — 200+ labeled examples is usually enough for optimization. Real data is always better than synthetic.
  • Exact-match tasks — if your task has a known correct answer (math, lookup, structured extraction from templates), write a script to generate test cases programmatically instead of using an LM.
  • The LM does not understand your domain — synthetic data inherits the generator LM's biases. For highly specialized domains (medical, legal, niche industry), a few real expert-labeled examples outweigh hundreds of synthetic ones.
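
For the exact-match case in the list above, a plain script is enough and no LM is involved. The following is a minimal sketch, assuming a hypothetical arithmetic task; the function name `make_addition_cases` and the `question`/`answer` field names are illustrative and not part of DSPy.

```python
import random


def make_addition_cases(n, seed=0):
    """Generate n addition problems with known-correct answers, no LM involved."""
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    cases = []
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        # The answer is computed, so every label is correct by construction.
        cases.append({"question": f"What is {a} + {b}?", "answer": str(a + b)})
    return cases


cases = make_addition_cases(5)
```

Because the labels are computed rather than generated, they cannot inherit an LM's mistakes, which is the reason to prefer this route whenever the task allows it.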

Step 1: Understand the data gap

Ask the user:

  1. What does your AI do? (classification, extraction, Q&A, generation?)
  2. How many real examples do you have? (zero, a handful, or hundreds with gaps?)
  3. What is the gap? (no data at all, missing categories, edge cases, privacy constraints?)
  4. What format are the inputs/outputs? (text in/category out, text in/JSON out, etc.)
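
When the user already has some labeled data, questions 2 and 3 can often be answered by counting what exists. The following is a minimal sketch, assuming examples are dicts with a hypothetical `label` key; the helper name `find_gaps` and the per-category target are placeholders, not recommendations from the source.

```python
from collections import Counter


def find_gaps(examples, categories, min_per_category=20):
    """Return how many more examples each category needs to reach the target."""
    counts = Counter(ex["label"] for ex in examples)
    return {
        cat: min_per_category - counts[cat]
        for cat in categories
        if counts[cat] < min_per_category
    }


real = [{"text": "refund please", "label": "billing"}] * 3
gaps = find_gaps(real, ["billing", "shipping"], min_per_category=5)
# gaps == {"billing": 2, "shipping": 5}
```

Categories with a nonzero shortfall are the ones synthetic generation would target first.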

Step 2: Define what an example looks like

More from lebsral/dspy-programming-not-prompting-lms-skills

First Seen
Feb 8, 2026