eval-bootstrap

Eval Bootstrap — Generate Evaluators from Production Traces

Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:

  • sdk_code (default) — Python .py file using the Datadog Evals SDK (BaseEvaluator / LLMJudge) for offline experiments (a minimal sketch of the kind of check a generated evaluator encodes follows this list).
  • data_only — self-contained JSON spec, framework-agnostic.
  • publish — write online LLM-judge evaluators directly to Datadog via create_or_update_llmobs_evaluator. These run automatically on matching production spans or traces, with no dataset and no task function. The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (for example, a per-LLM-call tone check is span-scoped, while judging whether an agent completed its goal needs the whole trace); the user accepts or overrides the classification at the proposal checkpoint.

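For a concrete sense of the sdk_code output, the sketch below shows the kind of check a generated evaluator might encode. The SDK wiring (subclassing BaseEvaluator or LLMJudge) is omitted because the exact class interfaces are not documented in this file; the function name, signature, and scoring heuristic are illustrative assumptions, not the skill's actual output.

```python
# Illustrative only: the evaluator name, signature, and heuristic are
# assumptions. A real sdk_code evaluator would subclass BaseEvaluator or
# LLMJudge from the Datadog Evals SDK rather than being a bare function.

def answer_completeness(input_text: str, output_text: str) -> float:
    """Score in [0, 1]: does the output plausibly address every question asked?"""
    num_questions = input_text.count("?")
    if num_questions == 0:
        return 1.0  # nothing was asked, so nothing can be missing
    # Assumption: roughly 40 characters per answered question as a floor.
    expected_min_chars = 40 * num_questions
    return min(1.0, len(output_text) / expected_min_chars)
```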
Usage

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]

Arguments: $ARGUMENTS
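
For example, to publish online evaluators for a hypothetical ml_app named checkout-assistant over a one-week window (assuming the window syntax accepts values like 7d):

/eval-bootstrap checkout-assistant --timeframe 7d --publish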

Inputs

| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| ml_app | Yes | – | LLM Observability application whose production traces are sampled. |
| --timeframe <window> | No | | Time window of production traces to analyze. |
| --data-only / --publish | No | sdk_code mode | Selects the data_only or publish output mode; omit both to get the default sdk_code output. |
