eval-bootstrap
Eval Bootstrap — Generate Evaluators from Production Traces
Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:
- sdk_code (default): a Python .py file using the Datadog Evals SDK (BaseEvaluator / LLMJudge) for offline experiments.
- data_only: a self-contained JSON spec, framework-agnostic.
- publish: writes online LLM-judge evaluators directly to Datadog via create_or_update_llmobs_evaluator. These run automatically on matching production spans or traces (no dataset, no task function). The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (for example, a per-LLM-call tone check versus an agent goal-completion check that needs the whole trace); the user accepts or overrides the classification at the proposal checkpoint.
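The span-versus-trace decision in publish mode can be pictured roughly as follows. This is only an illustrative sketch: ProposedEvaluator, needs_whole_trace, scope_override, and resolve_scope are hypothetical names invented here, not part of the skill or of any Datadog API, and the real classification may weigh more signals than a single flag.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    SPAN = "span"    # judged per LLM call, e.g. a tone check
    TRACE = "trace"  # judged over the whole trace, e.g. agent goal completion


@dataclass
class ProposedEvaluator:
    """Hypothetical record for one evaluator shown at the proposal checkpoint."""
    name: str
    needs_whole_trace: bool                 # assumption: set by the auto-classifier
    scope_override: Optional[Scope] = None  # assumption: set if the user overrides


def resolve_scope(evaluator: ProposedEvaluator) -> Scope:
    """Return the user's override if present, else the auto-classified scope."""
    if evaluator.scope_override is not None:
        return evaluator.scope_override
    return Scope.TRACE if evaluator.needs_whole_trace else Scope.SPAN
```

Keeping the override separate from the auto-classified flag means the original proposal is still visible after the user changes it.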
Usage
/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]
Arguments: $ARGUMENTS
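To make the argument grammar concrete, here is a minimal sketch of how the usage line could map onto the three output modes, using Python's standard argparse module. It assumes that --data-only and --publish are mutually exclusive (as the `|` in the usage line suggests) and that omitting both selects sdk_code; the helper names build_parser and parse_invocation are hypothetical, and the skill itself may interpret $ARGUMENTS differently.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the usage line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="/eval-bootstrap")
    parser.add_argument("ml_app", help="ML application whose traces are sampled")
    parser.add_argument("--timeframe", metavar="window", default=None,
                        help="trace sampling window; format is not specified here")

    # At most one of the two mode flags may be given.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--data-only", dest="mode",
                      action="store_const", const="data_only")
    mode.add_argument("--publish", dest="mode",
                      action="store_const", const="publish")

    # With neither flag, fall back to the default output mode.
    parser.set_defaults(mode="sdk_code")
    return parser


def parse_invocation(argv: List[str]) -> argparse.Namespace:
    """Parse the arguments that follow the /eval-bootstrap command name."""
    return build_parser().parse_args(argv)
```

Under these assumptions, parse_invocation(["my-app", "--publish"]) yields mode "publish", while passing both flags makes argparse exit with a usage error.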