eval-bootstrap

Eval Bootstrap — Generate Evaluators from Production Traces

Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:

  • sdk_code (default) — Python .py file using the Datadog Evals SDK (BaseEvaluator / LLMJudge) for offline experiments (a minimal sketch of the kind of check a generated evaluator encodes follows this list).
  • data_only — self-contained JSON spec, framework-agnostic.
  • publish — write online LLM-judge evaluators directly to Datadog via create_or_update_llmobs_evaluator. These run automatically on matching production spans or traces, with no dataset and no task function. The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (for example, a per-LLM-call tone check is span-scoped, while judging whether an agent completed its goal needs the whole trace); the user accepts or overrides the classification at the proposal checkpoint.

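For a concrete sense of the sdk_code output, the sketch below shows the kind of check a generated evaluator might encode. The SDK wiring (subclassing BaseEvaluator or LLMJudge) is omitted because the exact class interfaces are not documented in this file; the function name, signature, and scoring heuristic are illustrative assumptions, not the skill's actual output.

```python
# Illustrative only: the evaluator name, signature, and heuristic are
# assumptions. A real sdk_code evaluator would subclass BaseEvaluator or
# LLMJudge from the Datadog Evals SDK rather than being a bare function.

def answer_completeness(input_text: str, output_text: str) -> float:
    """Score in [0, 1]: does the output plausibly address every question asked?"""
    num_questions = input_text.count("?")
    if num_questions == 0:
        return 1.0  # nothing was asked, so nothing can be missing
    # Assumption: roughly 40 characters per answered question as a floor.
    expected_min_chars = 40 * num_questions
    return min(1.0, len(output_text) / expected_min_chars)
```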
Usage

/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]

Arguments: $ARGUMENTS
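
For example, to publish online evaluators for a hypothetical ml_app named checkout-assistant over a one-week window (assuming the window syntax accepts values like 7d):

/eval-bootstrap checkout-assistant --timeframe 7d --publish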

Inputs

| Input | Required | Default | Description |
|-------|----------|---------|-------------|
| ml_app | Yes | – | LLM Observability application whose production traces are sampled. |
| --timeframe <window> | No | | Time window of production traces to analyze. |
| --data-only / --publish | No | sdk_code mode | Selects the data_only or publish output mode; omit both to get the default sdk_code output. |
