Eval Bootstrap — Generate Evaluators from Production Traces
Given a sample of production LLM traces, analyze input/output patterns and quality dimensions, then emit a ready-to-use evaluator suite. Three output modes:
- `sdk_code` (default): a Python `.py` file using the Datadog Evals SDK (`BaseEvaluator`/`LLMJudge`) for offline experiments; see the sketch after this list.
- `data_only`: a self-contained, framework-agnostic JSON spec.
- `publish`: writes online LLM-judge evaluators directly to Datadog via `create_or_update_llmobs_evaluator`. These run automatically on matching production spans or traces (no dataset, no task function). The skill auto-classifies each proposed evaluator as span-scoped or trace-scoped based on what the judgment requires (a per-LLM-call tone check vs. an agent goal-completion check that needs the whole trace); the user accepts or overrides the classification at the proposal checkpoint.
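To make the default `sdk_code` output concrete, here is a minimal sketch of the evaluator shape it targets. `BaseEvaluator` and `LLMJudge` are the class names given above; everything else (the stub base class, the `evaluate()` signature, and the return shape) is an assumption for illustration, not the SDK's documented API.

```python
# Self-contained sketch; runnable as-is. The stub BaseEvaluator stands in
# for the SDK class of the same name so the example has no external
# dependency; its evaluate() signature and return shape are assumptions.


class BaseEvaluator:
    """Stub standing in for the Datadog Evals SDK's BaseEvaluator."""

    name: str = "unnamed"

    def evaluate(self, input: str, output: str) -> dict:
        raise NotImplementedError


class ToneCheck(BaseEvaluator):
    """Span-scoped: judges one LLM call's output, no trace context needed."""

    name = "tone_check"

    def evaluate(self, input: str, output: str) -> dict:
        # A generated evaluator would delegate to an LLMJudge rubric here;
        # a keyword heuristic keeps the sketch dependency-free.
        casual = any(w in output.lower() for w in ("lol", "hey", "gonna"))
        return {"label": "casual" if casual else "professional",
                "score": 0.0 if casual else 1.0}


if __name__ == "__main__":
    print(ToneCheck().evaluate("Greet the user", "Hey! Gonna help you out."))
```

A generated suite would contain one such class per quality dimension found in the traces, with an `LLMJudge`-backed rubric in place of the keyword heuristic.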
Usage
/eval-bootstrap <ml_app> [--timeframe <window>] [--data-only | --publish]
Arguments: $ARGUMENTS
Inputs
| Input | Required | Default | Description |
|-------|----------|---------|-------------|