ag2-evaluation


Evaluation — run, grade, and track an agent

When to use

  • Evaluate / test / benchmark an AG2 beta Agent, or build a regression / CI gate
  • Grade answers for correctness, tool use, cost, or subjective quality
  • Track a metric across versions (did this change help or regress?)

To compare two or more builds head-to-head or on a leaderboard, use ag2-eval-comparison.

Install

pip install "ag2[openai,tracing]"

run_agent reconstructs each task's trace from OpenTelemetry spans, so the tracing extra is required. Run this install before delivering the code. If you cannot run commands, state the exact pip install command.
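One way to confirm the install took effect before running an evaluation is a quick import check. This is a minimal sketch, not part of the skill itself. It assumes that AG2 is importable as `autogen` and that the tracing extra provides the `opentelemetry` package. The helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Assumption: AG2 imports as `autogen`, and the tracing extra pulls in
# the `opentelemetry` package. Adjust the names if your install differs.
REQUIRED_MODULES = ("autogen", "opentelemetry")


def missing_modules(required=REQUIRED_MODULES):
    """Return the required top-level modules that cannot be located."""
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        # Print the exact command from the Install section.
        print("Missing modules:", ", ".join(missing))
        print('Run: pip install "ag2[openai,tracing]"')
    else:
        print("AG2 and OpenTelemetry are importable.")
```

`find_spec` only locates the module without importing it, so the check stays cheap and has no side effects.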

The loop — dataset, agent, scorers, run_agent
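The four pieces named in this heading can be pictured with a library-free sketch. It does not use the AG2 API, and it makes no claim about the signature of run_agent. Every name in it (`Task`, `exact_match`, `evaluate`, `stub_agent`) is hypothetical and exists only to show the shape of the loop: a dataset of tasks, an agent that answers them, scorers that grade each answer, and a runner that averages the scores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    """One dataset row: a prompt and the answer we expect (hypothetical type)."""
    prompt: str
    expected: str


def exact_match(answer: str, task: Task) -> float:
    """Scorer: 1.0 when the answer equals the expected text, ignoring case and outer spaces."""
    return 1.0 if answer.strip().lower() == task.expected.strip().lower() else 0.0


def evaluate(
    agent: Callable[[str], str],
    dataset: List[Task],
    scorers: Dict[str, Callable[[str, Task], float]],
) -> Dict[str, float]:
    """Run the agent on every task and return the mean score per scorer."""
    if not dataset:
        raise ValueError("dataset must contain at least one task")
    totals = {name: 0.0 for name in scorers}
    for task in dataset:
        answer = agent(task.prompt)
        for name, scorer in scorers.items():
            totals[name] += scorer(answer, task)
    return {name: total / len(dataset) for name, total in totals.items()}


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent: a fixed lookup table, no model call."""
    canned = {"2 + 2": "4", "capital of France": "paris"}
    return canned.get(prompt, "unknown")


dataset = [
    Task("2 + 2", "4"),
    Task("capital of France", "Paris"),
    Task("largest planet", "Jupiter"),
]
results = evaluate(stub_agent, dataset, {"exact_match": exact_match})
print(results)
```

Keeping scorers as plain functions keyed by name makes it easy to add a second metric, such as cost or tool use, without touching the runner.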

First seen: May 28, 2026
ag2-evaluation — ag2ai/ag2-skills