Evaluation — run, grade, and track an agent

When to use

Evaluate / test / benchmark an AG2 beta Agent, or build a regression / CI gate
Grade answers for correctness, tool use, cost, or subjective quality
Track a metric across versions (did this change help or regress?)

To compare two or more builds head-to-head or on a leaderboard, use ag2-eval-comparison instead.

Install

pip install "ag2[openai,tracing]"

run_agent reconstructs each task's trace from OpenTelemetry spans, so the tracing extra is required. Run the install before delivering any code. If you cannot run commands, give the user the exact pip install command shown above.
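Because a missing tracing extra would leave run_agent with no spans to reconstruct, it can help to check the environment before starting an evaluation. The sketch below is not part of AG2. It assumes that AG2 is importable as `autogen` and that the tracing extra provides the top-level `opentelemetry` package; `missing_modules` and `REQUIRED` are hypothetical names used only for illustration.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised for dotted names with a missing parent, or a broken spec.
            found = False
        if not found:
            missing.append(name)
    return missing


# Assumption: these are the top-level packages installed by
# pip install "ag2[openai,tracing]"; adjust if your install differs.
REQUIRED = ["autogen", "opentelemetry"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
        print('Run: pip install "ag2[openai,tracing]"')
    else:
        print("Evaluation dependencies found.")
```

Running this at the top of a CI job turns a missing dependency into a readable message rather than a failure partway through an evaluation run.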

The loop — dataset, agent, scorers, run_agent

Repository

ag2ai/ag2-skills

First Seen

May 28, 2026