Evaluation — comparing builds (variants & pairwise)

When to use

Rank N models / prompts / configs on a leaderboard → run_variants
Decide which of two is better, head-to-head → run_pairwise with pairwise_judge (LLM) or human_pairwise (people)
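The difference between the two modes can be sketched in plain Python. This is only an illustration of the idea, not the ag2 API: `rank_variants` and `pairwise_winner` below are hypothetical helpers, and the scores and verdicts are made-up sample data. It assumes each variant has already been scored per task (leaderboard case) or that a judge has already returned a per-task verdict (head-to-head case).

```python
from statistics import mean


def rank_variants(scores_by_variant):
    """Hypothetical helper: order variants by mean score, best first."""
    means = {name: mean(scores) for name, scores in scores_by_variant.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


def pairwise_winner(verdicts):
    """Hypothetical helper: majority vote over per-task verdicts.

    Each verdict is 'a', 'b', or 'tie'; ties do not count for either side.
    """
    a_wins = verdicts.count("a")
    b_wins = verdicts.count("b")
    if a_wins == b_wins:
        return "tie"
    return "a" if a_wins > b_wins else "b"


# Leaderboard: N variants, each with one score per task (sample data).
leaderboard = rank_variants({
    "prompt_v1": [0.6, 0.7, 0.8],
    "prompt_v2": [0.9, 0.8, 0.7],
    "prompt_v3": [0.5, 0.4, 0.6],
})

# Head-to-head: one verdict per task from an LLM judge or a person (sample data).
head_to_head = pairwise_winner(["a", "b", "a", "tie"])
```

A leaderboard needs an absolute score per variant, while a head-to-head comparison only needs a relative preference per task, which is often easier for a judge to give reliably.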

For running and grading a single agent (scorers, CI, persistence), use ag2-evaluation.

pip install "ag2[openai,tracing]"

Required. Run this install before delivering the code. If you cannot run commands, state the exact pip install command.

First seen: May 28, 2026
