ag2-eval-comparison


Evaluation — comparing builds (variants & pairwise)

When to use

  • Rank N models / prompts / configs on a leaderboard → run_variants
  • Decide which of two is better, head-to-head → run_pairwise with pairwise_judge (LLM) or human_pairwise (people)
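The two modes above answer different questions, which a library-independent sketch can make concrete. The code below does not use the ag2 API: the score data and the helpers `leaderboard` and `head_to_head` are hypothetical, written only to show how ranking N variants differs from comparing two of them task by task.

```python
from statistics import mean

# Hypothetical per-task scores for three variants (illustrative data only,
# not produced by ag2).
SCORES = {
    "variant_a": [0.9, 0.7, 0.8],
    "variant_b": [0.6, 0.8, 0.5],
    "variant_c": [0.7, 0.7, 0.9],
}


def leaderboard(scores):
    """Rank every variant by its mean score, best first."""
    rows = [(name, mean(values)) for name, values in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


def head_to_head(a_scores, b_scores):
    """Compare two variants task by task; return (wins_a, wins_b, ties)."""
    wins_a = sum(a > b for a, b in zip(a_scores, b_scores))
    wins_b = sum(b > a for a, b in zip(a_scores, b_scores))
    ties = len(a_scores) - wins_a - wins_b
    return wins_a, wins_b, ties


if __name__ == "__main__":
    for name, avg in leaderboard(SCORES):
        print(f"{name}: {avg:.3f}")
    print(head_to_head(SCORES["variant_a"], SCORES["variant_b"]))
```

A leaderboard collapses each variant to one aggregate number, so it scales to many candidates. A head-to-head comparison keeps the per-task outcome, which is why it suits a judge (LLM or human) who only says which of two answers is better.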

For running and grading a single agent (scorers, CI, persistence), use ag2-evaluation.

Install

pip install "ag2[openai,tracing]"

Required. Run this install before delivering the code. If you cannot run commands, state the exact pip install command.
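One way to follow the rule above is to check for the package first and fall back to printing the install command. This is a minimal sketch, assuming the ag2 distribution is importable as `autogen` (an assumption, not stated in this document); `is_installed` and `install_hint` are hypothetical helper names.

```python
import importlib.util

# The exact command from the Install section.
INSTALL_COMMAND = 'pip install "ag2[openai,tracing]"'


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def install_hint(module_name="autogen"):
    """Return None when the module is present, else the command to run.

    The default module name "autogen" is an assumption about how the
    ag2 package is imported.
    """
    if is_installed(module_name):
        return None
    return f"Missing dependency. Run: {INSTALL_COMMAND}"


if __name__ == "__main__":
    hint = install_hint()
    print(hint or "ag2 appears to be installed")
```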

Leaderboard — run_variants

First Seen
May 28, 2026
ag2-eval-comparison — ag2ai/ag2-skills