ag2-eval-comparison
Evaluation — comparing builds (variants & pairwise)
When to use
- Rank N models / prompts / configs on a leaderboard → run_variants
- Decide which of two is better, head-to-head → run_pairwise with pairwise_judge (LLM) or human_pairwise (people)
For running and grading a single agent (scorers, CI, persistence), use ag2-evaluation.
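To make the distinction between the two use cases concrete, here is a minimal, library-independent sketch in plain Python. It does not call run_variants or run_pairwise and makes no claim about their signatures; the variant names, the scores, and the helper functions leaderboard and head_to_head are all hypothetical, chosen only to show what each mode answers.

```python
from statistics import mean

# Hypothetical per-case scores (0..1) for three variants; illustrative only,
# not real results and not produced by ag2.
scores = {
    "variant_a": [0.9, 0.4, 0.7],
    "variant_b": [0.8, 0.6, 0.8],
    "variant_c": [0.5, 0.5, 0.6],
}


def leaderboard(scores):
    """Leaderboard view: rank every variant by mean score, best first."""
    rows = ((name, mean(vals)) for name, vals in scores.items())
    return sorted(rows, key=lambda row: row[1], reverse=True)


def head_to_head(a, b):
    """Pairwise view: count per-case wins for a and b; equal scores are ties.

    Assumes both lists cover the same cases in the same order.
    """
    wins_a = sum(x > y for x, y in zip(a, b))
    wins_b = sum(y > x for x, y in zip(a, b))
    ties = len(a) - wins_a - wins_b
    return wins_a, wins_b, ties


board = leaderboard(scores)
duel = head_to_head(scores["variant_a"], scores["variant_b"])

for name, avg in board:
    print(f"{name}: {avg:.3f}")
print("variant_a vs variant_b (wins_a, wins_b, ties):", duel)
```

The leaderboard orders all N candidates on an absolute score, while the head-to-head count only says which of two candidates wins more often on the same cases. In practice a judge (LLM or human) supplies the per-case preference rather than a numeric comparison.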
Install
pip install "ag2[openai,tracing]"
Required. Run this install before delivering the code. If you cannot run commands, state the exact pip install command.