auto-arena

Auto Arena Skill

End-to-end automated model comparison using the OpenJudge AutoArenaPipeline:

  1. Generate queries — LLM creates diverse test queries from task description
  2. Collect responses — query all target endpoints concurrently
  3. Generate rubrics — LLM produces evaluation criteria from task + sample queries
  4. Pairwise evaluation — judge model compares every model pair (with position-bias swap)
  5. Analyze & rank — compute win rates, win matrix, and rankings
  6. Report & charts — Markdown report + win-rate bar chart + optional matrix heatmap
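The core of steps 4–5 can be sketched in plain Python. The `judge` stub and `pairwise_win_rates` helper below are hypothetical names for illustration, not the OpenJudge API: in the real pipeline an LLM judge replaces the stub. Each model pair is judged twice per query with positions swapped, which offsets the judge's position bias.

```python
from itertools import combinations

def judge(query, answer_a, answer_b):
    # Hypothetical judge: returns "A", "B", or "tie".
    # Stand-in heuristic; the real pipeline uses an LLM judge model.
    return "A" if len(answer_a) >= len(answer_b) else "B"

def pairwise_win_rates(queries, responses):
    """responses: {model_name: {query: answer}}.
    Judges every model pair on every query, once in each position order,
    then returns each model's win rate over all its comparisons."""
    models = list(responses)
    wins = {m: 0 for m in models}
    comparisons = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        for q in queries:
            ra, rb = responses[a][q], responses[b][q]
            # Position-bias swap: evaluate (a, b) and then (b, a).
            for first, second, fa, fb in ((a, b, ra, rb), (b, a, rb, ra)):
                verdict = judge(q, fa, fb)
                comparisons[first] += 1
                comparisons[second] += 1
                if verdict == "A":
                    wins[first] += 1
                elif verdict == "B":
                    wins[second] += 1
    # Win rate per model; a win matrix and ranking follow from the same counts.
    return {m: wins[m] / comparisons[m] for m in models}
```

Ranking the models then reduces to sorting this dict by win rate, descending.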

Prerequisites

```shell
# Install OpenJudge
pip install py-openjudge
```
First seen: Mar 7, 2026