Auto Arena Skill
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline (see the usage sketch after the steps below):
- Generate queries — an LLM creates diverse test queries from the task description
- Collect responses — all target endpoints are queried concurrently
- Generate rubrics — an LLM produces evaluation criteria from the task and sample queries
- Pairwise evaluation — the judge model compares every pair of models (with a position-bias swap)
- Analyze & rank — compute win rates, the win matrix, and rankings
- Report & charts — Markdown report, win-rate bar chart, and an optional win-matrix heatmap
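A minimal usage sketch of driving the pipeline end to end. The import path and every keyword argument below (`task_description`, `models`, `judge_model`, `num_queries`) are illustrative assumptions, not the documented OpenJudge signature; consult the AutoArenaPipeline docs for the real interface.

```python
# Hypothetical sketch -- parameter names and import path are assumptions,
# not the documented OpenJudge API.
from openjudge import AutoArenaPipeline  # import path assumed

pipeline = AutoArenaPipeline(
    task_description="Answer customer questions about a billing API",
    models={                         # target endpoints to compare (format assumed)
        "model-a": "https://api.example.com/v1/model-a",
        "model-b": "https://api.example.com/v1/model-b",
    },
    judge_model="judge-model-name",  # model used for pairwise judging
    num_queries=30,                  # how many test queries to generate
)

report = pipeline.run()  # queries -> responses -> rubrics -> judging -> ranking -> report
```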
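To make the pairwise-evaluation and analysis steps concrete, here is a small self-contained sketch (toy data, not OpenJudge internals) of how pairwise verdicts with a position-bias swap aggregate into win rates, a win matrix, and a ranking:

```python
# Toy aggregation of pairwise verdicts into win rates, a win matrix, and a
# ranking. Illustrative data and logic only, not OpenJudge internals.
models = ["m1", "m2", "m3"]

# Each pair is judged twice with the presentation order swapped to cancel
# position bias; every tuple is (first_model, second_model, winner).
judgments = [
    ("m1", "m2", "m1"), ("m2", "m1", "m1"),
    ("m1", "m3", "m3"), ("m3", "m1", "m3"),
    ("m2", "m3", "m2"), ("m3", "m2", "m3"),
]

wins = {m: 0 for m in models}
comparisons = {m: 0 for m in models}
win_matrix = {(a, b): 0 for a in models for b in models if a != b}

for first, second, winner in judgments:
    loser = second if winner == first else first
    comparisons[first] += 1
    comparisons[second] += 1
    wins[winner] += 1
    win_matrix[(winner, loser)] += 1

win_rate = {m: wins[m] / comparisons[m] for m in models}
ranking = sorted(models, key=win_rate.get, reverse=True)
print(win_rate)  # {'m1': 0.5, 'm2': 0.25, 'm3': 0.75}
print(ranking)   # ['m3', 'm1', 'm2']
```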
Prerequisites
```bash
# Install OpenJudge
pip install py-openjudge
```
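Optionally verify the install. Note that the top-level module name `openjudge` is an assumption here: the PyPI distribution is py-openjudge, and the importable name may differ.

```bash
# Module name assumed; adjust if the package exposes a different import path
python -c "import openjudge"
```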