Auto Arena Skill
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline (see the usage sketch after the steps below):
- Generate queries — an LLM creates diverse test queries from the task description
- Collect responses — all target endpoints are queried concurrently
- Generate rubrics — an LLM produces evaluation criteria from the task and sample queries
- Pairwise evaluation — the judge model compares every pair of models (with a position-bias swap)
- Analyze & rank — compute win rates, the win matrix, and rankings
- Report & charts — Markdown report, win-rate bar chart, and an optional win-matrix heatmap
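A minimal usage sketch of driving the pipeline end to end. The import path and every keyword argument below (`task_description`, `models`, `judge_model`, `num_queries`) are illustrative assumptions, not the documented OpenJudge signature; consult the AutoArenaPipeline docs for the real interface.

```python
# Hypothetical sketch -- parameter names and import path are assumptions,
# not the documented OpenJudge API.
from openjudge import AutoArenaPipeline  # import path assumed

pipeline = AutoArenaPipeline(
    task_description="Answer customer questions about a billing API",
    models={                         # target endpoints to compare (format assumed)
        "model-a": "https://api.example.com/v1/model-a",
        "model-b": "https://api.example.com/v1/model-b",
    },
    judge_model="judge-model-name",  # model used for pairwise judging
    num_queries=30,                  # how many test queries to generate
)

report = pipeline.run()  # queries -> responses -> rubrics -> judging -> ranking -> report
```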
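To make the pairwise-evaluation and analysis steps concrete, here is a small self-contained sketch (toy data, not OpenJudge internals) of how pairwise verdicts with a position-bias swap aggregate into win rates, a win matrix, and a ranking:

```python
# Toy aggregation of pairwise verdicts into win rates, a win matrix, and a
# ranking. Illustrative data and logic only, not OpenJudge internals.
models = ["m1", "m2", "m3"]

# Each pair is judged twice with the presentation order swapped to cancel
# position bias; every tuple is (first_model, second_model, winner).
judgments = [
    ("m1", "m2", "m1"), ("m2", "m1", "m1"),
    ("m1", "m3", "m3"), ("m3", "m1", "m3"),
    ("m2", "m3", "m2"), ("m3", "m2", "m3"),
]

wins = {m: 0 for m in models}
comparisons = {m: 0 for m in models}
win_matrix = {(a, b): 0 for a in models for b in models if a != b}

for first, second, winner in judgments:
    loser = second if winner == first else first
    comparisons[first] += 1
    comparisons[second] += 1
    wins[winner] += 1
    win_matrix[(winner, loser)] += 1

win_rate = {m: wins[m] / comparisons[m] for m in models}
ranking = sorted(models, key=win_rate.get, reverse=True)
print(win_rate)  # {'m1': 0.5, 'm2': 0.25, 'm3': 0.75}
print(ranking)   # ['m3', 'm1', 'm2']
```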
Prerequisites
```bash
# Install OpenJudge
pip install py-openjudge
```
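Optionally verify the install. Note that the top-level module name `openjudge` is an assumption here: the PyPI distribution is py-openjudge, and the importable name may differ.

```bash
# Module name assumed; adjust if the package exposes a different import path
python -c "import openjudge"
```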