eval-recipes-runner

eval-recipes Runner Skill

Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

When to Use

  • User asks to "test with eval-recipes"
  • User says "run the evals" or "benchmark this change"
  • User wants to validate improvements against codex/claude_code
  • User is testing a PR branch to show that it improves scores

Capabilities

I can run eval-recipes benchmarks to:

  1. Test specific amplihack branches
  2. Compare against baseline agents (codex, claude_code)
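
As a rough illustration of the comparison step, the sketch below computes per-task score deltas between an amplihack branch and one baseline agent. It assumes the scores have already been collected into plain dictionaries keyed by task name. The function names, task names, and numbers are hypothetical placeholders, not real eval-recipes output, and no eval-recipes command is shown because this document does not specify one.

```python
from statistics import mean


def compare_scores(candidate: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Return candidate-minus-baseline score deltas for tasks both runs share."""
    shared = sorted(candidate.keys() & baseline.keys())
    if not shared:
        raise ValueError("candidate and baseline have no tasks in common")
    return {task: candidate[task] - baseline[task] for task in shared}


def mean_delta(deltas: dict[str, float]) -> float:
    """Average the per-task deltas; a positive value means the candidate scored higher."""
    return mean(deltas.values())


if __name__ == "__main__":
    # Placeholder scores for illustration only (hypothetical tasks and values).
    branch_scores = {"task_a": 80.0, "task_b": 65.0}
    baseline_scores = {"task_a": 70.0, "task_b": 70.0, "task_c": 50.0}

    deltas = compare_scores(branch_scores, baseline_scores)
    for task, delta in deltas.items():
        print(f"{task}: {delta:+.1f}")
    print(f"mean delta: {mean_delta(deltas):+.1f}")
```

Only tasks present in both runs are compared, so a task that one agent never ran does not skew the average.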

First Seen: Jan 23, 2026