Improvement Evaluator

Measures whether a Skill actually makes AI perform better on real tasks, not just whether the SKILL.md document looks well-structured.

Why Execution Testing Matters

Structural scoring (word count, section presence, formatting) correlates poorly with actual AI task performance. Internal benchmarks showed R²=0.00 between document-structure scores and execution pass rates across 40+ skill evaluations. A perfectly formatted SKILL.md can still produce failing task outputs if the instructions mislead the model or omit critical constraints.
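
To make the R² figure concrete, here is a minimal sketch of how such a value could be computed. It assumes R² means the squared Pearson correlation between a structure score and an execution pass count. The function name `r_squared` and the sample numbers are hypothetical and chosen only for illustration; they are not the benchmark data referred to above.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series carries no linear signal; report 0 by convention.
        return 0.0
    return (cov * cov) / (var_x * var_y)


# Made-up illustrative values: structure scores rise steadily,
# while the number of passing tasks does not follow them.
structure_scores = [60, 70, 80, 90]
tasks_passed = [4, 2, 2, 4]

print(r_squared(structure_scores, tasks_passed))
```

A value near 0 means that knowing the structure score tells you nothing, in a linear sense, about how many tasks pass.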

Tradeoff: Execution testing is slower and more expensive than structural checks because it invokes the AI model once per task. A 7-task suite at pass@1 costs roughly 7 API calls per candidate plus 7 for the baseline, about 14 in total when no baseline is cached. This cost is acceptable because structural scoring alone gives no signal about whether the skill actually works. To offset cost, the evaluator caches baseline results for 7 days and supports --pass-k 1 (single attempt).
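
The cost arithmetic and the 7-day baseline cache can be sketched as follows. This is an illustration under stated assumptions, not the evaluator's actual code: the names `api_calls`, `baseline_is_fresh`, and `BASELINE_TTL` are hypothetical, and the sketch assumes that cost scales linearly with the number of attempts per task and that the baseline runs with the same number of attempts as the candidate.

```python
from datetime import datetime, timedelta, timezone

# From the text: baseline results are cached for 7 days.
BASELINE_TTL = timedelta(days=7)


def api_calls(num_tasks, pass_k=1, baseline_cached=False):
    """Estimate model invocations needed to evaluate one candidate.

    Assumption: each task is attempted pass_k times, and an uncached
    baseline needs the same number of calls as the candidate.
    """
    candidate_calls = num_tasks * pass_k
    baseline_calls = 0 if baseline_cached else num_tasks * pass_k
    return candidate_calls + baseline_calls


def baseline_is_fresh(cached_at, now):
    """Return True while a cached baseline is younger than the TTL."""
    return now - cached_at < BASELINE_TTL


now = datetime(2025, 1, 10, tzinfo=timezone.utc)
cached_at = datetime(2025, 1, 5, tzinfo=timezone.utc)  # 5 days old

cached = baseline_is_fresh(cached_at, now)
print(api_calls(7, pass_k=1, baseline_cached=cached))
```

With a fresh cached baseline, the 7-task suite needs only the 7 candidate calls; once the cache expires, the estimate returns to 14.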
