Staged Evaluation
A key optimization from HyperAgents: don't waste compute evaluating obviously broken mutations. Run a cheap quick check first, and only invest in full evaluation for promising candidates.
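The quick-check-then-full-evaluation idea can be sketched roughly as follows. This is a minimal illustration, not the HyperAgents implementation: the names `staged_evaluate`, `quick_check`, `full_eval`, and `quick_threshold` are hypothetical, and the toy check (does the candidate compile as Python?) is just one assumed example of a cheap filter.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

C = TypeVar("C")


def staged_evaluate(
    candidates: Iterable[C],
    quick_check: Callable[[C], float],
    full_eval: Callable[[C], float],
    quick_threshold: float = 0.5,  # hypothetical cutoff, tune per task
) -> List[Tuple[C, float]]:
    """Run full_eval only on candidates whose quick_check score meets the threshold."""
    scored = []
    for candidate in candidates:
        # Stage 1: cheap filter. Candidates that fail never reach stage 2.
        if quick_check(candidate) < quick_threshold:
            continue
        # Stage 2: expensive evaluation, only for promising candidates.
        scored.append((candidate, full_eval(candidate)))
    return scored


# Toy usage: candidates are Python source strings.
full_eval_calls: List[str] = []  # records which candidates reached stage 2


def toy_quick_check(src: str) -> float:
    """Assumed cheap check: 1.0 if the source compiles, 0.0 otherwise."""
    try:
        compile(src, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    return 1.0


def toy_full_eval(src: str) -> float:
    """Stand-in for an expensive evaluation (test suite, benchmark, judge)."""
    full_eval_calls.append(src)
    return float(len(src))


survivors = staged_evaluate(
    ["x = 1", "def broken(:", "y = 2 + 2"],
    toy_quick_check,
    toy_full_eval,
)
```

In this toy run the syntactically broken candidate is dropped by the quick check, so the stand-in for the expensive evaluation runs on two of the three candidates.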
The Problem
Full evaluation is expensive:
- Running a full test suite takes minutes
- LLM-as-judge evaluations cost tokens
- Benchmark suites can take hours
- Most mutations (especially early ones) produce broken or worse code, so much of that expense is spent on candidates that get discarded
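To make the trade-off concrete, here is a rough cost model, assuming every candidate pays for the quick check and only the fraction that passes pays for the full evaluation. The function name and all numbers are illustrative assumptions, not measurements from HyperAgents.

```python
def expected_cost_per_candidate(
    quick_cost: float, full_cost: float, pass_rate: float
) -> float:
    """Expected cost when only a pass_rate fraction of candidates gets the full evaluation."""
    return quick_cost + pass_rate * full_cost


# Illustrative numbers only: a 5-second quick check, a 300-second full
# evaluation, and a quarter of the candidates passing the quick check.
staged_cost = expected_cost_per_candidate(quick_cost=5.0, full_cost=300.0, pass_rate=0.25)
unstaged_cost = 300.0  # every candidate gets the full evaluation

print(staged_cost, unstaged_cost)  # 80.0 300.0
```

Under this model, staging pays off whenever `quick_cost < (1 - pass_rate) * full_cost`, which is why it helps most when the quick check is cheap and most candidates fail it.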