Evaluation Methods for Agent Systems
Evaluate agent systems differently from traditional software: agents make dynamic decisions, vary between runs, and often lack a single correct answer. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve their intended effects.
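As a minimal sketch of what accounting for these characteristics can look like, the snippet below runs an agent several times on one task, scores each output with a predicate rather than an exact-match answer, and flags a regression when the pass rate falls below a stored baseline. The names `evaluate`, `is_regression`, and `EvalResult`, and the default `runs` and `tolerance` values, are assumptions made for illustration; they are not part of this skill or of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    """Aggregate outcome of repeated runs on one task (hypothetical type)."""
    pass_rate: float
    runs: int


def evaluate(
    agent: Callable[[str], str],
    task: str,
    check: Callable[[str], bool],
    runs: int = 5,
) -> EvalResult:
    """Run the agent `runs` times and report the fraction of acceptable outputs.

    `check` is a predicate, so any output that satisfies it counts as a pass;
    this avoids assuming a single correct answer.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if check(agent(task)))
    return EvalResult(pass_rate=passes / runs, runs=runs)


def is_regression(
    current: EvalResult,
    baseline_pass_rate: float,
    tolerance: float = 0.1,
) -> bool:
    """Flag a regression when the pass rate drops more than `tolerance`
    below the baseline. The tolerance absorbs run-to-run variation."""
    return current.pass_rate < baseline_pass_rate - tolerance


if __name__ == "__main__":
    # Stub agent standing in for a real, non-deterministic one (assumption).
    def stub_agent(task: str) -> str:
        return "The answer is 4."

    result = evaluate(stub_agent, "What is 2 + 2?", lambda out: "4" in out)
    print(result, is_regression(result, baseline_pass_rate=0.9))
```

Reporting a pass rate over several runs, instead of a single pass or fail, keeps one unlucky sample from blocking a deployment and keeps one lucky sample from hiding a real drop in quality.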
When to Activate
Activate this skill when:
- Testing agent performance systematically
- Validating context engineering choices
- Measuring improvements over time
- Catching regressions before deployment
- Building quality gates for agent pipelines
- Comparing different agent configurations
- Evaluating production systems continuously
Do not activate this skill for adjacent work owned by other skills:
- Designing the LLM judge itself, pairwise comparison, judge calibration, or bias mitigation: advanced-evaluation.
- Debugging a specific context failure mode before measuring it: context-degradation.