evaluation
Evaluation Methods for Agent Systems
Evaluate agent systems differently from traditional software because agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve intended effects.
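Because runs of the same task can differ, a single passing run says little. The sketch below shows one way to account for this: run the task several times and gate on the pass rate. All names here (`pass_rate`, `make_flaky_agent`, `passes_gate`) and the 0.8 threshold are assumptions made for illustration, not part of any particular framework, and the flaky agent is a stand-in for a real agent call.

```python
import random
from typing import Callable


def pass_rate(
    run_agent: Callable[[str], str],
    check: Callable[[str], bool],
    task: str,
    trials: int = 10,
) -> float:
    """Run one task `trials` times and return the fraction of runs that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if check(run_agent(task)))
    return passes / trials


def make_flaky_agent(seed: int, success_probability: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: answers correctly only some of the time."""
    rng = random.Random(seed)

    def agent(task: str) -> str:
        return "4" if rng.random() < success_probability else "5"

    return agent


def passes_gate(rate: float, threshold: float = 0.8) -> bool:
    """Quality gate: require a minimum pass rate rather than a single green run."""
    return rate >= threshold


if __name__ == "__main__":
    agent = make_flaky_agent(seed=0, success_probability=0.9)
    rate = pass_rate(agent, lambda out: out.strip() == "4", "What is 2 + 2?", trials=20)
    print(f"pass rate: {rate:.2f}, gate passed: {passes_gate(rate)}")
```

Tracking the pass rate across versions, rather than a single pass/fail result, makes it easier to tell a regression from ordinary run-to-run variation.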
When to Activate
Activate this skill when:
- Testing agent performance systematically
- Validating context engineering choices
- Measuring improvements over time
- Catching regressions before deployment
- Building quality gates for agent pipelines
- Comparing different agent configurations
- Evaluating production systems continuously
Core Concepts
Focus evaluation on outcomes rather than execution paths, because agents may find alternative valid routes to goals. Judge whether the agent achieves the right outcome via a reasonable process, not whether it followed a specific sequence of steps.
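As a minimal sketch of outcome-focused evaluation, the example below checks the final state an agent produced and applies only loose process constraints (a step budget and a list of forbidden tools) instead of comparing against a fixed sequence of steps. The `AgentRun` record, the tool names, and the thresholds are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: tools called and the resulting state."""
    tool_calls: list[str] = field(default_factory=list)
    final_state: dict[str, str] = field(default_factory=dict)


def outcome_ok(run: AgentRun, expected_state: dict[str, str]) -> bool:
    """Pass if every expected key/value pair appears in the final state."""
    return all(run.final_state.get(k) == v for k, v in expected_state.items())


def process_ok(run: AgentRun, max_steps: int, forbidden_tools: frozenset[str]) -> bool:
    """Loose process check: bounded step count, no forbidden tools, no fixed order."""
    within_budget = len(run.tool_calls) <= max_steps
    uses_forbidden = bool(forbidden_tools.intersection(run.tool_calls))
    return within_budget and not uses_forbidden


def evaluate(
    run: AgentRun,
    expected_state: dict[str, str],
    max_steps: int = 10,
    forbidden_tools: frozenset[str] = frozenset(),
) -> bool:
    """Right outcome via a reasonable process, regardless of the exact path taken."""
    return outcome_ok(run, expected_state) and process_ok(run, max_steps, forbidden_tools)


if __name__ == "__main__":
    expected = {"ticket_status": "closed"}
    # Two different routes to the same goal; both should be accepted.
    route_a = AgentRun(["search", "read", "update_ticket"], {"ticket_status": "closed"})
    route_b = AgentRun(["read", "update_ticket"], {"ticket_status": "closed"})
    # Reaches the goal, but through a tool the process check rules out.
    route_c = AgentRun(["delete_database", "update_ticket"], {"ticket_status": "closed"})
    banned = frozenset({"delete_database"})
    for name, run in [("a", route_a), ("b", route_b), ("c", route_c)]:
        print(name, evaluate(run, expected, forbidden_tools=banned))
```

Routes `a` and `b` take different paths and are both accepted, while route `c` reaches the same final state but fails the process check. A test that asserted an exact tool sequence would have rejected one of the two valid routes.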