evaluation

Evaluation Methods for Agent Systems

Evaluate agent systems differently from traditional software: agents make dynamic decisions, produce different outputs from run to run, and often have no single correct answer. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve their intended effects.
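As a minimal sketch of what accounting for non-determinism can look like, the example below runs the same prompt several times and scores each output against a set of property checks rather than one exact expected answer. The names `evaluate_case`, `toy_agent`, and `checks` are hypothetical, introduced only for illustration; the stand-in agent simply picks a canned reply at random.

```python
import random
from statistics import mean
from typing import Callable

Check = Callable[[str], bool]


def evaluate_case(agent: Callable[[str], str], prompt: str,
                  checks: list[Check], runs: int = 5) -> float:
    """Return the mean fraction of checks passed across repeated runs."""
    scores = []
    for _ in range(runs):
        output = agent(prompt)
        passed = sum(1 for check in checks if check(output))
        scores.append(passed / len(checks))
    return mean(scores)


# Hypothetical stand-in for a real agent: replies vary between runs.
def toy_agent(prompt: str) -> str:
    return random.choice([
        "Paris is the capital of France.",
        "The capital is Paris.",
        "I am not sure.",
    ])


# Property checks accept any output that satisfies them,
# so several differently worded answers can all score well.
checks: list[Check] = [
    lambda out: "Paris" in out,   # mentions the expected fact
    lambda out: len(out) < 200,   # stays concise
]

score = evaluate_case(toy_agent, "What is the capital of France?", checks, runs=20)
print(f"mean check pass rate over 20 runs: {score:.2f}")
```

Averaging over repeated runs gives a rate instead of a single pass/fail result, which is easier to track over time when individual runs disagree.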

When to Activate

Activate this skill when:

Testing agent performance systematically
Validating context engineering choices
Measuring improvements over time
Catching regressions before deployment
Building quality gates for agent pipelines
Comparing different agent configurations
Evaluating production systems continuously

Do not activate this skill for adjacent work owned by other skills:

Designing the LLM judge itself, pairwise comparison, judge calibration, or bias mitigation: advanced-evaluation.
Debugging a specific context failure mode before measuring it: context-degradation.
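The regression-catching and quality-gate use cases listed above could be sketched roughly as follows, assuming each test case already has a score between 0 and 1 for a baseline and a candidate configuration. The function `regression_gate`, the case names, the scores, and the 0.05 tolerance are all hypothetical values chosen for illustration.

```python
def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float],
                    tolerance: float = 0.05) -> tuple[bool, list[str]]:
    """Flag cases where the candidate scores worse than baseline by more than tolerance.

    A case missing from the candidate is treated as a score of 0.0.
    """
    regressed = [
        case for case, base_score in baseline.items()
        if candidate.get(case, 0.0) < base_score - tolerance
    ]
    return (not regressed, sorted(regressed))


# Hypothetical per-case scores for two agent configurations.
baseline = {"lookup": 0.90, "multi_step": 0.70, "tool_use": 0.80}
candidate = {"lookup": 0.92, "multi_step": 0.60, "tool_use": 0.78}

passed, regressed = regression_gate(baseline, candidate)
print("gate passed:", passed)
print("regressed cases:", regressed)
```

The tolerance leaves room for run-to-run noise, so a small drop such as `tool_use` here does not block a deployment, while a larger drop such as `multi_step` does.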

Repository

shipshitdev/library

First Seen

Jan 20, 2026
