skill-evaluation

Skill Evaluation Expert

When invoked, you operate with specialized knowledge in evaluating Claude Code skills systematically.

This expertise synthesizes evaluation methodologies from Anthropic and OpenAI into a unified framework. Where the sources disagree, Anthropic guidance takes precedence for Claude-specific concerns.

Knowledge Base Summary

  • Define before building: Write SMART success criteria (Specific, Measurable, Achievable, Relevant, Time-bound) across multiple dimensions before touching any skill code -- the eval is the specification
  • Four-category test datasets: Explicit triggers, implicit triggers, contextual triggers, and negative controls (~25% of cases) prevent both missed activations and false activations; see the dataset sketch after this list
  • Layer grading by cost: Deterministic checks first (fast, cheap, unambiguous), LLM-as-judge second (moderate cost, high nuance), human evaluation only for calibration; see the grading sketch after this list
  • Observable behavior over text quality: Grade what the skill makes Claude do (commands, tools, files, sequence), not what it makes Claude say
  • Volume beats perfection: 100 automated tests with 80% grading accuracy catch more failures than 10 hand-graded perfect tests
  • Expand from reality: Start with 10-20 test cases, grow from real production failures, not speculative edge cases
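
A minimal sketch of the four-category dataset, assuming each test case records a prompt, its trigger category, and whether the skill is expected to activate. The names TriggerCase, should_activate, and check_balance are illustrative, not part of the skill itself.

```python
# Hypothetical test-case record for the four trigger categories; the ~25%
# negative-control share follows the guideline above.
from dataclasses import dataclass

@dataclass
class TriggerCase:
    prompt: str
    category: str          # "explicit" | "implicit" | "contextual" | "negative"
    should_activate: bool  # negative controls must NOT activate the skill

DATASET = [
    TriggerCase("Use the skill-evaluation skill to grade this run", "explicit", True),
    TriggerCase("How do I know whether my new skill actually works?", "implicit", True),
    TriggerCase("My skill passes locally but never fires in CI -- diagnose it", "contextual", True),
    TriggerCase("Write a haiku about autumn", "negative", False),
]

def check_balance(cases, target_negative=0.25, tolerance=0.10):
    """Flag datasets whose negative-control share drifts far from ~25%."""
    share = sum(not c.should_activate for c in cases) / len(cases)
    return abs(share - target_negative) <= tolerance
```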

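A minimal sketch of the cost-layered grading order, assuming a transcript dict that exposes the commands Claude ran and the final text output. The grade function and the llm_judge callable are placeholders for whatever harness and judge are actually in use.

```python
# Hypothetical two-layer grader: cheap deterministic checks gate the more
# expensive LLM-as-judge call; human review only recalibrates the judge.
DANGEROUS = ("rm -rf", "git push --force", "chmod 777")

def deterministic_check(transcript: dict) -> bool:
    """Layer 1: fast, cheap, unambiguous -- fail on any dangerous command."""
    return not any(d in cmd for cmd in transcript["commands"] for d in DANGEROUS)

def grade(transcript: dict, llm_judge) -> dict:
    if not deterministic_check(transcript):
        return {"passed": False, "layer": "deterministic"}
    # Layer 2: nuanced rubric scoring, only for transcripts that survive layer 1.
    verdict = llm_judge(transcript["output"])  # assumed to return {"score": float}
    return {"passed": verdict["score"] >= 0.7, "layer": "llm_judge"}
    # Layer 3 (human) periodically calibrates the judge rather than grading every run.
```
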
Core Philosophy

Observable behavior is ground truth. A skill that produces eloquent text while suggesting dangerous commands is failing. Grade execution traces -- commands run, tools invoked, files modified, step sequence -- before assessing text quality. Text quality is secondary and should only be evaluated after behavior passes.
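
A minimal sketch of behavior-first grading, assuming the execution trace records commands run, files modified, and step order. The trace and expected field names, and the grade_behavior helper, are illustrative rather than a fixed schema.

```python
# Hypothetical trace grader: every check is over observable behavior; text
# quality is assessed separately, and only after this passes.
def is_subsequence(required, observed):
    """True if the required steps appear, in order, within the observed steps."""
    it = iter(observed)
    return all(step in it for step in required)

def grade_behavior(trace: dict, expected: dict) -> dict:
    results = {
        # Were all required commands actually run?
        "commands_ok": set(expected["commands"]) <= set(trace["commands"]),
        # Were only permitted files modified?
        "files_ok": set(trace["files_modified"]) <= set(expected["allowed_files"]),
        # Did the key steps happen in the required sequence?
        "sequence_ok": is_subsequence(expected["step_order"], trace["steps"]),
    }
    results["passed"] = all(results.values())
    return results
```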
