skill-evaluation
Skill Evaluation Expert
When invoked, you operate with specialized knowledge in evaluating Claude Code skills systematically.
This expertise synthesizes evaluation methodologies from Anthropic and OpenAI into a unified framework. Where the sources disagree, Anthropic guidance takes precedence for Claude-specific concerns.
Knowledge Base Summary
- Define before building: Write SMART success criteria (Specific, Measurable, Achievable, Relevant, Time-bound) across multiple dimensions before touching any skill code -- the eval is the specification
- Four-category test datasets: Explicit triggers, implicit triggers, contextual triggers, and negative controls (~25% of the dataset) prevent both missed activations and false activations
- Layer grading by cost: Deterministic checks first (fast, cheap, unambiguous), LLM-as-judge second (moderate cost, high nuance), human evaluation only for calibration
- Observable behavior over text quality: Grade what the skill makes Claude do (commands, tools, files, sequence), not what it makes Claude say
- Volume beats perfection: 100 automated tests with 80% grading accuracy catch more failures than 10 hand-graded perfect tests
- Expand from reality: Start with 10-20 test cases, grow from real production failures, not speculative edge cases
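The four trigger categories and the ~25% negative-control ratio above can be checked mechanically before any grading runs. A minimal sketch, assuming a simple `TestCase` record; the field names, category labels, and example prompts are illustrative, not a prescribed format:

```python
from dataclasses import dataclass

CATEGORIES = {"explicit", "implicit", "contextual", "negative"}

@dataclass
class TestCase:
    prompt: str            # user input fed to the agent
    category: str          # one of CATEGORIES
    should_activate: bool  # negative controls must NOT trigger the skill

def validate_dataset(cases, min_negative=0.20, max_negative=0.30):
    """Check category labels and that roughly 25% are negative controls."""
    assert cases, "dataset is empty"
    for c in cases:
        assert c.category in CATEGORIES, f"unknown category: {c.category}"
        # Only negative controls are flagged as non-activating
        assert c.should_activate == (c.category != "negative")
    ratio = sum(c.category == "negative" for c in cases) / len(cases)
    assert min_negative <= ratio <= max_negative, f"negative ratio {ratio:.0%}"
    return ratio

cases = (
    [TestCase("run the skill eval on my PR", "explicit", True)] * 3
    + [TestCase("is this skill ready to ship?", "implicit", True)] * 3
    + [TestCase("we're reviewing agent quality this sprint", "contextual", True)] * 3
    + [TestCase("what's the weather in Lisbon?", "negative", False)] * 3
)
ratio = validate_dataset(cases)  # 3 of 12 cases are negative controls
```

Enforcing the ratio as a hard check keeps the dataset honest as it grows from 10-20 seed cases toward production-derived failures.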
Core Philosophy
Observable behavior is ground truth. A skill that produces eloquent text while suggesting dangerous commands is failing. Grade execution traces -- commands run, tools invoked, files modified, step sequence -- before assessing text quality. Text quality is secondary and should only be evaluated after behavior passes.
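Grading the execution trace first can be a set of cheap deterministic checks over recorded actions, run before any text-quality judge. A sketch under stated assumptions -- the trace fields, the dangerous-command patterns, and the tool names are hypothetical examples, not a fixed schema:

```python
import re

# Illustrative patterns for commands a skill should never cause to run
DANGEROUS = [r"\brm\s+-rf\s+/", r"\bgit\s+push\s+--force\b", r"curl\s+[^|]*\|\s*sh"]

def grade_behavior(trace, required_tools):
    """Deterministic pass/fail on observable behavior, ignoring text quality.

    trace = {"commands": [...], "tools": [...], "files_modified": [...]}
    Returns a list of failure messages; empty list means behavior passed.
    """
    failures = []
    for cmd in trace["commands"]:
        if any(re.search(p, cmd) for p in DANGEROUS):
            failures.append(f"dangerous command: {cmd}")
    # Required tools must appear in order (subsequence check over an iterator)
    remaining = iter(trace["tools"])
    if not all(tool in remaining for tool in required_tools):
        failures.append(f"tool sequence missing or out of order: {required_tools}")
    return failures

trace = {
    "commands": ["pytest -q", "git status"],
    "tools": ["Read", "Bash", "Edit"],
    "files_modified": ["SKILL.md"],
}
ok = grade_behavior(trace, ["Read", "Edit"])
bad = grade_behavior(dict(trace, commands=["rm -rf / --tmp"]), ["Read", "Edit"])
```

Only traces with an empty failure list would proceed to the slower LLM-as-judge layer for text quality.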