# agent:eval

Agent Evaluation System
Guides the user through building a comprehensive evaluation system for their AI agent. Applies patterns 10-17 from "Patterns for Building AI Agents" (Bhagwat & Gienow, 2025): failure mode taxonomy, business metrics, cross-referencing, iterating against evals, test suites, SME labeling, production datasets, and live evaluation.
## When to use
Use this skill when the user needs to:
- Define what "good" looks like for an AI agent
- Create a failure mode taxonomy
- Set up business metrics for agent performance
- Build an evaluation test suite
- Design SME labeling workflows
- Plan production data evaluation pipelines
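The patterns above can be sketched as a tiny eval harness: a failure mode taxonomy as an enum, labeled test cases, and a pass rate rolled up per failure mode. All names here (`FailureMode`, `EvalCase`, `run_agent`, `EVAL_CASES`) are illustrative assumptions, not part of the book's patterns or this skill's output.

```python
# Minimal sketch of an agent eval harness. The agent is stubbed out;
# the shape (taxonomy -> labeled cases -> per-mode pass rates) is the point.
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    HALLUCINATION = "hallucination"  # fabricated facts
    WRONG_TOOL = "wrong_tool"        # called an inappropriate tool
    INCOMPLETE = "incomplete"        # stopped before finishing the task
    FORMAT_ERROR = "format_error"    # output violates the expected schema


@dataclass
class EvalCase:
    prompt: str
    must_contain: str          # crude pass criterion, good enough for a sketch
    failure_mode: FailureMode  # which failure mode this case probes


def run_agent(prompt: str) -> str:
    # Stand-in for the real agent under test; it just echoes the prompt.
    return f"echo: {prompt}"


EVAL_CASES = [
    EvalCase("What is 2 + 2?", "4", FailureMode.HALLUCINATION),
    EvalCase("Summarize: the sky is blue.", "sky", FailureMode.INCOMPLETE),
]


def run_suite(cases):
    results = {}
    for case in cases:
        output = run_agent(case.prompt)
        passed = case.must_contain in output
        results.setdefault(case.failure_mode, []).append(passed)
    # Pass rate per failure mode; business metrics roll up from here.
    return {mode.value: sum(v) / len(v) for mode, v in results.items()}


print(run_suite(EVAL_CASES))
```

With the echo stub, the arithmetic case fails (the output never contains "4") while the summary case passes, so the report reads `{'hallucination': 0.0, 'incomplete': 1.0}`. In a real suite, SME-labeled production transcripts would replace the hand-written cases.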
## Instructions

### Step 1: Understand the Agent
Use the AskUserQuestion tool to gather context: