# Agent Evaluation (AI Agent Evals)

Based on Anthropic's "Demystifying evals for AI agents".
## When to use this skill
- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
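As a concrete illustration of the "creating graders" use case above, here is a minimal sketch of two code-based graders and a tiny eval runner. The function names and schema are assumptions for illustration, not an API defined by this skill.

```python
# Code-based graders check agent output deterministically, in contrast to
# model-based graders (an LLM judges the output) or human graders.

def exact_match_grader(output: str, expected: str) -> bool:
    """Pass iff the agent's output matches the expected answer exactly
    (after trimming surrounding whitespace)."""
    return output.strip() == expected.strip()

def contains_grader(output: str, required: list[str]) -> bool:
    """Pass iff every required substring appears in the output."""
    return all(s in output for s in required)

def run_eval(cases: list[tuple[str, str]], grader) -> float:
    """Grade a batch of (output, expected) pairs and return the pass rate."""
    results = [grader(out, exp) for out, exp in cases]
    return sum(results) / len(results)

# Example: the second case fails on letter case, so the pass rate is 0.5.
cases = [("Paris", "Paris"), ("paris", "Paris")]
print(run_eval(cases, exact_match_grader))  # 0.5
```

Code-based graders are cheap and reproducible, which makes them the natural first choice for CI/CD pipelines; model-based graders are typically reserved for open-ended outputs that resist string matching.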
## Core Concepts

Eval Evolution: Single-turn → Multi-turn → Agentic

- Single-turn: one prompt, one response, graded in isolation.
- Multi-turn: a full conversation, graded against the intended outcome.
- Agentic: a task in an environment, graded on the end state and the trajectory (e.g. tool calls), not just the final text.
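The three stages of eval evolution can be sketched as data shapes. These schemas are assumptions for illustration only; the skill does not prescribe a format.

```python
# Single-turn: one prompt, one expected response, one grade.
single_turn = {"prompt": "What is 2+2?", "expected": "4"}

def grade_single_turn(response: str, case: dict) -> bool:
    """Grade a single response against a single-turn case."""
    return response.strip() == case["expected"]

# Multi-turn: a whole conversation transcript, graded against the
# outcome the conversation was supposed to reach.
multi_turn = {
    "turns": [
        {"role": "user", "content": "Book a table for two."},
        {"role": "assistant", "content": "What time would you like?"},
        {"role": "user", "content": "7pm."},
    ],
    "expected_outcome": "reservation confirmed for 7pm",
}

# Agentic: a task plus an environment; the grader inspects the end
# state and/or the trajectory (tool calls), not just generated text.
agentic = {
    "task": "Fix the failing test in the repo",
    "grader": "run the test suite; pass iff all tests are green",
}
```

Each step up the ladder shifts grading away from string comparison and toward verifying outcomes, which is why agentic evals usually need an executable environment rather than a static answer key.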