# agent-evaluation
Comprehensive framework for designing, building, and monitoring evaluations of AI agents, covering coding, conversational, research, and computer-use agents.
- Covers three grader types (code-based, model-based, human) with trade-offs and best practices for each agent category
- Provides an 8-step roadmap from initial task creation through production monitoring, including environment isolation, outcome-focused grading, and saturation detection
- Includes benchmarks for major agent types: SWE-bench for coding, WebArena for computer use, τ2-Bench for conversational agents
- Offers CI/CD integration patterns, A/B testing templates, and production sampling strategies for real-time quality monitoring
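Outcome-focused grading, mentioned in the roadmap above, can be sketched as a code-based grader: score only the final artifact the agent produced, not the path it took. This is a minimal illustration with invented names (`grade_outcome`, a `workspace` dict, a `solution.py` convention), not an API from the framework itself, and it omits the sandboxing a real harness would need.

```python
# Hypothetical code-based grader: score an agent run by its final
# outcome (does the produced code behave correctly?), ignoring the
# intermediate steps the agent took. All names are illustrative.

def _passes(fn, x, expected) -> bool:
    """Run one test case, treating any exception as a failure."""
    try:
        return fn(x) == expected
    except Exception:
        return False

def grade_outcome(workspace: dict, test_cases: list) -> float:
    """Return the fraction of test cases the agent's solution passes.

    `workspace` maps filenames to file contents; the agent is expected
    to have written a `solution.py` defining `solve(x)`.
    """
    source = workspace.get("solution.py")
    if source is None:
        return 0.0  # agent never produced the file: outcome failed
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: a real grader would sandbox this
        solve = namespace["solve"]
    except Exception:
        return 0.0
    passed = sum(1 for x, expected in test_cases if _passes(solve, x, expected))
    return passed / len(test_cases)
```

Because the grader inspects only the end state, two agents that reach the same working solution by very different trajectories receive the same score, which is the point of outcome-focused grading.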
## Agent Evaluation (AI Agent Evals)
Based on Anthropic's "Demystifying evals for AI agents".
## When to use this skill
- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time
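For the grader-creation use case above, a model-based (LLM-as-judge) grader can be sketched as follows. The completion call is deliberately abstracted behind an injected `ask_model` callable rather than any real provider API, so the rubric-handling logic is the only thing this sketch commits to; `RUBRIC` and `model_grade` are illustrative names.

```python
# Sketch of a model-based grader. `ask_model` is a placeholder for
# whatever completion API you use, injected as a plain callable so the
# parsing and validation logic stays testable on its own.

RUBRIC = """Score the agent transcript from 1-5:
5 = task fully completed, 1 = no progress.
Reply with a single integer."""

def model_grade(transcript: str, ask_model) -> int:
    """Ask a judge model to score a transcript against the rubric."""
    prompt = f"{RUBRIC}\n\nTranscript:\n{transcript}"
    reply = ask_model(prompt).strip()
    try:
        score = int(reply)
    except ValueError:
        raise ValueError(f"judge returned a non-integer reply: {reply!r}")
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return score
```

Validating the judge's reply (integer, in range) matters in practice: model-based graders fail noisily far more often than code-based ones, and rejecting malformed scores keeps those failures out of your aggregate metrics.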
## Core Concepts
### Eval Evolution: Single-turn → Multi-turn → Agentic