agent-evaluation
Framework for testing LLM agents across behavioral, capability, and reliability dimensions with production-focused evaluation patterns.
- Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
- Emphasizes statistical test evaluation (multiple runs with distribution analysis) and behavioral contract testing over single-run or string-matching approaches
- Includes adversarial testing patterns and guards against common pitfalls like benchmark overfitting, flaky tests, and data leakage
- Designed to catch production failures that benchmarks miss, recognizing that LLM agent evaluation requires non-deterministic result handling
Agent Evaluation
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks
Capabilities
- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing
Prerequisites
- Knowledge: Testing methodologies, Statistical analysis basics, LLM behavior patterns
- Skills_recommended: autonomous-agents, multi-agent-orchestration
- Required skills: testing-fundamentals, llm-fundamentals
Scope
More from sickn33/antigravity-awesome-skills
docker-expert
You are an advanced Docker containerization expert with comprehensive, practical knowledge of container optimization, security hardening, multi-stage builds, orchestration patterns, and production deployment strategies based on current industry best practices.
15.0Knodejs-best-practices
Node.js development principles and decision-making. Framework selection, async patterns, security, and architecture. Teaches thinking, not copying.
11.2Ktypescript-expert
TypeScript and JavaScript expert with deep knowledge of type-level programming, performance optimization, monorepo management, migration strategies, and modern tooling.
8.3Kapi-security-best-practices
Implement secure API design patterns including authentication, authorization, input validation, rate limiting, and protection against common API vulnerabilities
7.0Kclean-code
This skill embodies the principles of \"Clean Code\" by Robert C. Martin (Uncle Bob). Use it to transform \"code that works\" into \"code that is clean.\"
6.6Knextjs-best-practices
Next.js App Router principles. Server Components, data fetching, routing patterns.
5.2K