agent-evaluation

Summary

Framework for testing LLM agents across behavioral, capability, and reliability dimensions with production-focused evaluation patterns.

  • Covers five core evaluation areas: agent testing, benchmark design, capability assessment, reliability metrics, and regression testing
  • Emphasizes statistical test evaluation (multiple runs with distribution analysis) and behavioral contract testing over single-run or string-matching approaches
  • Includes adversarial testing patterns and guards against common pitfalls like benchmark overfitting, flaky tests, and data leakage
  • Designed to catch production failures that benchmarks miss, recognizing that LLM agent evaluation requires non-deterministic result handling
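The statistical pattern named in the bullets above — multiple runs with distribution analysis instead of a single pass/fail — can be sketched as follows. This is an illustrative stand-in, not the skill's API: `run_agent`, the scoring scale, and the 0.5 pass threshold are all assumptions.

```python
# Sketch: statistical evaluation of a non-deterministic agent.
# run_agent is a stand-in that returns a task score in [0, 1].
import random
import statistics

def run_agent(task: str) -> float:
    """Stand-in for one agent run; real runs would call the agent under test."""
    return min(1.0, max(0.0, random.gauss(0.7, 0.15)))

def evaluate(task: str, runs: int = 20) -> dict:
    """Run the agent repeatedly and report the score distribution,
    not a single sample."""
    scores = [run_agent(task) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        # Fraction of runs clearing an assumed 0.5 pass threshold.
        "pass_rate": sum(s >= 0.5 for s in scores) / runs,
    }

random.seed(0)  # deterministic demo
report = evaluate("summarize-ticket")
```

Reporting mean, spread, and worst case together is what distinguishes distribution analysis from the single-run or string-matching approaches the summary warns against.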
SKILL.md

Agent Evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. On real-world benchmarks, even top agents score below 50%.
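As one concrete reliability metric, a pass^k-style estimate (the probability that an agent succeeds on k consecutive independent runs) can be computed from repeated trials. A minimal sketch, assuming a simple plug-in estimator; the function name and numbers are illustrative:

```python
# Sketch of a pass^k reliability estimate: if a single run succeeds with
# probability p, k independent runs all succeed with probability p**k.
def pass_k(successes: int, trials: int, k: int) -> float:
    p = successes / trials  # empirical per-run success rate
    return p ** k           # chance of k consecutive successes

# An agent passing 90% of single runs clears 8 consecutive runs
# only about 43% of the time.
rate = pass_k(successes=9, trials=10, k=8)
```

This is why a headline single-run score can look strong while the agent remains too unreliable for production use.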

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Prerequisites

  • Knowledge: testing methodologies, basic statistical analysis, LLM behavior patterns
  • Required skills: testing-fundamentals, llm-fundamentals
  • Recommended skills: autonomous-agents, multi-agent-orchestration

Installs: 592
GitHub Stars: 37.3K
First Seen: Jan 19, 2026