agent-evaluation
Agent Evaluation
You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it
Capabilities
- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing
More from hainamchung/agent-assistant
spring-boot-engineer
Use when building Spring Boot 3.x applications, microservices, or reactive Java applications. Invoke for Spring Data JPA, Spring Security 6, WebFlux, Spring Cloud integration.
17embedded-systems
Use when developing firmware for microcontrollers, implementing RTOS applications, or optimizing power consumption. Invoke for STM32, ESP32, FreeRTOS, bare-metal, power optimization, real-time systems.
13expo-app-design
Build beautiful cross-platform mobile apps with Expo Router, NativeWind, and React Native.
13vulnerability-scanner
Advanced vulnerability analysis principles. OWASP 2025, Supply Chain Security, attack surface mapping, risk prioritization.
12copywriting
>
11cpp-pro
Write idiomatic C++ code with modern features, RAII, smart pointers, and STL algorithms. Handles templates, move semantics, and performance optimization.
11