LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, and scalable evaluation using computed scores (e.g., exact match, token overlap, or embedding similarity).
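As a minimal sketch of what an automated metric looks like in practice, the snippet below computes exact match and token-level F1 over a hand-made list of (output, reference) pairs. The dataset, normalization rules, and function names are illustrative assumptions, not the API of any particular evaluation framework:

```python
# Sketch of two common automated metrics: exact match and token-level F1.
# The (output, reference) pairs and normalization below are illustrative.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase and whitespace-split; real evaluators often also strip punctuation.
    return text.lower().split()

def exact_match(output: str, reference: str) -> float:
    # 1.0 if the normalized output equals the normalized reference, else 0.0.
    return float(normalize(output) == normalize(reference))

def token_f1(output: str, reference: str) -> float:
    # Harmonic mean of token precision and recall (SQuAD-style scoring).
    out_tokens, ref_tokens = normalize(output), normalize(reference)
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical evaluation loop over a tiny hand-made test set.
pairs = [
    ("Paris is the capital of France.", "The capital of France is Paris."),
    ("42", "42"),
]
em = sum(exact_match(o, r) for o, r in pairs) / len(pairs)
f1 = sum(token_f1(o, r) for o, r in pairs) / len(pairs)
print(f"exact match: {em:.2f}  token F1: {f1:.2f}")
```

Because these scores are deterministic and cheap to compute, they can run on every commit, which is what makes automated metrics suitable for regression detection and baseline tracking.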