# LLM Evaluation & Testing
Comprehensive guide to evaluating and testing LLM applications, including prompt testing, output validation, hallucination detection, benchmark creation, A/B testing, and quality metrics.
## Quick Reference

**When to use this skill:**
- Testing LLM application outputs (a minimal sketch follows this list)
- Validating prompt quality and consistency
- Detecting hallucinations and factual errors
- Creating evaluation benchmarks
- A/B testing prompts or models
- Implementing continuous evaluation (CI/CD)
- Measuring retrieval quality (for RAG)
- Debugging unexpected LLM behavior
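
To make the first use case concrete, here is a minimal sketch of a keyword-based output check against a small set of benchmark cases. The names `EvalCase`, `keyword_score`, `run_eval`, and `fake_model` are illustrative assumptions, not part of any particular evaluation framework.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One benchmark item: a prompt plus keywords the answer must mention."""
    prompt: str
    expected_keywords: List[str]


def keyword_score(output: str, expected_keywords: List[str]) -> float:
    """Fraction of expected keywords found in the output (case-insensitive)."""
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords) if expected_keywords else 1.0


def run_eval(cases: List[EvalCase], generate: Callable[[str], str]) -> float:
    """Run every case through a `generate(prompt) -> str` callable and average the scores."""
    scores = [keyword_score(generate(case.prompt), case.expected_keywords) for case in cases]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical stand-in for a real model call (e.g. an API client).
    def fake_model(prompt: str) -> str:
        return "The capital of France is Paris."

    cases = [
        EvalCase("What is the capital of France?", ["Paris"]),
        EvalCase("Name two primary colors.", ["red", "blue"]),
    ]
    print(f"Mean keyword score: {run_eval(cases, fake_model):.2f}")
```

Keyword matching is only a starting point; the sections below cover stronger checks such as semantic similarity, model-graded rubrics, and retrieval-grounded hallucination detection.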