AI System Evaluation
Evaluating AI systems end-to-end.
Evaluation Criteria
1. Domain-Specific Capability
| Domain | Benchmarks |
|---|---|
| Math & Reasoning | GSM-8K, MATH |
| Code | HumanEval, MBPP |
| Knowledge | MMLU, ARC |
| Multi-turn Chat | MT-Bench |
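
As a minimal sketch of how the table above might be used when assembling an evaluation run, the mapping can be kept as a small lookup. The names `BENCHMARKS_BY_DOMAIN` and `select_benchmarks` are hypothetical and are not part of any library or of this skill; only the domain and benchmark names come from the table.

```python
# Hypothetical registry mirroring the domain/benchmark table (illustrative only).
BENCHMARKS_BY_DOMAIN = {
    "Math & Reasoning": ["GSM-8K", "MATH"],
    "Code": ["HumanEval", "MBPP"],
    "Knowledge": ["MMLU", "ARC"],
    "Multi-turn Chat": ["MT-Bench"],
}


def select_benchmarks(domains):
    """Return benchmarks for the requested domains, in request order, without duplicates."""
    selected = []
    for domain in domains:
        if domain not in BENCHMARKS_BY_DOMAIN:
            # Fail loudly rather than silently skipping an unrecognized domain.
            raise KeyError(f"Unknown domain: {domain!r}")
        for name in BENCHMARKS_BY_DOMAIN[domain]:
            if name not in selected:
                selected.append(name)
    return selected


if __name__ == "__main__":
    # Example: a system meant for coding help and factual Q&A.
    print(select_benchmarks(["Code", "Knowledge"]))
```

Raising on an unknown domain is a design choice in this sketch: a misspelled domain would otherwise drop a whole capability area from the evaluation without any warning.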