agent-evaluation


Agent Evaluation Methods

Evaluating agents requires different approaches than testing traditional software. Agents are non-deterministic, may take different valid paths to the same goal, and often lack a single correct answer.
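
Because runs are non-deterministic and several paths can be valid, one way to evaluate is to judge only the final outcome and report a pass rate over repeated trials. The following is a minimal sketch under stated assumptions: `pass_rate`, `toy_agent`, and `check_capital` are hypothetical names introduced for illustration, and the toy agent merely stands in for a real one.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str, random.Random], str],
              task: str,
              check: Callable[[str], bool],
              trials: int = 20,
              seed: int = 0) -> float:
    """Run the agent `trials` times and return the fraction of passing outcomes.

    Only the final answer is checked, so any valid path counts as a pass.
    """
    rng = random.Random(seed)  # seeded so the evaluation itself is repeatable
    passed = sum(check(agent(task, rng)) for _ in range(trials))
    return passed / trials


def toy_agent(task: str, rng: random.Random) -> str:
    """Hypothetical stand-in agent: picks one of several paths at random."""
    path = rng.choice(["search-first", "browse-first", "guess"])
    # Two different paths reach the right answer; guessing does not.
    return "Paris" if path != "guess" else "unknown"


def check_capital(answer: str) -> bool:
    """Outcome check: compares the final answer, ignoring how it was reached."""
    return answer.strip().lower() == "paris"


rate = pass_rate(toy_agent, "What is the capital of France?", check_capital)
print(f"pass rate over 20 trials: {rate:.2f}")
```

Reporting a rate rather than a single pass/fail keeps one unlucky run from deciding the result.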

Key Finding: 95% Performance Drivers

Research on the BrowseComp benchmark found that three factors explain 95% of the performance variance:

| Factor | Variance | Implication |
|--------|----------|-------------|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |

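Implications: Upgrading the model can yield a larger gain than raising the token budget on the same model. The findings also support multi-agent architectures, which can spend more tokens by distributing work across agents.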
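
As a quick arithmetic check, the three shares quoted above add up to the 95% headline figure. The sketch below treats the approximate values (~10%, ~5%) as exact, which is an assumption, and the dictionary keys are illustrative names only.

```python
# Variance shares in percent, taken from the figures quoted above.
# The tool-call and model-choice shares are approximate in the source.
variance_share_pct = {"token_usage": 80, "tool_calls": 10, "model_choice": 5}

explained = sum(variance_share_pct.values())
unexplained = 100 - explained

print(f"explained: {explained}%, unexplained: {unexplained}%")
```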

Multi-Dimensional Rubric

| Dimension | Excellent | Good | Acceptable | Failed |
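
A rubric with these four levels can be collapsed into a single score by mapping each level to a number and taking a weighted average across dimensions. The sketch below is illustrative only: the dimension names (`accuracy`, `completeness`, `efficiency`), the weights, and the level-to-score mapping are assumptions, not values from this skill.

```python
# Assumed numeric value for each rubric level (not specified in the source).
LEVEL_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}


def rubric_score(ratings: dict[str, str], weights: dict[str, float]) -> float:
    """Return the weighted average of per-dimension level scores, in [0, 1]."""
    if set(ratings) != set(weights):
        raise ValueError("ratings and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(weights[dim] * LEVEL_SCORES[ratings[dim].lower()]
                   for dim in ratings)
    return weighted / total_weight


# Hypothetical dimensions and weights, for illustration only.
weights = {"accuracy": 0.5, "completeness": 0.3, "efficiency": 0.2}
ratings = {"accuracy": "Excellent", "completeness": "Good", "efficiency": "Acceptable"}

score = rubric_score(ratings, weights)
print(f"overall rubric score: {score:.3f}")
```

Keeping the per-dimension ratings alongside the aggregate is usually worthwhile, since a single number hides which dimension failed.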

Repository
eyadsibai/ltk
First Seen
Jan 28, 2026