agent-evaluation
Agent Evaluation Methods
Evaluating agents requires different approaches than testing traditional software. Agents are non-deterministic, may take different valid paths to the same goal, and often lack a single correct answer.
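Because runs can differ and several paths can be valid, one option is to judge the outcome over repeated runs rather than compare a single run against a fixed expected string. The following is a minimal sketch under that assumption; `pass_rate`, `toy_agent`, and `mentions_paris` are hypothetical names introduced here for illustration and are not part of any library.

```python
import random
from typing import Callable


def pass_rate(
    agent: Callable[[str], str],
    task: str,
    is_acceptable: Callable[[str], bool],
    runs: int = 10,
) -> float:
    """Run the agent several times and return the fraction of acceptable outcomes."""
    passed = sum(1 for _ in range(runs) if is_acceptable(agent(task)))
    return passed / runs


# Hypothetical stand-in for a non-deterministic agent: the wording varies per run.
_rng = random.Random(0)


def toy_agent(task: str) -> str:
    return _rng.choice(
        ["Paris", "The capital is Paris.", "paris, France", "I think it is Lyon."]
    )


def mentions_paris(answer: str) -> bool:
    # Outcome check: accepts any phrasing that contains the right answer.
    return "paris" in answer.lower()


rate = pass_rate(toy_agent, "What is the capital of France?", mentions_paris, runs=20)
print(f"pass rate over 20 runs: {rate:.2f}")
```

The check accepts several phrasings of a correct answer, so differently worded but valid runs all count as passes.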
Key Finding: 95% Performance Drivers
Research on the BrowseComp benchmark found that three factors explain 95% of the variance in performance:
| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications: Upgrading the model yields larger gains than raising the token budget alone, and the findings support multi-agent architectures.
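As a quick check of the arithmetic in the table, the three shares can be summed and ranked. This sketch treats the approximate values (~10%, ~5%) as exact integers, which is an assumption made only to keep the example simple; the variable names are illustrative.

```python
# Shares of variance explained, in percent, taken from the table above.
# The "~" values are treated as exact here (an assumption for illustration).
variance_explained_pct = {
    "token usage": 80,
    "tool calls": 10,
    "model choice": 5,
}

explained = sum(variance_explained_pct.values())
unexplained = 100 - explained

# Rank the factors from largest to smallest share.
ranked = sorted(variance_explained_pct, key=variance_explained_pct.get, reverse=True)

for factor in ranked:
    print(f"{factor}: {variance_explained_pct[factor]}%")
print(f"explained: {explained}%, unexplained: {unexplained}%")
```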
Multi-Dimensional Rubric
| Dimension | Excellent | Good | Acceptable | Failed |
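A rubric with these four levels could be turned into a single score roughly as follows. This is a sketch only: the level names come from the table header above, while the numeric values per level, the dimension names (`accuracy`, `completeness`, `efficiency`), and the weights are hypothetical placeholders chosen for illustration.

```python
# Numeric value assigned to each rubric level (assumed mapping, not from the source).
LEVEL_SCORES = {
    "Excellent": 1.0,
    "Good": 0.75,
    "Acceptable": 0.5,
    "Failed": 0.0,
}

# Hypothetical dimensions and weights; replace with the rubric's real rows.
WEIGHTS = {
    "accuracy": 0.5,
    "completeness": 0.25,
    "efficiency": 0.25,
}


def rubric_score(ratings: dict[str, str], weights: dict[str, float]) -> float:
    """Return the weighted average of per-dimension level scores, in [0, 1]."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted = sum(LEVEL_SCORES[ratings[dim]] * w for dim, w in weights.items())
    return weighted / total_weight


ratings = {"accuracy": "Excellent", "completeness": "Good", "efficiency": "Acceptable"}
score = rubric_score(ratings, WEIGHTS)
print(f"overall rubric score: {score:.4f}")
```

Scoring each dimension separately keeps a failure in one area visible even when the overall score looks acceptable.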