devtu-benchmark-harness


Benchmark Harness — Continuous Improvement System

A 5-step feedback loop for improving the quality of ToolUniverse tools, skills, and plugins.

Note: This skill is dataset-agnostic. Per-benchmark score history, known-failing question IDs, and dataset-specific investigations belong in temp_docs_and_tests/benchmark_tracking/ (gitignored workfolder), NOT in this skill directory.

The Feedback Loop

1. RUN benchmark → 2. ANALYZE results → 3. DIAGNOSE failures → 4. FIX via devtu skill → 5. RETEST → repeat
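As a rough illustration of the loop above, here is a minimal Python sketch. It is not the harness's actual code: `Result`, `feedback_loop`, and the `run`, `diagnose`, and `fix` callables are hypothetical names chosen for this example, and the "benchmark" is a toy in-memory set of question IDs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Result:
    """One benchmark outcome (hypothetical shape, not the harness's own)."""
    question_id: str
    passed: bool


def feedback_loop(
    run: Callable[[], List[Result]],
    diagnose: Callable[[Result], str],
    fix: Callable[[Result, str], None],
    max_rounds: int = 3,
) -> List[int]:
    """Repeat run -> analyze -> diagnose -> fix until nothing fails.

    Returns the failure count observed in each round.
    """
    history: List[int] = []
    for _ in range(max_rounds):
        results = run()  # 1. RUN (doubles as 5. RETEST after the first round)
        failures = [r for r in results if not r.passed]  # 2. ANALYZE
        history.append(len(failures))
        if not failures:
            break
        for failure in failures:
            cause = diagnose(failure)  # 3. DIAGNOSE
            fix(failure, cause)        # 4. FIX
    return history


# Toy stand-ins: two of three questions fail until they are "fixed".
broken = {"q2", "q3"}


def run() -> List[Result]:
    return [Result(q, q not in broken) for q in ("q1", "q2", "q3")]


def diagnose(failure: Result) -> str:
    return f"placeholder cause for {failure.question_id}"


def fix(failure: Result, cause: str) -> None:
    broken.discard(failure.question_id)


history = feedback_loop(run, diagnose, fix)
print(history)  # [2, 0]
```

Treating the retest as simply the next round's run keeps the loop to a single code path, and the per-round failure counts give a quick view of whether the fixes are converging.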

Orchestrated runner (preferred)

One command does steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), 4 (diagnose + extract failures):

First Seen
May 21, 2026
devtu-benchmark-harness — mims-harvard/tooluniverse