devtu-benchmark-harness
Benchmark Harness — Continuous Improvement System
A 5-step feedback loop for improving the quality of ToolUniverse tools, skills, and plugins.
Note: This skill is dataset-agnostic. Per-benchmark score history, known-failing question IDs, and dataset-specific investigations belong in temp_docs_and_tests/benchmark_tracking/ (a gitignored work folder), not in this skill directory.
The Feedback Loop
1. RUN benchmark → 2. ANALYZE results → 3. DIAGNOSE failures → 4. FIX via devtu skill → 5. RETEST → repeat
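The control flow of the loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the harness's actual API: `feedback_loop`, `LoopResult`, and the `run`, `diagnose`, and `fix` callables are hypothetical names, and a benchmark result is assumed to be a mapping from question ID to pass/fail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class LoopResult:
    """Outcome of the loop (hypothetical type, for illustration only)."""
    fix_rounds: int          # number of diagnose/fix/retest rounds performed
    failing: List[str]       # question IDs still failing after the last run


def feedback_loop(
    run: Callable[[], Dict[str, bool]],     # assumed: returns {question_id: passed}
    diagnose: Callable[[str], str],         # assumed: returns a root-cause label
    fix: Callable[[str, str], None],        # assumed: applies a fix for that cause
    max_rounds: int = 3,
) -> LoopResult:
    """Run, analyze, diagnose, fix, retest; stop when clean or out of rounds."""
    results = run()                                           # 1. RUN
    fix_rounds = 0
    while True:
        # 2. ANALYZE: collect the IDs that did not pass in the latest run.
        failing = sorted(qid for qid, ok in results.items() if not ok)
        if not failing or fix_rounds == max_rounds:
            return LoopResult(fix_rounds=fix_rounds, failing=failing)
        for qid in failing:
            cause = diagnose(qid)                             # 3. DIAGNOSE
            fix(qid, cause)                                   # 4. FIX
        results = run()                                       # 5. RETEST
        fix_rounds += 1
```

Every fix round ends with a retest, so the reported `failing` list always reflects the most recent run rather than an untested fix. The `max_rounds` cap is an assumption added here to keep the sketch from looping indefinitely when a failure cannot be fixed.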
Orchestrated runner (preferred)
One command runs steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), and 4 (diagnose and extract failures):