improvement-evaluator
Improvement Evaluator
Measures whether a Skill actually makes AI perform better on real tasks, not just whether the SKILL.md document looks well-structured.
Why Execution Testing Matters
Structural scoring (word count, section presence, formatting) correlates poorly with actual AI task performance. Internal benchmarks showed R²=0.00 between document-structure scores and execution pass rates across 40+ skill evaluations. A perfectly formatted SKILL.md can still produce failing task outputs if the instructions mislead the model or omit critical constraints.
Tradeoff: Execution testing is slower and more expensive than structural
checks because it invokes the AI model once per task. A 7-task suite at
pass@1 costs roughly 7 API calls per candidate plus 7 for the baseline.
This is acceptable because structural scoring alone gives no signal about
whether the skill actually works. To offset cost, the evaluator caches
baseline results for 7 days and supports --pass-k 1 (single attempt)
More from lanyasheng/auto-improvement-orchestrator-skill
skill-distill
|
1improvement-gate
当执行完变更需要验证是否应保留、候选被标记 pending 需要人工审批、或想查看待审队列时使用。6 层机械门禁: Schema→Compile→Lint→Regression→Review→HumanReview,其中 Schema/Compile/Regression/Review 为阻塞层(失败即拒绝),Lint 和 HumanReview 为建议层(失败不阻塞但记录警告)。不用于打分(用 improvement-discriminator)或执行变更(用 improvement-executor)。
1prompt-hardening
硬化 agent prompt、system prompt、SOUL.md、AGENTS.md、cron prompt 使 LLM 可靠遵循指令。触发词:agent 不听话、忽略规则、绕过约束、prompt 优化、指令合规、规则强化、prompt 硬化、LLM 不遵守、模型违规、creative circumvention。Use when agent ignores rules, disobeys instructions, bypasses tool constraints, needs prompt optimization, instruction compliance improvement, or rule hardening. 不适用于代码生成、代码审查、测试编写等执行型任务。参见 improvement-orchestrator (用于 skill 质量改进)、code-review-enhanced (用于代码审查)。
1benchmark-store
当需要初始化基准数据库、对比 skill 评分与历史基线、查看 Pareto front 是否有维度回退、或查阅质量分级标准时使用。不用于给候选打分(用 improvement-discriminator)或自动改进(用 improvement-learner)。
1skill-forge
>
1improvement-learner
当需要检查 skill 质量评分、自动优化 SKILL.md 结构、追踪评估分数变化趋势、或「评分低了想知道哪里扣分」时使用。6维结构评估 + HOT/WARM/COLD 三层记忆 + Pareto front。不用于候选语义打分(用 improvement-discriminator)或全流程编排(用 improvement-orchestrator)。
1