Skill eval & improve

Improve skills measurably: baseline → measure → bounded edit → re-validate. Combine local tooling, Codex plugin-eval (when installed), and research-backed loops (SkillOpt).
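One iteration of the baseline → measure → bounded edit → re-validate loop can be sketched as below. This is a minimal illustration, not Guild tooling: `bounded_edit_step`, `measure`, `propose_edit`, and `max_changed_lines` are hypothetical names, and the score is assumed to be "higher is better".

```python
import difflib
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    text: str       # skill text to keep after this step
    score: float    # score of the kept text
    accepted: bool  # True only if the candidate edit was kept


def changed_lines(before: str, after: str) -> int:
    """Count lines added or removed between two versions of a skill file."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def bounded_edit_step(
    skill_text: str,
    measure: Callable[[str], float],     # hypothetical scorer, higher is better
    propose_edit: Callable[[str], str],  # hypothetical editor (human or tool)
    max_changed_lines: int = 10,         # assumed bound on edit size
) -> StepResult:
    baseline = measure(skill_text)                 # baseline
    candidate = propose_edit(skill_text)           # bounded edit
    if changed_lines(skill_text, candidate) > max_changed_lines:
        return StepResult(skill_text, baseline, False)  # edit too large: reject
    new_score = measure(candidate)                 # re-validate
    if new_score > baseline:
        return StepResult(candidate, new_score, True)
    return StepResult(skill_text, baseline, False)      # no gain: keep original
```

Keeping the edit bound explicit means a regression can be traced to a small, reviewable change rather than a wholesale rewrite.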

When to use

Skill triggers on the wrong requests or never loads (description routing)
Bloated SKILL.md, high token cost, or weak outcomes
After adding a new procedure, when regression checks are needed
Porting patterns from product MCP / plugin-eval research into Guild skills
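Two of the symptoms above, routing mistakes and file bloat, can be turned into numbers to record as a baseline. The sketch below is illustrative only: `RoutingCase`, `routing_accuracy`, and `approx_tokens` are hypothetical names, the `triggers` callable stands in for whatever actually decides whether the skill loads, and the four-characters-per-token estimate is a rough rule of thumb rather than a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RoutingCase:
    prompt: str           # user request to test
    should_trigger: bool  # whether the skill is expected to load for it


def routing_accuracy(
    cases: Sequence[RoutingCase],
    triggers: Callable[[str], bool],  # hypothetical stand-in for the real router
) -> float:
    """Fraction of cases where the router's decision matches the expectation."""
    if not cases:
        raise ValueError("at least one routing case is required")
    hits = sum(1 for c in cases if triggers(c.prompt) == c.should_trigger)
    return hits / len(cases)


def approx_tokens(text: str) -> int:
    """Very rough size estimate (assumes about 4 characters per token)."""
    return len(text) // 4
```

Re-running the same cases after each edit gives a simple regression check: the accuracy should not drop, and the size estimate should not grow without a matching gain.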

When not to use

Bulk repo validation: for requests like “validate every skill in this repo”, run pnpm run validate only (use skill-spec-review for an audit); do not start benchmark or SkillOpt loops.
Automated SkillOpt / cluster training: Guild documents a manual bounded-edit loop only; there is no overnight optimizer pipeline.
Creating a new skill: use create-skill first; eval-improve applies only after a skill exists.

Cursor scope (optional): activate when editing under skills/** or scripts/validate-skills.mjs.
