benchmark-skills

Benchmark Skills

Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared with a no-skill baseline.

The Core Principle

Only two types of skills produce a measurable benchmark delta:

  1. Behavioral suppression — The skill suppresses patterns the model naturally produces. The baseline consistently exhibits the bad behavior; the skill stops it. This is the highest-signal category.
  2. Genuinely novel knowledge — The skill injects domain knowledge NOT in the model's training data. If a knowledgeable human would need to look it up, the model probably doesn't know it either.

What does NOT produce a delta (don't waste time benchmarking these):

  • Knowledge the model already has (common frameworks, well-known patterns)
  • General quality improvement without a specific behavioral target
  • Skills requiring real system access (filesystem, APIs, browsers)
  • Skills requiring multi-turn interaction
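
One way to read the "delta" above is as the difference in pass rate between runs with the skill and baseline runs without it. The following is a minimal sketch under that assumption, with each eval reduced to a single pass/fail outcome. The names `EvalResult`, `pass_rate`, and `benchmark_delta` are hypothetical and are not taken from the benchmark harness itself.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical record: the outcome of one eval case under one condition."""
    eval_id: str
    passed: bool


def pass_rate(results):
    """Return the fraction of results that passed (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def benchmark_delta(baseline, with_skill):
    """Pass rate with the skill minus pass rate without it (assumed definition)."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Illustrative data only: the baseline fails most cases, the skill fixes two.
baseline = [
    EvalResult("e1", False),
    EvalResult("e2", False),
    EvalResult("e3", True),
    EvalResult("e4", False),
]
with_skill = [
    EvalResult("e1", True),
    EvalResult("e2", True),
    EvalResult("e3", True),
    EvalResult("e4", False),
]

delta = benchmark_delta(baseline, with_skill)
print(f"delta = {delta:+.2f}")
```

Under this reading, a skill that only restates knowledge the model already has would tend to show a delta near zero, because the baseline already passes the same cases.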

Pre-Flight Checklist

More from b-open-io/prompts

First Seen
Mar 10, 2026