benchmark-skills

Benchmark Skills

Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to a baseline run with no skill.
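As a rough illustration of what "helps compared to baseline" means numerically, the sketch below computes a delta as the pass rate with the skill minus the pass rate without it. The `EvalResult` record, the `pass_rate` and `benchmark_delta` helpers, and the sample results are all hypothetical; they are not the harness's actual data format or API.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One eval case graded under both conditions (hypothetical record shape)."""
    case_id: str
    passed_baseline: bool    # graded output produced without the skill
    passed_with_skill: bool  # graded output produced with the skill loaded


def pass_rate(flags: list[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def benchmark_delta(results: list[EvalResult]) -> float:
    """Pass rate with the skill minus pass rate at baseline."""
    baseline = pass_rate([r.passed_baseline for r in results])
    with_skill = pass_rate([r.passed_with_skill for r in results])
    return with_skill - baseline


# Illustrative data only, not real benchmark output.
results = [
    EvalResult("case-1", passed_baseline=False, passed_with_skill=True),
    EvalResult("case-2", passed_baseline=False, passed_with_skill=True),
    EvalResult("case-3", passed_baseline=True, passed_with_skill=True),
    EvalResult("case-4", passed_baseline=False, passed_with_skill=False),
]
print(f"delta = {benchmark_delta(results):+.2f}")  # prints "delta = +0.50"
```

A delta near zero suggests the skill changes nothing the evals can detect, which is the situation the rest of this document tries to help you avoid.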

The Core Principle

Only two types of skills produce a measurable benchmark delta:

Behavioral suppression — The skill suppresses patterns the model naturally produces. The baseline consistently exhibits the bad behavior; the skill stops it. This is the highest-signal category.
Genuinely novel knowledge — The skill injects domain knowledge NOT in the model's training data. If a knowledgeable human would need to look it up, the model probably doesn't know it either.
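A behavioral-suppression eval can often be reduced to a pattern check: the baseline output shows the unwanted pattern, and the skill output does not. The sketch below assumes a made-up target behavior (sycophantic openers such as "Great question") and hypothetical helper names; the sample outputs are invented for illustration and the real harness may grade differently.

```python
import re

# Hypothetical target behavior: the response opens with a sycophantic phrase.
BANNED = re.compile(r"^\s*(great question|certainly|absolutely)\b", re.IGNORECASE)


def exhibits_pattern(output: str) -> bool:
    """True if the output starts with one of the banned openers."""
    return BANNED.search(output) is not None


def pattern_rate(outputs: list[str]) -> float:
    """Fraction of outputs that exhibit the banned pattern; 0.0 if empty."""
    if not outputs:
        return 0.0
    return sum(exhibits_pattern(o) for o in outputs) / len(outputs)


# Invented sample outputs, not real model transcripts.
baseline_outputs = [
    "Great question! Here is the fix.",
    "Certainly, the bug is in line 3.",
    "The bug is in line 3.",
]
skill_outputs = [
    "The bug is in line 3.",
    "Use a lock around the counter.",
    "Rename the variable.",
]

baseline_rate = pattern_rate(baseline_outputs)
skill_rate = pattern_rate(skill_outputs)
print(f"baseline: {baseline_rate:.2f}, with skill: {skill_rate:.2f}")
```

A high baseline rate paired with a low rate under the skill is the signal this category is meant to capture. If the baseline rarely shows the pattern, there is little for the skill to suppress and the delta will be small.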

What does NOT produce a delta (don't waste time benchmarking these):

Knowledge the model already has (common frameworks, well-known patterns)
General quality improvement without a specific behavioral target
Skills requiring real system access (filesystem, APIs, browsers)
Skills requiring multi-turn interaction

Pre-Flight Checklist

Installs: 11
Repository: b-open-io/prompts
GitHub Stars: 13
First Seen: Mar 10, 2026
Security Audits: Gen Agent Trust Hub (Pass)
