benchmark-skills
Benchmark Skills
Write evals for skills and run the benchmark harness to measure whether a skill actually helps compared to baseline (no skill).
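As a rough illustration of what "helps compared to baseline" can mean in numbers, the sketch below compares pass rates for the same eval cases run with and without a skill. The `EvalResult` shape and the function names are hypothetical and are not the harness's actual API. The example data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Outcome of one eval case under one condition (hypothetical shape)."""
    case_id: str
    passed: bool


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def benchmark_delta(baseline: list[EvalResult], with_skill: list[EvalResult]) -> float:
    """Pass-rate difference; a positive value means the skill run scored higher."""
    return pass_rate(with_skill) - pass_rate(baseline)


# Illustrative data only: the same four cases, run without and with the skill.
baseline = [
    EvalResult("a", False),
    EvalResult("b", False),
    EvalResult("c", True),
    EvalResult("d", False),
]
with_skill = [
    EvalResult("a", True),
    EvalResult("b", True),
    EvalResult("c", True),
    EvalResult("d", False),
]

delta = benchmark_delta(baseline, with_skill)
print(f"delta = {delta:+.2f}")  # prints "delta = +0.50"
```

A delta near zero suggests the skill changes nothing that the evals can detect, which is the situation the rest of this document tries to help avoid.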
The Core Principle
Only two types of skills produce a measurable benchmark delta:
- Behavioral suppression — The skill suppresses patterns the model naturally produces. The baseline consistently exhibits the bad behavior; the skill stops it. This is the highest-signal category.
- Genuinely novel knowledge — The skill injects domain knowledge NOT in the model's training data. If a knowledgeable human would need to look it up, the model probably doesn't know it either.
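A behavioral-suppression eval can often be reduced to checking whether a known bad pattern appears in the output. The sketch below assumes a hypothetical skill that is meant to stop filler openings. The regex, the function names, and the sample outputs are illustrative assumptions, not part of the harness.

```python
import re

# Hypothetical target behavior: replies that open with filler praise.
# The pattern is an illustrative assumption, not taken from any real skill.
FILLER_OPENING = re.compile(
    r"^\s*(great question|certainly|absolutely)\b", re.IGNORECASE
)


def exhibits_pattern(output: str) -> bool:
    """True if the output starts with one of the filler openings."""
    return FILLER_OPENING.search(output) is not None


def pattern_rate(outputs: list[str]) -> float:
    """Fraction of outputs that exhibit the pattern; 0.0 for an empty list."""
    if not outputs:
        return 0.0
    return sum(exhibits_pattern(o) for o in outputs) / len(outputs)


# Made-up sample outputs for the two conditions.
baseline_outputs = [
    "Great question! The cache is invalidated on write.",
    "Certainly! Here is the config.",
    "The cache is invalidated on write.",
]
skill_outputs = [
    "The cache is invalidated on write.",
    "Here is the config.",
    "Use a write-through cache.",
]

baseline_rate = pattern_rate(baseline_outputs)
skill_rate = pattern_rate(skill_outputs)
print(f"baseline: {baseline_rate:.2f}, with skill: {skill_rate:.2f}")
```

If the baseline rate is already low, there is little behavior left to suppress and the skill is unlikely to show a delta.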
What does NOT produce a delta (don't waste time benchmarking these):
- Knowledge the model already has (common frameworks, well-known patterns)
- General quality improvement without a specific behavioral target
- Skills requiring real system access (filesystem, APIs, browsers)
- Skills requiring multi-turn interaction
Pre-Flight Checklist