design-ai-benchmarking
Design-AI-Benchmarking Skill
Purpose
This skill pressure-tests an AI-vs-human-expert benchmark before any ratings are collected, so that
the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the
reported reliability is interpretable. It is the AI-evaluation specialization of /design-study: where
/design-study reviews a study in general, this skill owns the specific machinery of comparing AI
system(s) to a panel of human experts (or to each other) on rated outputs.
Use it when:
- one or more AI systems will be scored against a human-expert reference (reader study, annotation panel, AI-output evaluation, model-vs-model bench)
- a rubric and rating protocol must be locked before reviewers begin
- a benchmark looks vulnerable to criticism such as "the highest score is just the most tautological item" or "low agreement, but we cannot tell why"
- a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias
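
As one illustration of why reported reliability needs to be interpretable, the sketch below compares raw percent agreement with Cohen's kappa for two raters on a skewed pass/fail scale. It is a minimal example under stated assumptions: the function names, the labels, and the choice of Cohen's kappa are hypothetical and are not taken from this skill, which may prescribe a different statistic.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement for two raters on nominal labels."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if each rater labelled independently
    # according to their own marginal label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label for every item, so kappa
        # is undefined; returning 1.0 is a convention assumed here.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings: both raters say "pass" for almost every item.
rater_a = ["pass"] * 9 + ["fail"]
rater_b = ["pass"] * 8 + ["fail", "pass"]

raw = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"percent agreement = {raw:.2f}, Cohen's kappa = {kappa:.2f}")
```

In this made-up case the raters agree on most items, yet the chance-corrected figure is below zero, because nearly all ratings fall in a single category. A poorly calibrated scale can make an agreement number hard to read in exactly this way, which is why the scale and rubric are checked before any ratings are collected.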