Design-AI-Benchmarking Skill

Purpose

This skill pressure-tests an AI-vs-human-expert benchmark before any ratings are collected, so that the comparison is fair, the rubric measures distinct constructs, the scale is calibrated, and the reported reliability is interpretable. It is the AI-evaluation specialization of /design-study: where /design-study reviews a study in general, this skill owns the specific machinery of comparing AI system(s) to a panel of human experts (or to each other) on rated outputs.

Use it when:

one or more AI systems will be scored against a human-expert reference (reader study, annotation panel, AI-output evaluation, model-vs-model benchmark)
a rubric and rating protocol must be locked before reviewers begin
a benchmark feels vulnerable to criticism such as "the highest score is just the most tautological item" or "low agreement, but we cannot tell why"
a reviewer or editor asks how the evaluation controlled for rater drift, leakage, or judge bias
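
To illustrate the "low agreement, but we cannot tell why" concern, here is a minimal sketch that computes agreement per rubric criterion rather than only in aggregate. It is not part of the skill itself: the choice of unweighted Cohen's kappa, the helper name cohen_kappa, the criterion names, and the ratings are all assumptions made for this example.

```python
from collections import Counter
import math


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: two raters, six items, two rubric criteria.
ratings = {
    "accuracy": ([1, 2, 3, 3, 2, 1], [1, 2, 3, 3, 2, 1]),
    "clarity": ([1, 2, 3, 3, 2, 1], [3, 1, 2, 1, 3, 2]),
}

# Agreement broken out per criterion, so a weak criterion is visible on its own.
per_criterion = {name: cohen_kappa(r1, r2) for name, (r1, r2) in ratings.items()}

for name, kappa in per_criterion.items():
    print(f"{name}: kappa = {kappa:.2f}")
```

In this made-up data the raters agree on every "accuracy" score and on no "clarity" score, so a pooled agreement figure would look mediocre while hiding that the disagreement comes from a single criterion. A real benchmark would typically involve more than two raters and ordinal scales, for which a weighted or multi-rater statistic may be more appropriate.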
