What-Works Feedback Judge

A skill that pairs a four-question feedback frame (Working / Not working / Missing / Confusing) with the evidence-based scoring methodology from Don't Let the LLM Pick a Number. The LLM never picks a score directly. It collects discrete evidence items with bounded magnitudes in each bucket. Math computes the readiness score. The user gets a number plus four grouped action lists.

Why this exists

Most LLM feedback is either vague praise ("this is great!") or unstructured critique. Both waste the user's time. This skill forces:

Specificity. Each bucket demands concrete items with discrete magnitudes. No hand-waves.
Balance. The four-bucket structure prevents pure-praise or pure-critique answers.
Depth via density. A confidence multiplier penalizes shallow critique: the formula treats three evidence items as too little engagement to support a confident verdict.
Measurability. When the user re-runs after edits, the v1 → v2 score delta tells them whether the revision actually moved the needle.

The methodology comes from the paper Don't Let the LLM Pick a Number: models are unreliable at picking numeric scores directly, but reliable at collecting bounded evidence items. Math turns the items into a defensible number.
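As a rough illustration of "math turns the items into a number", here is a minimal Python sketch. The paper's actual formula is not reproduced in this section, so everything specific here is an assumption made for illustration: the `Evidence` class, the 1 to 3 magnitude range, the `full_confidence_at=8` threshold, and the choice to score the share of "working" magnitude scaled by a density-based confidence multiplier.

```python
from dataclasses import dataclass

# The four buckets of the feedback frame.
BUCKETS = ("working", "not_working", "missing", "confusing")


@dataclass(frozen=True)
class Evidence:
    """One discrete evidence item (hypothetical structure)."""
    bucket: str      # one of BUCKETS
    magnitude: int   # bounded magnitude; the 1..3 range is an assumption
    note: str = ""   # the concrete observation behind the item


def readiness_score(items, full_confidence_at=8):
    """Turn evidence items into a 0-100 score (illustrative formula only).

    raw        = share of total magnitude that sits in the "working" bucket
    confidence = item count / full_confidence_at, capped at 1.0
    score      = 100 * raw * confidence
    """
    for item in items:
        if item.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {item.bucket!r}")
        if not 1 <= item.magnitude <= 3:
            raise ValueError(f"magnitude out of bounds: {item.magnitude}")
    if not items:
        return 0.0  # no evidence, no confidence

    total = sum(item.magnitude for item in items)
    positive = sum(item.magnitude for item in items if item.bucket == "working")
    raw = positive / total
    confidence = min(1.0, len(items) / full_confidence_at)
    return 100.0 * raw * confidence


# A first draft with only five evidence items: the multiplier pulls the score down.
v1 = [
    Evidence("working", 3, "clear problem statement"),
    Evidence("working", 3, "concrete example in the intro"),
    Evidence("not_working", 2, "pricing section contradicts itself"),
    Evidence("missing", 1, "no rollout plan"),
    Evidence("confusing", 1, "undefined acronym"),
]

# A revision judged with eight items, three of them new strengths.
v2 = v1 + [Evidence("working", 2, f"new strength {n}") for n in range(3)]

delta = readiness_score(v2) - readiness_score(v1)
print(f"v1={readiness_score(v1):.1f} v2={readiness_score(v2):.1f} delta={delta:+.1f}")
```

The model's only job in this sketch is to produce the `Evidence` list; the number itself comes from arithmetic, so re-running after edits yields a comparable v1 → v2 delta.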

When to trigger

Trigger this skill whenever the user shares any productive-output idea and wants feedback. This includes:
