what-works-feedback-judge
What-Works Feedback Judge
A skill that pairs a four-question feedback frame (Working / Not working / Missing / Confusing) with the evidence-based scoring methodology from "Don't Let the LLM Pick a Number". The LLM never picks a score directly. It collects discrete evidence items with bounded magnitudes in each bucket. Math computes the readiness score. The user gets a number plus four grouped action lists.
Why this exists
Most LLM feedback is either vague praise ("this is great!") or unstructured critique. Both waste the user's time. This skill forces:
- Specificity. Each bucket demands concrete items with discrete magnitudes. No hand-waves.
- Balance. The four-bucket structure prevents pure-praise or pure-critique answers.
- Depth via density. A confidence multiplier penalizes shallow critique: the formula treats 3 evidence items as too little engagement to support a confident verdict.
- Measurability. When the user re-runs after edits, the v1 → v2 score delta tells them whether the revision actually moved the needle.
The methodology comes from the paper "Don't Let the LLM Pick a Number": models are unreliable at picking numeric scores directly, but reliable at collecting bounded evidence items. Math turns the items into a defensible number.
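To make the "evidence in, math out" idea concrete, here is a minimal Python sketch. This section does not state the skill's actual formula, so everything specific here is an assumption for illustration only: the 1 to 3 magnitude bound, the `Evidence` and `readiness_score` names, the share-of-positive-weight ratio, and the `full_confidence_at=8` threshold for the confidence multiplier.

```python
from dataclasses import dataclass

# The four buckets come from the feedback frame described above.
BUCKETS = ("working", "not_working", "missing", "confusing")
MAX_MAGNITUDE = 3  # hypothetical bound; the real bound is not stated in this section


@dataclass(frozen=True)
class Evidence:
    bucket: str
    note: str
    magnitude: int  # discrete, bounded: 1..MAX_MAGNITUDE

    def __post_init__(self) -> None:
        if self.bucket not in BUCKETS:
            raise ValueError(f"unknown bucket: {self.bucket!r}")
        if not 1 <= self.magnitude <= MAX_MAGNITUDE:
            raise ValueError(f"magnitude out of bounds: {self.magnitude}")


def readiness_score(items: list[Evidence], full_confidence_at: int = 8) -> float:
    """Hypothetical score on a 0-100 scale, computed only from evidence items."""
    if not items:
        return 0.0
    positive = sum(e.magnitude for e in items if e.bucket == "working")
    total = sum(e.magnitude for e in items)
    raw = positive / total  # share of evidence weight that is positive
    # Confidence multiplier: few items means a shallow critique, so scale down.
    confidence = min(1.0, len(items) / full_confidence_at)
    return 100.0 * raw * confidence


# Example: a first pass, then a re-run after the user revises the draft.
v1 = [
    Evidence("working", "clear problem statement", 3),
    Evidence("working", "concrete example in the intro", 2),
    Evidence("not_working", "pricing section contradicts the summary", 1),
    Evidence("missing", "no success metric", 2),
]
v2 = v1[:2] + [
    Evidence("working", "success metric added", 2),
    Evidence("confusing", "acronym undefined on first use", 1),
]
delta = readiness_score(v2) - readiness_score(v1)
print(f"v1={readiness_score(v1):.2f} v2={readiness_score(v2):.2f} delta={delta:+.2f}")
```

The model's only job in this sketch is to produce the `Evidence` items; the number falls out of arithmetic, so the same items always yield the same score and the v1 → v2 delta is directly comparable.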
When to trigger
Trigger this skill whenever the user shares any productive-output idea and wants feedback. This includes: