context-eval
Context Eval
Run the same tasks with and without the harness, grade the outputs, and measure the delta. No measurable improvement means the harness is token tourism: context that costs tokens without improving results.
The Eval Loop
1. Define what you're evaluating (the harness)
2. Write 3-5 realistic task prompts
3. Define success criteria (assertions)
4. Run tasks WITH and WITHOUT (you MUST actually run — see Step 4)
5. Grade both against assertions
6. Compare: did the harness help?
7. If iterating: modify, repeat from step 4
Use tasks to track progress: the loop is multi-step, and tracking prevents skipping steps.
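The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: `Task`, `pass_rate`, `evaluate`, and `mean_delta` are hypothetical names, assertions are assumed to be simple predicates over the output text, and `run_with` / `run_without` are placeholders for whatever actually executes a prompt with and without the harness. The stub runners in the usage section exist only so the sketch executes; a real eval has to run the tasks for real.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical shape: an assertion is a (name, predicate) pair over the output text.
Assertion = tuple[str, Callable[[str], bool]]


@dataclass
class Task:
    prompt: str
    assertions: list[Assertion]


def pass_rate(output: str, assertions: list[Assertion]) -> float:
    """Fraction of assertions that the output satisfies (0.0 if there are none)."""
    if not assertions:
        return 0.0
    passed = sum(1 for _, check in assertions if check(output))
    return passed / len(assertions)


def evaluate(
    tasks: list[Task],
    run_with: Callable[[str], str],
    run_without: Callable[[str], str],
) -> list[dict]:
    """Run every task both ways, grade both outputs, and record the delta."""
    rows = []
    for task in tasks:
        with_score = pass_rate(run_with(task.prompt), task.assertions)
        without_score = pass_rate(run_without(task.prompt), task.assertions)
        rows.append({
            "prompt": task.prompt,
            "with": with_score,
            "without": without_score,
            "delta": with_score - without_score,
        })
    return rows


def mean_delta(rows: list[dict]) -> float:
    """Average improvement across tasks; at or below zero means no measurable help."""
    return sum(row["delta"] for row in rows) / len(rows)


if __name__ == "__main__":
    # Stub runners for illustration only; replace with real runs.
    tasks = [
        Task(
            prompt="Summarize the changelog",
            assertions=[
                ("mentions version", lambda out: "v2" in out),
                ("mentions fix", lambda out: "fix" in out),
            ],
        )
    ]
    rows = evaluate(
        tasks,
        run_with=lambda prompt: "v2 ships a fix",
        run_without=lambda prompt: "v2 is out",
    )
    for row in rows:
        print(row)
    print("mean delta:", mean_delta(rows))
```

Scoring each run as a pass rate keeps the comparison to a single number per task, so the with/without delta is easy to read; with only 3-5 tasks, a small delta should be treated as noise rather than evidence.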