Context Eval

Run the same tasks with and without the harness, grade outputs, measure the delta. No measurable improvement → token tourism.

The Eval Loop

1. Define what you're evaluating (the harness)
2. Write 3-5 realistic task prompts
3. Define success criteria (assertions)
4. Run tasks WITH and WITHOUT the harness (you MUST actually run them; see Step 4)
5. Grade both against assertions
6. Compare: did the harness help?
7. If iterating: modify, repeat from step 4
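
The loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's own implementation: Task, grade, run_eval, and stub_run are hypothetical names, assertions are modelled as simple predicates over the output text, and stub_run stands in for whatever actually executes a task with or without the harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: an assertion is a (name, predicate) pair over the raw output text.
Assertion = Tuple[str, Callable[[str], bool]]


@dataclass
class Task:
    prompt: str
    assertions: List[Assertion]


def grade(output: str, assertions: List[Assertion]) -> float:
    """Return the fraction of assertions the output satisfies."""
    if not assertions:
        return 0.0
    passed = sum(1 for _, check in assertions if check(output))
    return passed / len(assertions)


def run_eval(
    tasks: List[Task],
    run_task: Callable[[str, bool], str],
) -> Tuple[List[Dict[str, object]], float]:
    """Run every task with and without the harness and compare the scores.

    run_task(prompt, with_harness) is supplied by the caller (hypothetical
    signature); it must really execute the task and return the output text.
    """
    rows: List[Dict[str, object]] = []
    for task in tasks:
        with_score = grade(run_task(task.prompt, True), task.assertions)
        without_score = grade(run_task(task.prompt, False), task.assertions)
        rows.append({
            "prompt": task.prompt,
            "with": with_score,
            "without": without_score,
            "delta": with_score - without_score,
        })
    mean_delta = sum(r["delta"] for r in rows) / len(rows) if rows else 0.0
    return rows, mean_delta


# Stand-in runner for illustration only; a real eval must run the actual tasks.
def stub_run(prompt: str, with_harness: bool) -> str:
    base = "def add(a, b): return a + b"
    return base + "  # tested" if with_harness else base


tasks = [
    Task("Write add()", [("has def", lambda o: "def " in o),
                         ("mentions tests", lambda o: "tested" in o)]),
    Task("Return a value", [("has return", lambda o: "return" in o)]),
]

rows, mean_delta = run_eval(tasks, stub_run)
print(mean_delta)  # 0.25
```

A mean delta at or near zero across the task set is the "no measurable improvement" case. When iterating, only the harness changes; the tasks and assertions stay fixed so that runs remain comparable.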

Use tasks to track progress: the loop is multi-step, and tracking prevents skipped steps.

Repository

andurilcode/skills

First Seen

Mar 18, 2026

Security Audits

Gen Agent Trust Hub: Pass

Socket: Pass

Snyk: Pass