Context Eval

Evaluate whether context engineering artifacts actually improve agent outcomes.

The Core Question

Every context harness — whatever its format or delivery mechanism — costs tokens and claims to produce better results. This skill answers the question: does it?

The method is simple: run the same tasks with and without the context, grade the outputs, measure the delta. If the context doesn't produce a measurable improvement, it's not context engineering — it's token tourism.
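As a rough sketch of that method (not a prescribed implementation), the loop below runs each task twice, once without the harness and once with it, grades both outputs, and reports the mean score delta. The run_agent and grade functions are placeholders you would wire up to your own agent call and grading rubric; neither name comes from this skill.

```python
# Minimal sketch of the with/without-context eval loop, assuming you supply
# your own agent runner and grader. Nothing here is part of the skill itself.
from statistics import mean

def run_agent(task: str, context: str | None) -> str:
    """Call your agent on the task, optionally injecting the context artifact."""
    raise NotImplementedError  # wire up to your agent / LLM client

def grade(task: str, output: str) -> float:
    """Score an output for the task on a 0-1 scale (rubric, tests, or LLM judge)."""
    raise NotImplementedError

def eval_context(tasks: list[str], context: str) -> float:
    """Return the mean score delta attributable to the context artifact."""
    deltas = []
    for task in tasks:
        baseline = grade(task, run_agent(task, context=None))
        with_ctx = grade(task, run_agent(task, context=context))
        deltas.append(with_ctx - baseline)
    return mean(deltas)  # a delta at or below zero means the harness isn't earning its tokens
```

Grading can be as simple as pass/fail against a checklist or as involved as an LLM judge; what matters is that the same grader scores both arms so the delta isolates the harness.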

What You Can Evaluate

This skill works on any context artifact that shapes agent behavior, regardless of format, delivery mechanism, or which LLM runs it. If it occupies tokens in the agent's working memory and claims to improve outcomes, it's a harness and you can evaluate it.

Common examples include project-level rules and instructions, coding guidelines, domain documentation, retrieval-augmented generation pipelines, tool and integration configurations, few-shot examples, and system-level prompts — but the skill doesn't prescribe what the harness looks like. Step 1 discovers that.

The Eval Loop
