Context Eval

Evaluate whether context engineering artifacts actually improve agent outcomes.

The Core Question

Every context harness — whatever its format or delivery mechanism — costs tokens and claims to produce better results. This skill answers the question: does it?

The method is simple: run the same tasks with and without the context, grade the outputs, measure the delta. If the context doesn't produce a measurable improvement, it's not context engineering — it's token tourism.
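As a rough sketch of that method (not a prescribed implementation), the loop below runs each task twice, once without the harness and once with it, grades both outputs, and reports the mean score delta. The run_agent and grade functions are placeholders you would wire up to your own agent call and grading rubric; neither name comes from this skill.

```python
# Minimal sketch of the with/without-context eval loop, assuming you supply
# your own agent runner and grader. Nothing here is part of the skill itself.
from statistics import mean

def run_agent(task: str, context: str | None) -> str:
    """Call your agent on the task, optionally injecting the context artifact."""
    raise NotImplementedError  # wire up to your agent / LLM client

def grade(task: str, output: str) -> float:
    """Score an output for the task on a 0-1 scale (rubric, tests, or LLM judge)."""
    raise NotImplementedError

def eval_context(tasks: list[str], context: str) -> float:
    """Return the mean score delta attributable to the context artifact."""
    deltas = []
    for task in tasks:
        baseline = grade(task, run_agent(task, context=None))
        with_ctx = grade(task, run_agent(task, context=context))
        deltas.append(with_ctx - baseline)
    return mean(deltas)  # a delta at or below zero means the harness isn't earning its tokens
```

Grading can be as simple as pass/fail against a checklist or as involved as an LLM judge; what matters is that the same grader scores both arms so the delta isolates the harness.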

What You Can Evaluate

This skill works on any context artifact that shapes agent behavior, regardless of format, delivery mechanism, or which LLM runs it. If it occupies tokens in the agent's working memory and claims to improve outcomes, it's a harness and you can evaluate it.

Common examples include project-level rules and instructions, coding guidelines, domain documentation, retrieval-augmented generation pipelines, tool and integration configurations, few-shot examples, and system-level prompts — but the skill doesn't prescribe what the harness looks like. Step 1 discovers that.

The Eval Loop
