Investigating Eval Results

This skill helps you diagnose why AI coding agents are failing evaluations, specifically looking for discrepancies between guided and unguided performance.

Core Philosophy: Immediate Resolution

Fix It Now: Do not create tracking issues for use-case problems or defer the work. The goal of an investigation is to identify the root cause and implement the fix immediately in the active session.
Platform Boundary: When investigating an eval, modify only use-case-specific files (i.e., task.md, grader.ts, expectations.md, demo apps, and guide.md). Do not attempt to fix bugs in the underlying platform infrastructure or the Playwright test environment. If you identify an infrastructure issue, describe it clearly to the user and suggest filing an issue on GitHub for the engineering team, so that the use-case investigation stays focused.
Success Rate Goal: The objective of every investigation is a 100% Guided Pass Rate. The unguided pass rate is not a target; use it only as a point of comparison when diagnosing guided failures.
Autonomous Initiative & Iteration: An investigation is not a single pass. You must autonomously loop through fixing files, re-running evaluations, measuring progress, and rolling back failed attempts until the guided pass rate reaches 100%. Never stop early, and re-run the evaluation several times to confirm that the fix passes consistently and is not flaky.
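
The fix, re-run, measure, and roll-back loop described above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: `investigate`, `apply_fix`, `rollback`, and `run_guided_eval` are hypothetical names, and the eval is represented as a callable returning one pass/fail boolean per guided run, since the actual `gd eval` output format is not specified here. The default of three confirmation runs is likewise an assumption.

```python
from typing import Callable, Iterable


def guided_pass_rate(results: list[bool]) -> float:
    """Fraction of guided runs that passed (0.0 when there are no runs)."""
    return sum(results) / len(results) if results else 0.0


def stable_rate(run_guided_eval: Callable[[], list[bool]], runs: int) -> float:
    """Lowest guided pass rate seen across `runs` repeated evaluations.

    Taking the minimum means a flaky fix cannot look like a 100% pass.
    """
    return min(guided_pass_rate(run_guided_eval()) for _ in range(runs))


def investigate(
    candidate_fixes: Iterable[str],
    apply_fix: Callable[[str], None],        # hypothetical: edits use-case files
    rollback: Callable[[str], None],         # hypothetical: reverts that edit
    run_guided_eval: Callable[[], list[bool]],  # hypothetical wrapper around the eval
    confirm_runs: int = 3,                   # assumed number of repeat runs
) -> tuple[list[str], float]:
    """Apply fixes one at a time, keeping improvements and rolling back the rest.

    Returns the fixes that were kept and the final stable guided pass rate.
    """
    best = stable_rate(run_guided_eval, confirm_runs)
    kept: list[str] = []
    if best == 1.0:
        return kept, best  # already at the goal; nothing to fix

    for fix in candidate_fixes:
        apply_fix(fix)
        rate = stable_rate(run_guided_eval, confirm_runs)
        if rate <= best:
            rollback(fix)  # failed attempt: no measurable progress
            continue
        best = rate
        kept.append(fix)
        if best == 1.0:
            break  # every repeated run passed, so the fix is stable
    return kept, best
```

The sketch ends when the list of candidate fixes is exhausted; in a real investigation, a result below 1.0 means generating new candidate fixes and continuing rather than stopping.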

Communication Protocol

Because evaluation runs (gd eval) take time, check in with the user approximately every 30 seconds to provide a helpful narrative summary of what the agent is currently doing.

Whenever you summarize progress during these check-ins, you MUST:

Include a direct quote or code block of the underlying log lines to substantiate your update.
Provide a clickable markdown link to the specific log file being referenced so the user can open the full contents.

However, NEVER include timestamps in your updates; they add no value for the user.
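
One way to assemble a check-in that satisfies these rules is sketched below. The helper name `format_check_in` is hypothetical, and the sketch assumes that log lines begin with an ISO-8601-style timestamp (optionally bracketed), which is a guess about the log format rather than something this document specifies. It quotes the log lines in a code block, strips the assumed leading timestamps, and appends a markdown link to the log file.

```python
import re
from pathlib import PurePosixPath

# Assumed log format: an optional "[", an ISO-8601-style date and time,
# an optional fraction and UTC offset, an optional "]", then whitespace.
_LEADING_TIMESTAMP = re.compile(
    r"^\[?\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
    r"(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?\]?\s*"
)

# Built at runtime so this example contains no literal nested code fence.
FENCE = "`" * 3


def format_check_in(summary: str, log_lines: list[str], log_path: str) -> str:
    """Build a check-in message: narrative summary, quoted log lines, log link."""
    quoted = [_LEADING_TIMESTAMP.sub("", line.rstrip("\n")) for line in log_lines]
    link = f"[{PurePosixPath(log_path).name}]({log_path})"
    return "\n".join([summary, "", FENCE, *quoted, FENCE, "", f"Full log: {link}"])
```

For example, calling `format_check_in("Agent is re-running the grader.", lines, "logs/run-1/agent.log")` (an illustrative path) yields the summary, a fenced quote of `lines` without their leading timestamps, and a link labelled `agent.log`.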
