Nightly Evaluation Investigation

This skill automates the retrieval and multi-agent comparison of remote nightly evaluation runs from Google Cloud Storage (GCS) to diagnose system-wide guidance health, task drift, and over-prescribed guides.

Core Objectives

Cross-Agent Diagnostics: Compare results across three distinct agents (Claude Code, Codex CLI, and Jetski CLI) to isolate patterns that hold regardless of which agent ran the task.
Guide Discovery Audit: Catch cases where agents skip the expected guide or over-retrieve irrelevant guides.
Health Thresholding: Flag tasks that underperform even with guidance, and tasks that are too easy or over-prescribed (unguided runs already pass easily).
Structured Reporting: Produce Markdown and JSON artifacts that downstream automation can ingest.
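
As a minimal sketch of how the Guide Discovery Audit and Health Thresholding checks could be expressed, the Python below flags a single task result. Everything in it is an assumption made for illustration: the `TaskResult` fields, the flag names, and both threshold values are hypothetical, since this section does not state the skill's real schema or cutoffs. It also assumes that any retrieved guide other than the expected one counts as irrelevant.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoffs; the skill's actual thresholds are not given here.
MIN_GUIDED_PASS_RATE = 0.6    # below this, a guided task "underperforms"
MAX_UNGUIDED_PASS_RATE = 0.9  # at or above this, the task is "too easy"


@dataclass
class TaskResult:
    """Hypothetical per-task summary for one agent's nightly run."""
    task_id: str
    guided_pass_rate: float    # fraction of guided runs that passed (0.0-1.0)
    unguided_pass_rate: float  # fraction of unguided runs that passed (0.0-1.0)
    expected_guide: str
    retrieved_guides: List[str]


def health_flags(result: TaskResult) -> List[str]:
    """Return the health and discovery flags that apply to one task."""
    flags = []
    # Health thresholding: guidance is present but the task still fails often.
    if result.guided_pass_rate < MIN_GUIDED_PASS_RATE:
        flags.append("underperforms_with_guidance")
    # Health thresholding: unguided runs already pass, so the guide adds little.
    if result.unguided_pass_rate >= MAX_UNGUIDED_PASS_RATE:
        flags.append("too_easy_unguided")
    # Guide discovery audit: the agent never opened the expected guide.
    if result.expected_guide not in result.retrieved_guides:
        flags.append("expected_guide_skipped")
    # Guide discovery audit: the agent pulled in guides other than the expected one.
    if any(g != result.expected_guide for g in result.retrieved_guides):
        flags.append("extra_guides_retrieved")
    return flags
```

Running `health_flags` over the same task for each of the three agents and keeping only the flags that all of them share would be one way to surface agent-agnostic patterns. The function only reads its input, which is consistent with the read-only constraint of this skill.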

[!IMPORTANT] CRITICAL CONSTRAINT: READ-ONLY INVESTIGATION. This skill is strictly for diagnostics, investigation, and recommendations. The agent MUST NOT, under any circumstances, edit, modify, create, delete, or otherwise touch any files under the guides/ directory (including task files, guides, graders, expectations, or demos). All suggested fixes must be documented exclusively in the "Actionable Recommendations" section of the generated Markdown investigation report. No code remediation, automated or manual, should be performed.

Quick Reference: Remote Dashboard Results (GCS)