Agentic Eval-First Development
Evals are infrastructure, not afterthoughts. Define success criteria before writing prompts or task logic. The eval becomes the spec.
Framework: Data → Task → Scores
Every eval has exactly three components:
- Data — Golden dataset of inputs (the test cases)
- Task — The operation being evaluated (LLM call, agent workflow, MCP pipeline)
- Scores — Categorical rubric that maps outputs to normalized 0–1 values
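The three components above can be sketched as a small harness. This is a minimal illustration, not the skill's actual implementation: the names `Case`, `RUBRIC`, `run_eval`, `toy_task`, and `toy_grade` are hypothetical, and the toy task stands in for a real LLM call or agent workflow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical categorical rubric: each label maps to a normalized 0-1 score.
RUBRIC = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


@dataclass
class Case:
    """One golden-dataset entry (Data)."""
    input: str
    expected: str


def run_eval(
    data: Sequence[Case],
    task: Callable[[str], str],
    grade: Callable[[str, str], str],
) -> float:
    """Run the Task on every Case and return the mean rubric score."""
    if not data:
        raise ValueError("golden dataset is empty")
    scores = [RUBRIC[grade(task(case.input), case.expected)] for case in data]
    return sum(scores) / len(scores)


def toy_task(text: str) -> str:
    """Stand-in for the operation under evaluation (assumed, not an LLM call)."""
    return text.strip().upper()


def toy_grade(output: str, expected: str) -> str:
    """Assign a rubric label (Scores) by comparing output to the expectation."""
    if output == expected:
        return "correct"
    if output.lower() == expected.lower():
        return "partial"
    return "incorrect"


data = [Case(" hello ", "HELLO"), Case("world", "World"), Case("foo", "bar")]
mean_score = run_eval(data, toy_task, toy_grade)
print(mean_score)
```

Keeping the grader categorical and mapping labels to numbers in one place means the rubric can be reviewed and changed without touching the task code.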
Step 1: Define the PRD (Data & Scores)
Build the Golden Dataset
Collect or generate 10–20 representative inputs covering the full range of expected usage.
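One way to check that a dataset meets the size and coverage guidance is a small validation helper. The sketch below is an assumption-laden example: the `category` field, the category names, and the `check_golden_dataset` function are hypothetical, chosen only to illustrate "covering the full range of expected usage".

```python
from collections import Counter


def check_golden_dataset(cases, required_categories, min_size=10, max_size=20):
    """Return a list of problems; an empty list means the dataset passes."""
    problems = []
    if not (min_size <= len(cases) <= max_size):
        problems.append(f"expected {min_size}-{max_size} cases, got {len(cases)}")
    # Count how many cases fall into each (hypothetical) usage category.
    counts = Counter(case["category"] for case in cases)
    for category in sorted(required_categories):
        if counts[category] == 0:
            problems.append(f"no cases for category '{category}'")
    return problems


# Illustrative dataset: eight typical inputs plus two edge cases.
golden = [
    {"input": f"typical request {i}", "category": "typical"} for i in range(8)
] + [
    {"input": "", "category": "edge_case"},
    {"input": "x" * 5000, "category": "edge_case"},
]

problems = check_golden_dataset(golden, {"typical", "edge_case", "adversarial"})
print(problems)
```

Running a check like this before writing any prompt keeps gaps in the dataset visible early, when adding a few more cases is still cheap.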