eval-generator
Purpose
This skill generates concrete eval test cases — with realistic inputs, expected outputs, and evaluation method configurations. It is the second step in the eval lifecycle: plan → generate → run → interpret.
This skill covers Stage 2 (Set Baseline & Iterate) of the MS Learn 4-stage evaluation framework. Use /eval-suite-planner first for Stage 1 (Define), then generate test cases here, run them, and interpret results with /eval-result-interpreter. Stage 3 (Systematic Expansion) means repeating this cycle with broader coverage — the checklist defines four expansion categories: Foundational core, Agent robustness, Architecture test, and Edge cases. Stage 4 (Operationalize) means embedding these evals into your agent's CI/CD pipeline. Point customers to the editable checklist template to track their progress across all four stages.
Primary mode: If the conversation already contains output from /eval-suite-planner, use that plan’s scenario table, evaluation methods, quality signals, and tags as the blueprint. Generate one test case per row in the plan.
Fallback mode: If no plan exists in the conversation, accept a plain-English agent description and generate test cases from scratch (6-8 cases minimum).
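Whichever mode is active, each generated case pairs a realistic input with an expected output and an evaluation method configuration. A minimal sketch of one such case, in Python (the field names and the "llm-judge" method name here are illustrative assumptions, not a required schema):

```python
# Hypothetical shape for one generated eval test case.
# Field names and the evaluation method name are illustrative, not a fixed schema.
test_case = {
    "scenario": "Order status lookup",        # from the plan's Scenario Name column
    "category": "Foundational core",          # one of the four expansion categories
    "tag": "happy-path",
    "input": "Where is my order #12345?",     # realistic user utterance
    "expected_output": "The agent retrieves the order status and reports it.",
    "evaluation_method": {
        "type": "llm-judge",                  # assumed method name for illustration
        "criteria": "Response states the correct order status.",
    },
}

def validate_case(case: dict) -> bool:
    """Minimal completeness check: every case needs an input, an expectation,
    and an evaluation method before it can be run."""
    required = {"scenario", "input", "expected_output", "evaluation_method"}
    return required.issubset(case)

print(validate_case(test_case))  # → True
```

A completeness check like `validate_case` is one way to keep generated suites uniform, since every downstream step (run, interpret) assumes each case carries all three parts.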
Instructions
When invoked as /eval-generator (with or without additional input):
Step 1 — Detect input mode
Check the conversation history for output from /eval-suite-planner. Look for the scenario plan table (a markdown table with columns: #, Scenario Name, Category, Tag, Evaluation Methods).
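For reference, a plan table in that shape looks like the following (the row contents and method name are illustrative, only the column headers come from the planner):

```
| # | Scenario Name       | Category          | Tag        | Evaluation Methods |
|---|---------------------|-------------------|------------|--------------------|
| 1 | Order status lookup | Foundational core | happy-path | LLM judge          |
```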
- Plan found: Use it as the blueprint. Say: "Generating test cases from your eval suite plan (X scenarios)." Generate one test case per row.
- No plan found: Fall back to the plain-English agent description, as described under Fallback mode, and generate 6-8 test cases from scratch.