eval-faq
Purpose
Answer any question about eval methodology, grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, multi-turn agent evaluation, eval tooling, capability vs. regression evals, and interpreting results — specifically in the context of AI agent evaluation. Guidance is grounded primarily in Microsoft's agent evaluation documentation (MS Learn agent evaluation pages, the Eval Scenario Library, the Triage & Improvement Playbook, and the Eval Guidance Kit), supplemented by select industry sources for topics Microsoft does not cover deeply.
Instructions
When invoked as /eval-faq <question>, follow this process exactly:
Step 1 — Fetch authoritative context before answering
Use this topic-to-URL routing table to decide what to fetch. Fetch FIRST, then answer. Fetch only the URL(s) that match the question topic — do not fetch all URLs every time.
| Question topic | Fetch this URL | Section to extract | Notes |
|---|---|---|---|
| Scenario types, business-problem vs capability scenarios, what cases to write, dataset structure | https://github.com/microsoft/ai-agent-eval-scenario-library |
Business-Problem scenarios, Capability scenarios, eval-set-template | 5 business-problem + 9 capability scenario types |
| Quality signals, policy accuracy, source attribution, personalization, action enablement, privacy | https://github.com/microsoft/ai-agent-eval-scenario-library |
Quality signals section and method mapping tables | Quality signal to evaluation method mapping |
| Red-teaming, adversarial testing, attack surface reduction, XPIA, encoding attacks, ASR metrics | https://github.com/microsoft/ai-agent-eval-scenario-library |
Red-teaming section: Probe-Measure-Harden framework | Red-team ASR thresholds: <2% harmful, <1% PII, <5% jailbreak |
| Evaluation method selection, keyword match vs compare meaning vs general quality | https://github.com/microsoft/ai-agent-eval-scenario-library |
resources/evaluation-method-selection-guide.md | 4 evaluation methods with selection criteria |
| Eval generation, writing eval cases from a prompt template, synthesizing test sets | https://github.com/microsoft/ai-agent-eval-scenario-library |
resources/eval-generation-prompt.md | Template for generating eval cases |
More from microsoft/eval-guide
eval-generator
Generates eval test cases from an eval suite plan (output of /eval-suite-planner) or a plain-English agent description. Supports both single-response and conversation (multi-turn) evaluation modes. Outputs a Copilot Studio test set table, a CSV file for import (single-response only), and a docx report for human review.
31eval-result-interpreter
Analyzes Copilot Studio evaluation CSV results using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.
29eval-suite-planner
Produces a concrete eval suite plan grounded in Microsoft's Eval Scenario Library and MS Learn agent evaluation guidance — scenario types, evaluation methods, quality signals, thresholds, and priority order — before any test cases are generated or evals are run.
28eval-triage-and-improvement
Use this skill when the user''s Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
25eval-guide
Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases they can use immediately. No running agent required. Works from a description, an idea, or even a vague goal. Use when anyone mentions agent evaluation, eval planning, "what should we test", "how do we know if the agent is good", test case generation, or interpreting eval results.
17