# eval-result-interpreter

## Purpose
This skill takes eval results — a Copilot Studio evaluation CSV file, a pasted summary, or a plain-English description of results — and produces a structured triage report. It is the final step in the eval lifecycle: plan → generate → run → interpret. The output tells you whether to ship, what broke, why it broke, and what to fix first.
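To make the input and output concrete, the sketch below shows the kind of per-signal summary that underlies a triage report. It is a minimal illustration, not this skill's implementation, and the column names (`quality_signal`, `result`, `test_case`) are assumptions rather than the actual Copilot Studio export schema.

```python
# Minimal sketch of a per-quality-signal pass-rate summary.
# Column names are assumptions for illustration, not the real export schema.
import csv
from collections import Counter, defaultdict

def summarize(path: str) -> None:
    totals: Counter = Counter()
    failures: defaultdict = defaultdict(list)

    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            signal = row["quality_signal"]                       # assumed column name
            passed = row["result"].strip().lower() == "pass"     # assumed column name
            totals[signal] += 1
            if not passed:
                failures[signal].append(row["test_case"])        # assumed column name

    for signal, total in totals.items():
        failed = len(failures[signal])
        rate = 100 * (total - failed) / total
        print(f"{signal}: {rate:.0f}% pass, {failed} failure(s)")
```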
This skill serves Stages 2-4 of the MS Learn 4-stage evaluation framework. In Stage 2 (Set Baseline & Iterate), it interprets your first eval results and guides fixes. In Stage 3 (Systematic Expansion), it identifies coverage gaps worth expanding into. In Stage 4 (Operationalize), it triages regression failures after agent updates. Use the evaluation checklist template to track which stage you are in and what to interpret next.
Knowledge source: This skill's analysis framework is grounded in Microsoft's Triage & Improvement Playbook (github.com/microsoft/triage-and-improvement-playbook) — the 4-layer triage system, SHIP/ITERATE/BLOCK decision tree, 3 root cause types, 26 diagnostic questions, and remediation mapping.
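As a hedged illustration of the SHIP/ITERATE/BLOCK decision tree, the sketch below turns an overall pass rate and a critical-failure flag into a verdict. The 90% bar and the blocking condition are placeholder assumptions; the authoritative rules are the ones defined in the playbook.

```python
# Hedged sketch of a SHIP / ITERATE / BLOCK verdict. The 0.90 bar and the
# critical-failure rule are illustrative placeholders, not the playbook's
# actual thresholds.
def verdict(pass_rate: float, has_critical_failure: bool) -> str:
    if has_critical_failure:      # e.g. a safety or grounding failure (assumed criterion)
        return "BLOCK"
    if pass_rate >= 0.90:         # illustrative shipping bar
        return "SHIP"
    return "ITERATE"              # fix top issues, re-run, then re-triage
```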
## When to use this skill vs. eval-triage-and-improvement
These two skills share the same triage framework but serve different modes of work:
| Use eval-result-interpreter when… | Use eval-triage-and-improvement when… |
|---|---|
| You have a CSV file or concrete results and want a one-shot structured report | You want interactive guidance walking through diagnosis step by step |
| This is your first look at results — you need a verdict and top actions fast | You are in an ongoing improvement loop — fixing, re-running, and re-triaging |
| You want a customer-deliverable artifact (the .docx triage report) | You need detailed remediation help for specific quality signals (e.g., "wrong tool fires — now what?") |
| The eval run is relatively straightforward (<20 failures) | You have many failures (15+) and need help prioritizing which to investigate |
| You need the activity map / result comparison tool recommendations inline | You need the playbook worked examples and deeper diagnostic walkthroughs |
## More from microsoft/eval-guide

### eval-generator
Generates eval test cases from an eval suite plan (output of /eval-suite-planner) or a plain-English agent description. Supports both single-response and conversation (multi-turn) evaluation modes. Outputs a Copilot Studio test set table, a CSV file for import (single-response only), and a docx report for human review.
### eval-faq
Answers AI agent evaluation methodology questions with practical, opinionated guidance grounded primarily in Microsoft's agent evaluation ecosystem (MS Learn, Eval Scenario Library, Triage & Improvement Playbook, Eval Guidance Kit) supplemented by select industry sources.
### eval-suite-planner
Produces a concrete eval suite plan grounded in Microsoft's Eval Scenario Library and MS Learn agent evaluation guidance — scenario types, evaluation methods, quality signals, thresholds, and priority order — before any test cases are generated or evals are run.
### eval-triage-and-improvement
Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
### eval-guide
Eval enablement accelerator — help customers think through "what does good look like" for their AI agent, then generate a structured eval plan and test cases they can use immediately. No running agent required. Works from a description, an idea, or even a vague goal. Use when anyone mentions agent evaluation, eval planning, "what should we test", "how do we know if the agent is good", test case generation, or interpreting eval results.