eval-faq

Installation

SKILL.md

Purpose

Answer any question about eval methodology, grader types, dataset design, criteria writing, non-determinism, tool-call evaluation, multi-turn agent evaluation, eval tooling, capability vs. regression evals, and interpreting results — specifically in the context of AI agent evaluation. Guidance is grounded primarily in Microsoft's agent evaluation documentation (MS Learn agent evaluation pages, the Eval Scenario Library, the Triage & Improvement Playbook, and the Eval Guidance Kit), supplemented by select industry sources for topics Microsoft does not cover deeply.

Instructions

When invoked as /eval-faq <question>, follow this process exactly:

Step 1 — Fetch authoritative context before answering

Use this topic-to-URL routing table to decide what to fetch. Fetch FIRST, then answer. Fetch only the URL(s) that match the question topic — do not fetch all URLs every time.

Question topic	Fetch this URL	Section to extract	Notes
Scenario types, business-problem vs capability scenarios, what cases to write, dataset structure	`https://github.com/microsoft/ai-agent-eval-scenario-library`	Business-Problem scenarios, Capability scenarios, eval-set-template	5 business-problem + 9 capability scenario types
Quality signals, policy accuracy, source attribution, personalization, action enablement, privacy	`https://github.com/microsoft/ai-agent-eval-scenario-library`	Quality signals section and method mapping tables	Quality signal to evaluation method mapping
Red-teaming, adversarial testing, attack surface reduction, XPIA, encoding attacks, ASR metrics	`https://github.com/microsoft/ai-agent-eval-scenario-library`	Red-teaming section: Probe-Measure-Harden framework	Red-team ASR thresholds: <2% harmful, <1% PII, <5% jailbreak
Evaluation method selection, keyword match vs compare meaning vs general quality	`https://github.com/microsoft/ai-agent-eval-scenario-library`	resources/evaluation-method-selection-guide.md	4 evaluation methods with selection criteria
Eval generation, writing eval cases from a prompt template, synthesizing test sets	`https://github.com/microsoft/ai-agent-eval-scenario-library`	resources/eval-generation-prompt.md	Template for generating eval cases