agentflow-evals

Agentflow Evals

Agentflow evals are offline workflow benchmarks. They run normal Agentflow graphs through local scenario environments and compare variants across repeated trials. Required criteria grade hard facts, quality criteria rate qualitative behavior, and trajectories are evaluated when useful. Tools are simulated deterministically, and results are written to auditable reports.
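To make "simulate tools deterministically" concrete, here is a minimal Python sketch of a fixture-backed tool. It is not Agentflow's API: `SimulatedTool`, `search_docs`, and the fixture shape are hypothetical names chosen for illustration. The idea is that identical arguments always return the same canned response, unknown calls fail loudly instead of reaching a live service, and every call is recorded so it can be audited later.

```python
import hashlib
import json


class SimulatedTool:
    """Hypothetical fixture-backed tool (not an Agentflow class)."""

    def __init__(self, name, fixtures):
        self.name = name
        # Map a stable hash of the arguments to a canned response.
        self._responses = {self._key(args): response for args, response in fixtures}
        # Every call is recorded so a report can audit what was requested.
        self.calls = []

    @staticmethod
    def _key(args):
        # Sorting keys makes the hash independent of argument order.
        encoded = json.dumps(args, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def __call__(self, **args):
        self.calls.append(args)
        key = self._key(args)
        if key not in self._responses:
            # Fail loudly rather than falling back to a live service.
            raise KeyError(f"no fixture for {self.name} with args {args}")
        return self._responses[key]


# Illustrative fixture: one query with one canned answer.
search_docs = SimulatedTool(
    "search_docs",
    fixtures=[({"query": "retry policy"}, {"hits": ["docs/retries.md"]})],
)
result = search_docs(query="retry policy")
```

Because the response depends only on the arguments, repeated trials differ only in what the model does, not in what the tool returns.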

Use this skill for agentflow eval. Use agentflow for graph authoring and run debugging. Use agentflow-plugins for plugin workflows and plugin-bundled tools.

Must Know

  • Evals are workflow tests for graphs, plugin workflows, prompt packs, supervisor recovery, tool behavior, and delivery auditability.
  • Evals do not change the graph contract; they run normal graphs in controlled scenario environments.
  • Scenarios should be realistic, local, reproducible, hard but solvable, and clear enough for two reviewers to grade the same way.
  • Required deterministic criteria own hard blockers. Quality criteria judge behavior and prompt feedback; they never excuse blockers.
  • Run repeated trials when model variance can change the outcome; a single trial cannot separate a real regression from noise.
  • Prefer local repos, local docs fixtures, tool fixtures, and deterministic simulation over live public services.
  • Capability suites can start below 100% pass rate. Regression gates should be stable and near 100%.
  • Do not call a suite ready until validate, a single trial, report, inspect, and compare produce useful artifacts.
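The grading rules above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Agentflow's implementation: `TrialResult`, `trial_passed`, `summarize`, `gate`, and the criterion names are all hypothetical. It shows required criteria acting as hard blockers that quality ratings cannot excuse, and a pass rate computed across repeated trials that a capability suite or a regression gate can compare against its own threshold.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TrialResult:
    required: dict[str, bool]  # deterministic hard facts, e.g. "tests_pass"
    quality: dict[str, float]  # qualitative ratings in [0, 1]


def trial_passed(trial):
    # Any failed required criterion fails the trial,
    # no matter how high the quality ratings are.
    return all(trial.required.values())


def summarize(trials):
    if not trials:
        raise ValueError("at least one trial is required")
    passed = [trial_passed(t) for t in trials]
    quality_scores = [mean(t.quality.values()) for t in trials if t.quality]
    return {
        "trials": len(trials),
        "pass_rate": sum(passed) / len(trials),
        "mean_quality": mean(quality_scores) if quality_scores else None,
    }


def gate(summary, min_pass_rate):
    # The threshold is a policy choice: lower for a capability suite,
    # near 1.0 for a regression gate.
    return summary["pass_rate"] >= min_pass_rate


# Illustrative data only: four trials of one variant.
trials = [
    TrialResult({"tests_pass": True, "report_written": True}, {"clarity": 0.8}),
    TrialResult({"tests_pass": False, "report_written": True}, {"clarity": 1.0}),
    TrialResult({"tests_pass": True, "report_written": True}, {"clarity": 0.6}),
    TrialResult({"tests_pass": True, "report_written": True}, {"clarity": 0.8}),
]
summary = summarize(trials)
```

In this made-up data the second trial has the highest quality rating but still fails, because a required criterion is false. The resulting pass rate of 0.75 would be acceptable for an early capability suite and would fail a gate set near 100%.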

Route By Task
