Agentflow Evals
Agentflow evals are offline workflow benchmarks. They run normal Agentflow graphs through local scenario environments and compare variants across repeated trials. Required criteria grade hard facts deterministically, quality criteria rate qualitative behavior, trajectories are evaluated when useful, tools are simulated deterministically, and every run writes auditable reports.
Use this skill for Agentflow evals. Use agentflow for graph authoring and run debugging, and agentflow-plugins for plugin workflows and plugin-bundled tools.
Must Know
- Evals are workflow tests for graphs, plugin workflows, prompt packs, supervisor recovery, tool behavior, and delivery auditability.
- Evals do not change the graph contract; they run normal graphs in controlled scenario environments.
- Scenarios should be realistic, local, reproducible, hard but solvable, and clear enough for two reviewers to grade the same way.
- Required deterministic criteria own hard blockers. Quality criteria judge behavior and prompt feedback; they never excuse blockers.
- Use repeated trials whenever model variance could change the outcome; a single trial cannot separate variance from regression.
- Prefer local repos, local docs fixtures, tool fixtures, and deterministic simulation over live public services.
- Capability suites can start below 100% pass rate. Regression gates should be stable and near 100%.
- Do not call a suite ready until validate, a single trial, report, inspect, and compare produce useful artifacts.
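The split between required and quality criteria above can be sketched in code. This is a hypothetical illustration, not Agentflow's actual eval API: the `TrialGrade` class, its fields, and `pass_rate` are invented names that only model the rule that quality scores never excuse a failed required criterion.

```python
# Hypothetical sketch of the grading split -- not Agentflow's real API.
# Required criteria are deterministic pass/fail gates; quality criteria
# produce ratings that inform prompt feedback but never flip a failure.
from dataclasses import dataclass, field


@dataclass
class TrialGrade:
    required: dict[str, bool]                                 # criterion -> deterministic check
    quality: dict[str, float] = field(default_factory=dict)   # criterion -> 0..1 rating

    @property
    def passed(self) -> bool:
        # A trial passes only if every required criterion holds.
        return all(self.required.values())


def pass_rate(trials: list[TrialGrade]) -> float:
    # Repeated trials: report the fraction of passing trials per variant.
    return sum(t.passed for t in trials) / len(trials)


grades = [
    TrialGrade(required={"report_written": True, "tests_green": True},
               quality={"clarity": 0.9}),
    TrialGrade(required={"report_written": True, "tests_green": False},
               quality={"clarity": 1.0}),  # high quality, still a hard blocker
]
print(pass_rate(grades))  # 0.5
```

Keeping the hard gate as a pure boolean function of deterministic checks is what makes the suite auditable: two reviewers grading the same trial must reach the same pass/fail verdict, with quality ratings reported alongside.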
Route By Task
Related skills from koji98/agentflow:
- agentflow: Use when authoring, validating, running, inspecting, or debugging supervised Agentflow graphs, managed patterns, plugin tools, delivery packages, supervisor interventions, or Codex/Cursor harness behavior.
- agentflow-plugins: Use when creating, reviewing, resolving, or consuming Agentflow plugin workflows or plugin-bundled CLI tools, including workflow manifests, lockfiles, tool config, and credential policy.
- agentflow-run-debugging: Inspect, explain, and debug Agentflow runs. Use when a run failed, resumed unexpectedly, or needs artifact-level diagnosis; when tracing state.json, events.jsonl, execution logs, context packets, or execution artifacts; or when deciding why passed work did or did not preserve on resume.
- agentflow-graph-authoring: Design, review, and refine Agentflow execution graphs. Use when authoring or editing Agentflow graph JSON, choosing between primitive nodes and managed workflows, or checking topology, profiles, context flow, outputs, and validation against the shipped runtime contract.
- agentflow-managed-workflows: Author and review Agentflow managed workflows. Use when choosing between deep_research, spec_design, execute_spec, and review_change, or when filling their brief, context_policy, approval_policy, strategy, delivery, and runtime fields.
- agentflow-grill-me: Use when the user wants to be grilled, interviewed, pressure-tested, or questioned before creating an Agentflow plan, graph, workflow, feature design, or implementation plan.