Agentflow Evals
Agentflow evals are offline workflow benchmarks. They run normal Agentflow graphs through local scenario environments and compare variants across repeated trials. Required criteria grade hard facts deterministically, quality criteria rate qualitative behavior, trajectories are evaluated when useful, tools are simulated deterministically, and every run writes auditable reports.
Use this skill for Agentflow evals. Use agentflow for graph authoring and run debugging, and agentflow-plugins for plugin workflows and plugin-bundled tools.
Must Know
- Evals are workflow tests for graphs, plugin workflows, prompt packs, supervisor recovery, tool behavior, and delivery auditability.
- Evals do not change the graph contract; they run normal graphs in controlled scenario environments.
- Scenarios should be realistic, local, reproducible, hard but solvable, and clear enough for two reviewers to grade the same way.
- Required deterministic criteria own hard blockers. Quality criteria judge behavior and prompt feedback; they never excuse blockers.
- Use repeated trials whenever model variance could change the outcome; a single trial cannot separate variance from regression.
- Prefer local repos, local docs fixtures, tool fixtures, and deterministic simulation over live public services.
- Capability suites can start below 100% pass rate. Regression gates should be stable and near 100%.
- Do not call a suite ready until validate, a single trial, report, inspect, and compare produce useful artifacts.
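The split between required and quality criteria above can be sketched in code. This is a hypothetical illustration, not Agentflow's actual eval API: the `TrialGrade` class, its fields, and `pass_rate` are invented names that only model the rule that quality scores never excuse a failed required criterion.

```python
# Hypothetical sketch of the grading split -- not Agentflow's real API.
# Required criteria are deterministic pass/fail gates; quality criteria
# produce ratings that inform prompt feedback but never flip a failure.
from dataclasses import dataclass, field


@dataclass
class TrialGrade:
    required: dict[str, bool]                                 # criterion -> deterministic check
    quality: dict[str, float] = field(default_factory=dict)   # criterion -> 0..1 rating

    @property
    def passed(self) -> bool:
        # A trial passes only if every required criterion holds.
        return all(self.required.values())


def pass_rate(trials: list[TrialGrade]) -> float:
    # Repeated trials: report the fraction of passing trials per variant.
    return sum(t.passed for t in trials) / len(trials)


grades = [
    TrialGrade(required={"report_written": True, "tests_green": True},
               quality={"clarity": 0.9}),
    TrialGrade(required={"report_written": True, "tests_green": False},
               quality={"clarity": 1.0}),  # high quality, still a hard blocker
]
print(pass_rate(grades))  # 0.5
```

Keeping the hard gate as a pure boolean function of deterministic checks is what makes the suite auditable: two reviewers grading the same trial must reach the same pass/fail verdict, with quality ratings reported alongside.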
Route By Task
Related skills from koji98/agentflow:
- agentflow: Use when authoring, validating, running, inspecting, or debugging supervised Agentflow graphs, managed patterns, plugin tools, delivery packages, supervisor interventions, or Codex/Cursor harness behavior.
- agentflow-plugins: Use when creating, reviewing, resolving, or consuming Agentflow plugin workflows or plugin-bundled CLI tools, including workflow manifests, lockfiles, tool config, and credential policy.
- agentflow-run-debugging: Inspect, explain, and debug Agentflow runs. Use when a run failed, resumed unexpectedly, or needs artifact-level diagnosis; when tracing state.json, events.jsonl, execution logs, context packets, or execution artifacts; or when deciding why passed work did or did not preserve on resume.
- agentflow-graph-authoring: Design, review, and refine Agentflow execution graphs. Use when authoring or editing Agentflow graph JSON, choosing between primitive nodes and managed workflows, or checking topology, profiles, context flow, outputs, and validation against the shipped runtime contract.
- agentflow-managed-workflows: Author and review Agentflow managed workflows. Use when choosing between deep_research, spec_design, execute_spec, and review_change, or when filling their brief, context_policy, approval_policy, strategy, delivery, and runtime fields.
- agentflow-grill-me: Use when the user wants to be grilled, interviewed, pressure-tested, or questioned before creating an Agentflow plan, graph, workflow, feature design, or implementation plan.