eval
EvalKit
Overview
EvalKit is a conversational evaluation framework for AI agents that guides you through creating robust evaluations using the Strands Evals SDK. Through natural conversation, you can plan evaluations, generate test data, execute evaluations, and analyze results.
How Users Interact with EvalKit
Users interact with EvalKit in natural language, with requests such as:
- "Build an evaluation plan for my QA agent at /path/to/agent"
- "Generate test cases focusing on edge cases"
- "Run the evaluation and show me the results"
- "Analyze the evaluation results and suggest improvements"
EvalKit understands the evaluation workflow and guides users through four phases: Plan, Data, Eval, and Report.
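The four-phase ordering can be pictured with a small sketch. This is only an illustration of the workflow described above, not EvalKit's or the Strands Evals SDK's actual API: the `Phase` enum, the `next_phase` helper, and the pairing of the example requests with phases are all assumptions made here for clarity.

```python
from enum import Enum
from typing import Optional


class Phase(Enum):
    """The four workflow phases named in the overview (hypothetical type)."""
    PLAN = "Plan"
    DATA = "Data"
    EVAL = "Eval"
    REPORT = "Report"


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last phase."""
    phases = list(Phase)  # Enum iteration preserves definition order
    idx = phases.index(current)
    return phases[idx + 1] if idx + 1 < len(phases) else None


# Assumed pairing of the example requests with phases, for illustration only.
EXAMPLE_REQUESTS = {
    "Build an evaluation plan for my QA agent at /path/to/agent": Phase.PLAN,
    "Generate test cases focusing on edge cases": Phase.DATA,
    "Run the evaluation and show me the results": Phase.EVAL,
    "Analyze the evaluation results and suggest improvements": Phase.REPORT,
}

if __name__ == "__main__":
    for request, phase in EXAMPLE_REQUESTS.items():
        following = next_phase(phase)
        label = following.value if following else "done"
        print(f"{phase.value:<6} -> {label:<6} | {request}")
```

In practice EvalKit handles this routing conversationally; the sketch only makes the Plan, Data, Eval, Report sequence explicit.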
Evaluation Workflow