agent-evaluation
Agent Evaluation Skill
Evaluate AI agent task execution using LLM-as-judge patterns from the DeepEval, RAGAS, and G-Eval frameworks.
Output Format
Evaluation results are saved to `evals/results/eval-${yyyy-mm-dd-hh-mm}-${commit_id}.md`.
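The naming scheme above can be sketched in a few lines of Python. This is only an illustration, not the skill's own code: the helper name `result_path` is hypothetical, and it assumes the timestamp is zero-padded local time and that `commit_id` is whatever commit hash string the caller supplies (short or full).

```python
from datetime import datetime
from pathlib import Path


def result_path(commit_id: str, now: datetime, root: str = "evals/results") -> Path:
    """Build a results path of the form eval-<yyyy-mm-dd-hh-mm>-<commit_id>.md.

    Hypothetical helper: the skill only documents the file name pattern,
    not how it is produced.
    """
    # Zero-padded year-month-day-hour-minute, joined with hyphens.
    stamp = now.strftime("%Y-%m-%d-%H-%M")
    return Path(root) / f"eval-{stamp}-{commit_id}.md"


if __name__ == "__main__":
    # Example with a made-up commit id and the current time.
    print(result_path("abc1234", datetime.now()).as_posix())
```

Passing the timestamp in as an argument, rather than reading the clock inside the function, keeps the helper deterministic and easy to test.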
Results Table
| Task Input | Agent Output | Reflection Input | Reflection Output | Score | Verdict | Feedback |
|---|---|---|---|---|---|---|
| Create hello.js... | I've created hello.js with... | Task: Create hello.js Agent Output: ... | Task complete | 5/5 | COMPLETE | Agent produced output; Found completion indicators |
| Fix the bug... | I found the issue and... | Task: Fix bug Agent Output: ... | (none) | 3/5 | PARTIAL | Agent produced output; Missing reflection |
Run Evaluation