# Behavioral Evals
## Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
> [!NOTE]
> **Single Source of Truth:** For core concepts, policies, running tests, and general best practices, always refer to `evals/README.md`.
## 🔄 Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
  - Yes -> Use `appEvalTest(AppRig)`. See `creating.md`.
  - No -> Use `evalTest(TestRig)`. See `creating.md`.
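To make the decision tree concrete, here is a minimal, self-contained sketch of what a behavioral eval asserting tool choice looks like. The real `evalTest` and `TestRig` live in the repo's evals framework (see `creating.md` and `evals/README.md`); the signatures, the `ToolCall` type, and the stubbed agent response below are hypothetical stand-ins for illustration only.

```typescript
// Hypothetical sketch: the real evalTest/TestRig API may differ.
type ToolCall = { name: string; args: Record<string, unknown> };

class TestRig {
  // Stub: a real rig would run the agent against the prompt and
  // record the tool calls it actually made.
  async run(prompt: string): Promise<ToolCall[]> {
    return [{ name: "read_file", args: { path: "README.md" } }];
  }
}

function evalTest(
  name: string,
  fn: (rig: TestRig) => Promise<void>,
): Promise<void> {
  // Stub runner: a real harness would register and schedule the test.
  return fn(new TestRig()).then(() => console.log(`PASS: ${name}`));
}

// A behavioral eval validates the agent's decision (which tool it
// chose), not the tool's own functionality.
evalTest("chooses read_file for a file question", async (rig) => {
  const calls = await rig.run("What does README.md say about installs?");
  if (!calls.some((c) => c.name === "read_file")) {
    throw new Error("expected the agent to call read_file");
  }
});
```

The key pattern is asserting on the *sequence of tool calls* the rig recorded, rather than on final output text, which is what makes these tests useful for catching prompt-change regressions.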