review-llm-annotations-and-improve-prompt
Improve Metric
Analyze human review annotations against machine-generated scores for a single text LLM judge metric. Calculate agreement, identify patterns in disagreements, and propose an improved metric prompt based on transcript analysis and reviewer feedback.
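The agreement calculation described above could be sketched as follows. The source does not say which agreement statistic the skill uses, so raw percent agreement and Cohen's kappa are both shown here as assumptions; the function names and the example labels are hypothetical and are not part of the Coval CLI or API.

```python
from collections import Counter


def percent_agreement(human, machine):
    """Fraction of items where the human label equals the machine label."""
    if len(human) != len(machine):
        raise ValueError("human and machine label lists must have equal length")
    if not human:
        raise ValueError("at least one annotated item is required")
    matches = sum(h == m for h, m in zip(human, machine))
    return matches / len(human)


def cohens_kappa(human, machine):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(human)
    observed = percent_agreement(human, machine)
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    expected = sum(
        human_counts[label] * machine_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


def disagreement_indices(human, machine):
    """Positions where the two label lists differ, for later pattern review."""
    return [i for i, (h, m) in enumerate(zip(human, machine)) if h != m]


# Hypothetical labels for a binary metric, for illustration only.
human_labels = ["YES", "YES", "NO", "NO"]
machine_labels = ["YES", "NO", "NO", "NO"]
print(percent_agreement(human_labels, machine_labels))
print(cohens_kappa(human_labels, machine_labels))
print(disagreement_indices(human_labels, machine_labels))
```

Kappa is included because raw agreement can look high when one label dominates; for numerical metrics a tolerance-based comparison or a correlation may be a better fit than exact-match agreement.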
Critical Rules
- NEVER update the metric prompt without explicit user approval. Always present the proposed changes and wait for confirmation before running coval metrics update.
- Only operate on one metric at a time. If the user provides a project with multiple metrics, ask which one to focus on.
- Only support text LLM judge metrics. Valid types: METRIC_LLM_BINARY, METRIC_CATEGORICAL, METRIC_NUMERICAL_LLM_JUDGE. Reject audio metrics, toolcall, metadata, regex, pause, and composite types.
- NEVER fabricate data. If a CLI command or API call fails, report the error; do not invent scores or transcripts.
- Use the CLI first, fall back to curl only when the CLI does not return required fields (e.g., the LLM explanation under result.llm.answer_explanation).
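The metric-type rule above can be expressed as a small guard that runs before any analysis. This is a minimal sketch: the three type names come from the rules, while the function name, the error message, and the way the type string is obtained are assumptions rather than Coval behavior.

```python
# The three text LLM judge metric types named in the rules above.
SUPPORTED_METRIC_TYPES = frozenset(
    {
        "METRIC_LLM_BINARY",
        "METRIC_CATEGORICAL",
        "METRIC_NUMERICAL_LLM_JUDGE",
    }
)


def check_metric_type(metric_type):
    """Return the type unchanged if supported, otherwise raise ValueError.

    Hypothetical helper: how the metric type is fetched is outside this sketch.
    """
    if metric_type not in SUPPORTED_METRIC_TYPES:
        supported = ", ".join(sorted(SUPPORTED_METRIC_TYPES))
        raise ValueError(
            f"Unsupported metric type {metric_type!r}; supported types: {supported}"
        )
    return metric_type
```

Using an allowlist rather than a list of rejected types means any metric type not named in the rules is refused by default.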
Usage
/review-llm-annotations-and-improve-prompt <project_id> <metric_id>
Both arguments are required. If the user omits one, ask for it.