review-llm-annotations-and-improve-prompt

Improve Metric

Analyze human review annotations against machine-generated scores for a single text LLM judge metric. Calculate agreement, identify patterns in disagreements, and propose an improved metric prompt based on transcript analysis and reviewer feedback.
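The agreement step can be as simple as comparing each human annotation with the corresponding machine score and tabulating matches and mismatches. Below is a minimal sketch of that analysis; the row shape and the field names (transcript_id, human_label, llm_label) are assumptions for illustration, not the skill's actual data model.

```python
from collections import Counter

def summarize_agreement(rows):
    """Compare human annotations with LLM-judge labels for one metric.

    Each row is assumed to look like
    {"transcript_id": "...", "human_label": "pass", "llm_label": "fail"}
    (hypothetical field names, for illustration only).
    """
    total = len(rows)
    disagreements = [r for r in rows if r["human_label"] != r["llm_label"]]
    agreement_rate = (total - len(disagreements)) / total if total else 0.0

    # Tally which (human, llm) label pairs disagree most often; these pairs
    # are usually where the patterns worth feeding back into the prompt appear.
    pair_counts = Counter((r["human_label"], r["llm_label"]) for r in disagreements)

    return {
        "total": total,
        "agreement_rate": agreement_rate,
        "disagreement_pairs": pair_counts.most_common(),
        "disagreeing_transcripts": [r["transcript_id"] for r in disagreements],
    }
```

The disagreeing transcript IDs are the ones worth reading in full, alongside the reviewer feedback, before proposing any prompt change.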

Critical Rules

  1. NEVER update the metric prompt without explicit user approval. Always present the proposed changes and wait for confirmation before running coval metrics update.
  2. Only operate on one metric at a time. If the user provides a project with multiple metrics, ask which one to focus on.
  3. Only support text LLM judge metrics. Valid types: METRIC_LLM_BINARY, METRIC_CATEGORICAL, METRIC_NUMERICAL_LLM_JUDGE. Reject audio, toolcall, metadata, regex, pause, and composite metric types (a validation sketch follows this list).
  4. NEVER fabricate data. If a CLI command or API call fails, report the error — do not invent scores or transcripts.
  5. Use the CLI first; fall back to curl only when the CLI does not return required fields (for example, the LLM explanation under result.llm.answer_explanation). See the fallback sketch after this list.
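
As a sketch of rule 3, a supported-type check could look like the following. The metric object shape (a dict with a "type" key) is an assumption; only the three type names come from this skill's description.

```python
SUPPORTED_TYPES = {
    "METRIC_LLM_BINARY",
    "METRIC_CATEGORICAL",
    "METRIC_NUMERICAL_LLM_JUDGE",
}

def assert_supported(metric: dict) -> None:
    # `metric` is assumed to be a dict with a "type" field (illustrative shape only).
    metric_type = metric.get("type")
    if metric_type not in SUPPORTED_TYPES:
        raise ValueError(
            f"Unsupported metric type {metric_type!r}; "
            "this skill only handles text LLM judge metrics."
        )
```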

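When the fallback in rule 5 is needed, the request might look roughly like the line below. The host, endpoint path, and environment variable name are placeholders and almost certainly differ from the real Coval API; only the result.llm.answer_explanation field comes from this document.

curl -s -H "Authorization: Bearer $COVAL_API_KEY" "https://<coval-api-host>/<metric-results-endpoint>" | jq '.result.llm.answer_explanation'
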
Usage

/review-llm-annotations-and-improve-prompt <project_id> <metric_id>

Both arguments are required. If the user omits one, ask for it.
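
For example (the IDs below are made-up placeholders, not real project or metric IDs):

/review-llm-annotations-and-improve-prompt proj_123 metric_456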
