aws-bedrock-evals


Amazon Bedrock Evaluation Jobs


Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
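To make the judge's output concrete, here is a minimal sketch of how per-pair results could be held and summarized once you have them. The `JudgeResult` class, the `mean_score_by_metric` helper, and the metric names are all hypothetical and chosen for illustration; this is not Bedrock's actual output schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgeResult:
    """One judge verdict for one prompt-response pair (hypothetical shape)."""
    metric: str      # illustrative metric name, not a Bedrock identifier
    score: float     # numeric score assigned by the judge
    reasoning: str   # the judge's written explanation


def mean_score_by_metric(results):
    """Group scores by metric and return the average for each metric."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.metric].append(result.score)
    return {metric: mean(scores) for metric, scores in grouped.items()}


# Made-up results, for illustration only.
results = [
    JudgeResult("correctness", 1.0, "Answer matches the reference."),
    JudgeResult("correctness", 0.5, "Partially correct; one detail is missing."),
    JudgeResult("helpfulness", 0.75, "Useful, but more verbose than needed."),
]
summary = mean_score_by_metric(results)
```

Keeping the reasoning next to each score makes it easier to spot-check why a low-scoring pair was marked down.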

Pre-computed Inference vs Live Inference

| Mode | How it works | Use when |
| --- | --- | --- |
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |

Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
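For pre-computed inference, the collected responses need to be serialized as one JSON object per line. The sketch below assumes the record layout is a `prompt` field plus a `modelResponses` list whose entries carry `response` and `modelIdentifier`; treat those field names as an assumption and check them against the current Bedrock dataset documentation before use. The helper names and the `my-agent-v1` identifier are hypothetical.

```python
import json


def to_eval_record(prompt, response, model_identifier, reference=None):
    """Build one dataset record (field names are assumed, see lead-in)."""
    record = {
        "prompt": prompt,
        "modelResponses": [
            {"response": response, "modelIdentifier": model_identifier}
        ],
    }
    # Only include a reference answer when one was collected.
    if reference is not None:
        record["referenceResponse"] = reference
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Responses captured from your own application (made-up examples).
collected = [
    {"prompt": "What is the weather in Paris?",
     "response": "It is 18 C and cloudy in Paris."},
    {"prompt": "Book a table for two at 7pm.",
     "response": "Your table for two is booked for 7pm."},
]
records = [
    to_eval_record(c["prompt"], c["response"], "my-agent-v1")
    for c in collected
]
jsonl_text = to_jsonl(records)
```

Writing `jsonl_text` to a file gives you the dataset to supply to the evaluation job. For tool-calling or multi-turn flows, `response` would hold the final text your application produced after its own orchestration finished.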

Pipeline

First Seen
Feb 11, 2026
aws-bedrock-evals — antstackio/skills