llm-evaluation

Summary

Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing.

  • Covers three evaluation approaches: automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall), human evaluation across dimensions such as accuracy and coherence, and LLM-as-Judge for pointwise, pairwise, and reference-based scoring (a judge sketch follows this list)
  • Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom metric support
  • Provides an A/B testing framework with statistical significance testing, effect size calculation, and regression detection to catch performance drops before deployment (a significance-test sketch follows this list)
  • Integrates with LangSmith for dataset management and experiment tracking, plus benchmarking utilities for tracking progress over time
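
To make the LLM-as-Judge bullet concrete, here is a minimal pairwise-judging sketch. It assumes the `openai` Python package and an `OPENAI_API_KEY` in the environment; the model name, judge prompt, and `judge_pairwise` helper are illustrative, not part of this skill's shipped code.

```python
# Hedged sketch of pairwise LLM-as-Judge scoring. Assumes the `openai` package
# and an OPENAI_API_KEY in the environment; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one of: A, B, or TIE."""


def judge_pairwise(question: str, answer_a: str, answer_b: str,
                   model: str = "gpt-4o-mini") -> str:
    """Ask a judge model which of two candidate answers is better."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # deterministic judging reduces noise across runs
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Randomizing which answer appears as A versus B across examples is a common follow-up, since judge models can show position bias.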
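For the A/B testing bullet, below is a minimal sketch of a paired bootstrap significance test over per-example scores, assuming both variants were scored on the same evaluation set; the resample count and the `paired_bootstrap` helper are illustrative defaults, not the skill's actual implementation.

```python
# Sketch of a paired bootstrap test for comparing two prompt/model variants
# scored on the same evaluation set. Pure Python; the 10_000 resamples are
# an illustrative default.
import random


def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Return (mean improvement of B over A, one-sided p-value that B is not better)."""
    assert len(scores_a) == len(scores_b), "scores must be paired per example"
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    worse = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample paired differences
        if sum(sample) / len(sample) <= 0:
            worse += 1
    return observed, worse / n_resamples


if __name__ == "__main__":
    a = [0.7, 0.6, 0.8, 0.5, 0.9]  # per-example scores for variant A
    b = [0.8, 0.7, 0.8, 0.6, 0.9]  # per-example scores for variant B
    delta, p = paired_bootstrap(a, b)
    print(f"mean improvement: {delta:.3f}, p-value: {p:.3f}")
```

A p-value near zero suggests the improvement is unlikely to be noise; the skill's framework layers effect sizes and regression thresholds on top of tests like this.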
SKILL.md

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.
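As a hedged sketch of what such computed scores look like, the snippet below implements exact match and token-level F1 in pure Python; the normalization rules and function names are illustrative, and libraries such as `evaluate` or `rouge-score` provide production-grade BLEU, ROUGE, and BERTScore implementations.

```python
# Minimal sketch of automated metrics: exact match and token-level F1,
# two common computed scores for short-answer evaluation. Pure Python,
# no external dependencies; the normalization rule is illustrative.
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split into tokens; real pipelines often also strip punctuation."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Paris", "paris"))  # 1.0
    print(token_f1("The capital is Paris", "Paris is the capital of France"))  # 0.8
```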

Installs: 6.8K
Repository: wshobson/agents
GitHub Stars: 35.3K
First Seen: Jan 20, 2026