llm-evaluation

Summary

Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing.

  • Covers three evaluation approaches: automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall), human evaluation across dimensions such as accuracy and coherence, and LLM-as-Judge for pointwise, pairwise, and reference-based scoring (a judge sketch follows this list)
  • Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom metric support
  • Provides an A/B testing framework with statistical significance testing, effect size calculation, and regression detection to catch performance drops before deployment (a significance-test sketch follows this list)
  • Integrates with LangSmith for dataset management and experiment tracking, plus benchmarking utilities for tracking progress over time
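
To make the LLM-as-Judge bullet concrete, here is a minimal pairwise-judging sketch. It assumes the `openai` Python package and an `OPENAI_API_KEY` in the environment; the model name, judge prompt, and `judge_pairwise` helper are illustrative, not part of this skill's shipped code.

```python
# Hedged sketch of pairwise LLM-as-Judge scoring. Assumes the `openai` package
# and an OPENAI_API_KEY in the environment; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Reply with exactly one of: A, B, or TIE."""


def judge_pairwise(question: str, answer_a: str, answer_b: str,
                   model: str = "gpt-4o-mini") -> str:
    """Ask a judge model which of two candidate answers is better."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # deterministic judging reduces noise across runs
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

Randomizing which answer appears as A versus B across examples is a common follow-up, since judge models can show position bias.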
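For the A/B testing bullet, below is a minimal sketch of a paired bootstrap significance test over per-example scores, assuming both variants were scored on the same evaluation set; the resample count and the `paired_bootstrap` helper are illustrative defaults, not the skill's actual implementation.

```python
# Sketch of a paired bootstrap test for comparing two prompt/model variants
# scored on the same evaluation set. Pure Python; the 10_000 resamples are
# an illustrative default.
import random


def paired_bootstrap(scores_a: list[float], scores_b: list[float],
                     n_resamples: int = 10_000, seed: int = 0) -> tuple[float, float]:
    """Return (mean improvement of B over A, one-sided p-value that B is not better)."""
    assert len(scores_a) == len(scores_b), "scores must be paired per example"
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    worse = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample paired differences
        if sum(sample) / len(sample) <= 0:
            worse += 1
    return observed, worse / n_resamples


if __name__ == "__main__":
    a = [0.7, 0.6, 0.8, 0.5, 0.9]  # per-example scores for variant A
    b = [0.8, 0.7, 0.8, 0.6, 0.9]  # per-example scores for variant B
    delta, p = paired_bootstrap(a, b)
    print(f"mean improvement: {delta:.3f}, p-value: {p:.3f}")
```

A p-value near zero suggests the improvement is unlikely to be noise; the skill's framework layers effect sizes and regression thresholds on top of tests like this.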
SKILL.md

LLM Evaluation

Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.

When to Use This Skill

  • Measuring LLM application performance systematically
  • Comparing different models or prompts
  • Detecting performance regressions before deployment
  • Validating improvements from prompt changes
  • Building confidence in production systems
  • Establishing baselines and tracking progress over time
  • Debugging unexpected model behavior

Core Evaluation Types

1. Automated Metrics

Fast, repeatable, scalable evaluation using computed scores.
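As a hedged sketch of what such computed scores look like, the snippet below implements exact match and token-level F1 in pure Python; the normalization rules and function names are illustrative, and libraries such as `evaluate` or `rouge-score` provide production-grade BLEU, ROUGE, and BERTScore implementations.

```python
# Minimal sketch of automated metrics: exact match and token-level F1,
# two common computed scores for short-answer evaluation. Pure Python,
# no external dependencies; the normalization rule is illustrative.
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split into tokens; real pipelines often also strip punctuation."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Paris", "paris"))  # 1.0
    print(token_f1("The capital is Paris", "Paris is the capital of France"))  # 0.8
```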

Installs: 6.8K
Repository: wshobson/agents
GitHub Stars: 35.3K
First Seen: Jan 20, 2026