llm-evaluation-guide


LLM Evaluation Guide

A skill for evaluating and benchmarking large language models (LLMs) in research settings. Covers automatic metrics, human evaluation protocols, benchmark suites, evaluation pitfalls, and best practices for reporting LLM performance.

Evaluation Taxonomy

Types of Evaluation

1. Intrinsic evaluation:
   Measures the quality of the model's own predictions, independent of any downstream task
   - Perplexity, likelihood, calibration
   - Useful for comparing architectures and training procedures

2. Extrinsic evaluation:
   Measures model quality on downstream tasks
   - Task-specific benchmarks (QA, summarization, classification)
   - Closer to real-world usefulness
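
The two categories above can be illustrated with one metric each. The following is a minimal sketch, not a reference implementation: the function names `perplexity` and `exact_match_accuracy` are hypothetical, the log-probabilities are assumed to be natural logs already extracted from a model, and the whitespace and case normalization used for exact match is an assumed convention that real benchmarks may define differently.

```python
import math


def perplexity(token_logprobs):
    """Intrinsic metric: exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def _normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions, references):
    """Extrinsic metric: fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    hits = sum(_normalize(p) == _normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    print(perplexity([math.log(0.25)] * 4))
    # One of the two QA answers matches its reference after normalization.
    print(exact_match_accuracy(["Paris", "london "], ["paris", "Berlin"]))
```

Lower perplexity means the model assigns higher probability to the held-out text, which makes it convenient for comparing architectures or training runs on the same tokenization. Exact match is only one possible task metric; it is shown here because it is simple to verify by hand.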
First Seen
Apr 2, 2026
llm-evaluation-guide — wentorai/research-plugins