llm-evaluation
Systematic evaluation of LLM applications using automated metrics, human feedback, and statistical testing.
- Covers three evaluation approaches: automated metrics (BLEU, ROUGE, BERTScore, accuracy, precision/recall), human evaluation across dimensions like accuracy and coherence, and LLM-as-Judge for pointwise, pairwise, and reference-based scoring
- Includes implementations for text generation, classification, and retrieval (RAG) evaluation with ready-to-use metric functions and custom metric support
- Provides A/B testing framework with statistical significance testing, effect size calculation, and regression detection to catch performance drops before deployment
- Integrates with LangSmith for dataset management and experiment tracking, plus benchmarking utilities for tracking progress over time
LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
When to Use This Skill
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
More from wshobson/agents
tailwind-design-system
Build scalable design systems with Tailwind CSS v4, design tokens, component libraries, and responsive patterns. Use when creating component libraries, implementing design systems, or standardizing UI patterns.
41.0Ktypescript-advanced-types
Master TypeScript's advanced type system including generics, conditional types, mapped types, template literals, and utility types for building type-safe applications. Use when implementing complex type logic, creating reusable type utilities, or ensuring compile-time type safety in TypeScript projects.
40.5Knodejs-backend-patterns
Build production-ready Node.js backend services with Express/Fastify, implementing middleware patterns, error handling, authentication, database integration, and API design best practices. Use when creating Node.js servers, REST APIs, GraphQL backends, or microservices architectures.
31.8Kpython-performance-optimization
Profile and optimize Python code using cProfile, memory profilers, and performance best practices. Use when debugging slow Python code, optimizing bottlenecks, or improving application performance.
22.1Kapi-design-principles
Master REST and GraphQL API design principles to build intuitive, scalable, and maintainable APIs that delight developers. Use when designing new APIs, reviewing API specifications, or establishing API design standards.
20.3Kpython-testing-patterns
Implement comprehensive testing strategies with pytest, fixtures, mocking, and test-driven development. Use when writing Python tests, setting up test suites, or implementing testing best practices.
19.7K