ai-evals
AI Evals
Scope
Covers
- Designing evaluations (“evals”) for LLM/AI features as an execution contract: what “good” means and how it’s measured
- Converting failures into a golden test set + error taxonomy + rubric
- Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
- Producing decision-ready results and an iteration loop (every bug becomes a new test)
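The items above can be sketched as a small harness: a golden set whose cases carry a rubric and an error-taxonomy tag, an automated check acting as the judge, and a summary of failures per tag. This is a minimal sketch under stated assumptions, not a prescribed implementation. `GoldenCase`, `contains_all`, `run_eval`, and the sample cases are hypothetical names and data introduced for illustration, and `fake_model` stands in for a real LLM call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry (hypothetical schema)."""
    case_id: str
    prompt: str
    must_include: tuple[str, ...]  # rubric: phrases the answer must contain
    error_tag: str                 # taxonomy bucket this case guards against


def contains_all(output: str, required: tuple[str, ...]) -> bool:
    """Automated check: every required phrase appears (case-insensitive)."""
    lowered = output.lower()
    return all(phrase.lower() in lowered for phrase in required)


def run_eval(model: Callable[[str], str], cases: list[GoldenCase]) -> dict:
    """Run every case; report the pass rate and failure counts per error tag."""
    if not cases:
        raise ValueError("golden set is empty")
    failures: Counter = Counter()
    passed = 0
    for case in cases:
        if contains_all(model(case.prompt), case.must_include):
            passed += 1
        else:
            failures[case.error_tag] += 1
    return {"pass_rate": passed / len(cases), "failures_by_tag": dict(failures)}


# Illustrative data only: two made-up cases and canned model answers.
GOLDEN_SET = [
    GoldenCase("refund-01", "What is the refund window?",
               ("30 days",), "missing_fact"),
    GoldenCase("refund-02", "Can I get a refund after 60 days?",
               ("not eligible",), "wrong_policy"),
]

CANNED_ANSWERS = {
    "What is the refund window?":
        "Refunds are accepted within 30 days of purchase.",
    "Can I get a refund after 60 days?":
        "Yes, refunds are always available.",
}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return CANNED_ANSWERS[prompt]


report = run_eval(fake_model, GOLDEN_SET)
print(report)
```

When a new bug is found, it would be added to `GOLDEN_SET` as another `GoldenCase` with the matching `error_tag`, so the same run catches it next time. A human or LLM-as-judge scorer could replace `contains_all` as long as it returns a pass/fail verdict per case.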
When to use
- “Design evals for this LLM feature so we can ship with confidence.”
- “Create a rubric + golden set + benchmark for our AI assistant/copilot.”
- “We’re seeing flaky quality—do error analysis and turn it into a repeatable eval.”
- “Compare prompts/models safely with a clear acceptance threshold.”
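Comparing two prompt or model variants against an acceptance threshold might look like the sketch below. The `decide` function, the threshold value, and the per-case results are all hypothetical. The extra rule that a candidate must not fail any case the baseline passed is an assumption added for illustration, not something this skill specifies.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


def decide(baseline: dict[str, bool],
           candidate: dict[str, bool],
           threshold: float) -> dict:
    """Accept the candidate only if it meets the threshold and regresses on no case."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both variants must be run on the same golden set")
    # A regression is a case the baseline passed and the candidate failed.
    regressions = sorted(
        case_id for case_id, ok in baseline.items()
        if ok and not candidate[case_id]
    )
    rate = pass_rate(candidate)
    return {
        "candidate_pass_rate": rate,
        "regressions": regressions,
        "ship": rate >= threshold and not regressions,
    }


# Illustrative per-case verdicts (made up): True means the case passed.
baseline = {"c1": True, "c2": False, "c3": True, "c4": True}
candidate = {"c1": True, "c2": True, "c3": True, "c4": False}

result = decide(baseline, candidate, threshold=0.75)
print(result)
```

In this made-up run both variants pass three of four cases, so the candidate meets the 0.75 threshold on pass rate alone, but it breaks `c4`, which the baseline handled. Checking regressions per case is one way to keep an aggregate score from hiding that trade.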