exploring-llm-evaluations
Originally from posthog/skills
Exploring AI observability evaluations
PostHog evaluations score $ai_generation events. Each evaluation is one of three
types:
- hog: deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible: no LLM call per run. Prefer this when the criterion can be expressed as code.
- llm_judge: an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.
- sentiment: classifies sentiment from user messages on each matching generation. Returns a sentiment label and score, not a pass/fail verdict.
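To make the "can be expressed as code" criterion concrete, here is a minimal sketch of a deterministic rule-based check. It is written in Python purely for illustration: it is not Hog syntax and does not use any PostHog API. The function name, the MAX_CHARS limit, and the required "answer" key are all hypothetical. It mirrors the true/false/N/A contract by returning True, False, or None.

```python
import json
from typing import Optional

MAX_CHARS = 2000  # hypothetical length limit, chosen only for this example


def check_generation(output_text: Optional[str]) -> Optional[bool]:
    """Return True (pass), False (fail), or None (N/A) for one generation."""
    # No output to inspect: report N/A instead of forcing a verdict.
    if output_text is None or output_text.strip() == "":
        return None

    # Length guard: overly long outputs fail immediately.
    if len(output_text) > MAX_CHARS:
        return False

    # Format validation: the output must parse as JSON.
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return False

    # Structural assertion: expect an object with an "answer" key (hypothetical schema).
    return isinstance(parsed, dict) and "answer" in parsed
```

A check like this gives the same verdict every time for the same input and needs no model call, which is why criteria of this shape suit the hog type. Criteria such as tone or helpfulness cannot be reduced to rules like these and are better left to llm_judge.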