exploring-llm-evaluations

Originally from posthog/skills

Exploring AI observability evaluations

PostHog evaluations score $ai_generation events. Each evaluation is one of three types:

  • hog — deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
  • llm_judge — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.
  • sentiment — classifies sentiment from user messages on each matching generation. Returns a sentiment label and score, not a pass/fail verdict.
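
To make the hog type more concrete, here is a minimal sketch of the kind of deterministic, rule-based check it describes. It is written in Python purely for illustration: a real evaluation of this type is written in Hog, and the function name `check_generation`, the `MAX_OUTPUT_CHARS` limit, and the required `answer` key are all hypothetical, not part of PostHog's API. The three return values mirror the true/false/N/A outcomes described above.

```python
import json
from typing import Optional

# Hypothetical length limit, chosen only for this illustration.
MAX_OUTPUT_CHARS = 2000


def check_generation(output_text: Optional[str]) -> Optional[bool]:
    """Return True (pass), False (fail), or None (N/A) for one generation."""
    # N/A: there is nothing to evaluate.
    if output_text is None or output_text.strip() == "":
        return None

    # Length limit: fail outputs that exceed the budget.
    if len(output_text) > MAX_OUTPUT_CHARS:
        return False

    # Format validation: the output must parse as JSON.
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return False

    # Structural assertion: a JSON object with a (hypothetical) "answer" key.
    return isinstance(parsed, dict) and "answer" in parsed
```

Because the check is plain code with no model call, the same input always yields the same verdict, which is what makes this type cheap and reproducible compared with llm_judge.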
exploring-llm-evaluations — posthog/ai-plugin