Exploring LLM evaluations

PostHog evaluations score $ai_generation events. Each evaluation is one of two types, both first-class:

  • hog — deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code (see the sketch after this list).
  • llm_judge — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.
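
To make the hog type concrete, here is a minimal sketch of a deterministic evaluator that combines a length limit with JSON format validation. The `event` binding, the `$ai_output_choices` property, the 8000-character budget, and returning null for N/A are all illustrative assumptions here, not a documented interface — adapt them to the actual evaluation context.

```hog
// Minimal sketch of a hog evaluator (hypothetical bindings).
// ASSUMPTION: the generation's completion text is reachable via an `event`
// binding and an `$ai_output_choices` property; adjust to the real context.
let text := event.properties['$ai_output_choices'];

if (empty(text)) {
    return null; // ASSUMPTION: null stands in for an N/A result
}

// Length limit: fail anything over an arbitrary 8000-character budget.
if (length(text) > 8000) {
    return false;
}

// Format validation: the completion must parse as JSON. Depending on the
// runtime, jsonParse may throw or return null on invalid input; cover both.
try {
    return jsonParse(text) != null;
} catch (e) {
    return false;
}
```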

Results from both types land in ClickHouse as $ai_evaluation events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether $ai_evaluation_reasoning was written by Hog code or by an LLM.
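
Because the schema is shared, one query covers both evaluator types. A sketch in HogQL (PostHog's SQL dialect over the events table); $ai_evaluation_name and $ai_evaluation_result are assumed property names for illustration, since only $ai_evaluation_reasoning is spelled out above:

```sql
-- Pass rate per evaluation over the last 7 days, independent of whether
-- the verdict came from Hog code or an LLM judge.
-- ASSUMPTION: $ai_evaluation_name and $ai_evaluation_result are illustrative
-- property names; check the actual $ai_evaluation schema before relying on them.
SELECT
    properties.$ai_evaluation_name AS evaluation,
    countIf(toString(properties.$ai_evaluation_result) = 'true') / count() AS pass_rate,
    count() AS runs
FROM events
WHERE event = '$ai_evaluation'
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY evaluation
ORDER BY runs DESC
```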
