Exploring LLM evaluations

PostHog evaluations score $ai_generation events. Each evaluation is one of two types, both first-class:

hog — deterministic Hog code that returns true/false (and optionally N/A). Best for objective rule-based checks: format validation (JSON parses, schema matches), length limits, keyword presence/absence, regex patterns, structural assertions, latency thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this when the criterion can be expressed as code.
llm_judge — an LLM scores generations against a prompt you write. Best for subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic drift, instruction-following. Costs an LLM call per run and requires AI data processing approval at the org level.
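
To make the distinction concrete, here is a minimal sketch of the kind of rule a hog evaluation encodes. It is written in Python purely to illustrate the logic and is not Hog code. The function name `check_generation`, the `MAX_CHARS` limit, and the required `answer` key are all hypothetical. It returns True for pass, False for fail, and None to stand in for N/A.

```python
import json
from typing import Optional

MAX_CHARS = 2000  # hypothetical length limit, chosen only for illustration


def check_generation(output: Optional[str]) -> Optional[bool]:
    """Deterministic format check: True = pass, False = fail, None = N/A."""
    # Nothing to evaluate: report N/A instead of a failure.
    if output is None or output.strip() == "":
        return None

    # Length guard: oversized outputs fail immediately.
    if len(output) > MAX_CHARS:
        return False

    # Format validation: the output must parse as JSON.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False

    # Structural assertion: a JSON object with an "answer" key (assumed schema).
    return isinstance(parsed, dict) and "answer" in parsed
```

Because every branch is plain code, the same input always produces the same verdict and no model call is needed. A criterion such as "is the tone friendly?" has no comparable rule and would fall to llm_judge instead.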

Results from both types land in ClickHouse as $ai_evaluation events with the same schema, so the read/query/summary workflows are identical regardless of evaluator type — the only thing that changes is whether $ai_evaluation_reasoning was written by Hog code or by an LLM.
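
Since both evaluator types emit the same event shape, one summary routine can cover them. The sketch below assumes the events have already been exported as Python dictionaries. The property names `evaluation_name` and `result` are placeholders invented for this example, not the actual PostHog schema; only the `$ai_evaluation` event name comes from the text above.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def summarize(events: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Count pass / fail / N/A per evaluation and compute a pass rate."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"pass": 0, "fail": 0, "na": 0}
    )
    for event in events:
        # Skip anything that is not an evaluation result.
        if event.get("event") != "$ai_evaluation":
            continue
        props = event.get("properties", {})
        name = props.get("evaluation_name", "unknown")  # hypothetical property
        result = props.get("result")                    # hypothetical property
        if result is True:
            counts[name]["pass"] += 1
        elif result is False:
            counts[name]["fail"] += 1
        else:
            counts[name]["na"] += 1  # missing or null result counts as N/A

    summary: Dict[str, Dict[str, Any]] = {}
    for name, c in counts.items():
        applicable = c["pass"] + c["fail"]
        # N/A runs are excluded from the denominator.
        rate = c["pass"] / applicable if applicable else None
        summary[name] = {**c, "pass_rate": rate}
    return summary
```

The routine never inspects which evaluator produced an event, which is the practical consequence of the shared schema.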
