To build great AI products, you must move from subjective "vibe checks" to systematic measurement. The two-phase error analysis below identifies where an LLM is failing and creates a feedback loop for continuous improvement.

Phase 1: Open Coding (The "Benevolent Dictator" Phase)

Before automating anything, ground yourself in the data by reviewing it manually. Appoint one "Benevolent Dictator," typically the Product Manager or a domain expert, to define what "good" looks like.

Sample the Data: Extract 50–100 "traces" (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an "Open Code") describing the first thing that went wrong, since later errors in the same trace often follow from it.
- Rule: Don't overthink it. Use specific language (e.g., "hallucinated virtual tour," "didn't confirm call transfer") rather than just "bad."
Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
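
The sampling and stopping steps above can be sketched in Python. This is a minimal illustration, not a prescribed implementation. It assumes traces have already been exported as a list of dicts with a hypothetical `trace_id` field. It also assumes that the reviewer flags by hand whether each note describes a new kind of failure. The "no new failure in the last `window` notes" rule is one possible way to put saturation into practice, and the window size of 20 is an arbitrary choice.

```python
import random
from dataclasses import dataclass


@dataclass
class OpenCode:
    trace_id: str
    note: str             # free-text note; empty if nothing went wrong
    is_new_failure: bool  # reviewer's judgment: a failure type not seen before?


def sample_traces(traces, k=100, seed=0):
    """Draw up to k traces uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(traces, min(k, len(traces)))


def reached_saturation(codes, window=20):
    """Assumed heuristic: saturated once the last `window` notes add no new failure type."""
    if len(codes) < window:
        return False
    return not any(code.is_new_failure for code in codes[-window:])


# Hypothetical export: 500 traces identified only by an id.
traces = [{"trace_id": f"t{i}"} for i in range(500)]
batch = sample_traces(traces, k=75)
```

The reviewer still reads every trace and writes every note. The code only keeps the sample reproducible and makes the stopping rule explicit.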

Phase 2: Axial Coding (Categorization)

Synthesize your mess of notes into actionable categories using an LLM.

Export Notes: Put your open codes into a CSV or spreadsheet.
Synthesize Failure Modes: Use an LLM (e.g., Claude or ChatGPT) to group your notes into 5–7 "Axial Codes" (failure categories).
- Prompt Pattern: "Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem."
Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
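
As a rough sketch of the export, prompt, and prioritization steps above, the following Python uses only the standard library. The two note strings come from the examples in Phase 1. The trace ids, the category names, and the trace-to-category mapping are hypothetical stand-ins for what the LLM or spreadsheet step would produce. No LLM is called here, so the prompt text would be pasted into whichever tool you use.

```python
import csv
import io
from collections import Counter

# Prompt pattern quoted from the step above.
AXIAL_PROMPT = (
    "Analyze these manual notes from AI traces and group them into "
    "actionable failure categories (Axial Codes). Each category should "
    "represent a specific product problem."
)


def notes_to_csv(open_codes):
    """Serialize (trace_id, note) pairs to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["trace_id", "note"])
    writer.writerows(open_codes)
    return buffer.getvalue()


def build_prompt(csv_text):
    """Attach the exported notes to the prompt pattern."""
    return f"{AXIAL_PROMPT}\n\nNotes (CSV):\n{csv_text}"


def prioritize(trace_to_category):
    """Count traces per axial code, most frequent first (a pivot-table equivalent)."""
    return Counter(trace_to_category.values()).most_common()


open_codes = [
    ("t1", "hallucinated virtual tour"),
    ("t2", "didn't confirm call transfer"),
    ("t3", "hallucinated virtual tour"),
]

# Hypothetical result of the "Map Back" step; category names are illustrative.
trace_to_category = {
    "t1": "Hallucinated capability",
    "t2": "Handoff failure",
    "t3": "Hallucinated capability",
}

prompt = build_prompt(notes_to_csv(open_codes))
ranking = prioritize(trace_to_category)
```

The ranking reflects frequency only. Deciding whether a rare category is risky enough to outrank a common one remains a human judgment.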
