ai-error-analysis-and-eval-design
To build great AI products, you must transition from subjective "vibe checks" to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.
Phase 1: Open Coding (The "Benevolent Dictator" Phase)
Before automating anything, ground yourself in the data by hand. Appoint one "Benevolent Dictator," typically the Product Manager or a domain expert, to define what "good" looks like.
- Sample the Data: Extract 50–100 "traces" (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
- Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an "Open Code") describing the first thing that went wrong, since later errors in the same trace often follow from it.
  - Rule: Don't overthink it. Use specific language (e.g., "hallucinated virtual tour," "didn't confirm call transfer") rather than just "bad."
- Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
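The sampling and saturation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the `window` of 20 traces, and the exact-match comparison of notes are all assumptions, and it presumes you have already exported traces from your observability tool into a plain list.

```python
import random


def sample_traces(traces, k=100, seed=0):
    """Return up to k traces chosen uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def reached_saturation(open_codes, window=20):
    """Rough saturation check (hypothetical heuristic).

    open_codes: one note per reviewed trace, in review order; "" means no error.
    Returns True when the most recent `window` traces produced no note that
    had not already appeared earlier. Assumes the reviewer reuses the same
    wording when the same failure shows up again.
    """
    if len(open_codes) <= window:
        return False  # not enough reviewed traces to judge

    def normalize(note):
        return note.strip().lower()

    seen_before = {normalize(c) for c in open_codes[:-window] if c.strip()}
    recent = {normalize(c) for c in open_codes[-window:] if c.strip()}
    return recent <= seen_before
```

Because open codes are free-form, an exact-match check like this is only a prompt to pause and ask whether you are still learning anything new; the final call on saturation stays with the reviewer.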
Phase 2: Axial Coding (Categorization)
Synthesize your mess of notes into actionable categories using an LLM.
- Export Notes: Put your open codes into a CSV or spreadsheet.
- Synthesize Failure Modes: Use an LLM (Claude or ChatGPT) to group your notes into 5–7 "Axial Codes" (failure categories).
  - Prompt Pattern: "Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem."
- Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
- Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
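If you prefer a script to a spreadsheet, the prompt assembly and the pivot-table count can be approximated as follows. This is a sketch under stated assumptions: the column name `axial_code`, the function names, and the CSV layout are hypothetical, and the call to the LLM itself is left out because it depends on the provider you use.

```python
import csv
import io
from collections import Counter

# Prompt pattern quoted from the step above.
AXIAL_PROMPT = (
    "Analyze these manual notes from AI traces and group them into "
    "actionable failure categories (Axial Codes). Each category should "
    "represent a specific product problem."
)


def build_axial_prompt(open_codes):
    """Combine the prompt pattern with the non-empty open codes, one per line."""
    notes = "\n".join(f"- {c.strip()}" for c in open_codes if c.strip())
    return f"{AXIAL_PROMPT}\n\nNotes:\n{notes}"


def count_failure_modes(csv_text, column="axial_code"):
    """Count traces per category, most frequent first (a pivot-table stand-in).

    csv_text: CSV content with a header row; `column` is an assumed name for
    the field holding each trace's assigned axial code. Blank cells are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    labels = ((row.get(column) or "").strip() for row in reader)
    return Counter(label for label in labels if label).most_common()
```

For example, passing the text of an exported sheet to `count_failure_modes` returns a list of `(category, count)` pairs sorted by frequency. Frequency alone does not capture risk, so a rare but severe category may still deserve to be worked on first.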