GO Prompt Sensitivity Audit
v1.0 — March 2026
The Problem
We use LLMs to classify customer signals — detecting complaints, identifying vulnerability, scoring severity. It works, but we don't know how fragile it is.
Small prompt changes sometimes flip classifications: reword the system instruction, reorder the few-shot examples, change "classify" to "determine" — and suddenly edge cases go the other way. We need to understand how sensitive our classifier is to prompt variations, find the failure modes, and build a more robust prompt.
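To make "prompt variation" concrete, here is a minimal sketch of a variant generator in Python. Everything in it is hypothetical: the instruction wordings, the few-shot examples, the binary COMPLAINT/NOT_COMPLAINT labels, and the helper names (build_prompt, prompt_variants) are placeholders, not our production prompt.

```python
import itertools

# Three paraphrases of the same instruction ("classify" vs. "determine" etc.).
INSTRUCTIONS = [
    "Classify the customer message as COMPLAINT or NOT_COMPLAINT.",
    "Determine whether the customer message is a COMPLAINT or NOT_COMPLAINT.",
    "Decide if the following message should be labeled COMPLAINT or NOT_COMPLAINT.",
]

# A tiny few-shot pool; reordering these is one of the perturbations we test.
FEW_SHOT = [
    ("I've been charged twice and nobody is responding.", "COMPLAINT"),
    ("Thanks, the refund arrived this morning.", "NOT_COMPLAINT"),
    ("How do I update my billing address?", "NOT_COMPLAINT"),
]

def build_prompt(instruction: str, examples: list[tuple[str, str]], message: str) -> str:
    """Assemble one prompt variant: instruction, few-shot block, then the target message."""
    shots = "\n\n".join(f"Message: {msg}\nLabel: {label}" for msg, label in examples)
    return f"{instruction}\n\n{shots}\n\nMessage: {message}\nLabel:"

def prompt_variants(message: str) -> list[str]:
    """Cross every instruction wording with every few-shot ordering: 3 x 3! = 18 variants."""
    return [
        build_prompt(instruction, list(order), message)
        for instruction in INSTRUCTIONS
        for order in itertools.permutations(FEW_SHOT)
    ]
```

Crossing perturbation axes like this keeps the variant set small enough to run exhaustively while still letting you isolate which axis (wording vs. ordering) caused a flip.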
Our job: systematically test prompt variations against a labeled dataset, measure what breaks, understand why, and produce a prompt that's demonstrably more robust than what we started with.
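One way to make "measure what breaks" operational is a per-item flip rate: run every variant over each labeled example and count how often the label disagrees with the majority answer. The sketch below assumes the prompt_variants helper from above and a classify() stub standing in for whatever model client we actually use; the dataset shape and all names are illustrative.

```python
from collections import Counter

def classify(prompt: str) -> str:
    """Placeholder for the real model call; should return a label string."""
    raise NotImplementedError("wire up the actual LLM client here")

def audit(dataset: list[tuple[str, str]]) -> None:
    """For each (message, gold_label) pair, run all variants and report instability."""
    for message, gold in dataset:
        votes = Counter(classify(p) for p in prompt_variants(message))
        majority, count = votes.most_common(1)[0]
        # Share of variants that disagree with the majority answer.
        flip_rate = 1 - count / sum(votes.values())
        status = "UNSTABLE" if flip_rate > 0 else "stable"
        print(f"{status:8} gold={gold} majority={majority} "
              f"flip_rate={flip_rate:.2f} :: {message[:60]}")
```

Sorting the output by flip rate surfaces the fragile edge cases first, which is where the robustness work should start; a baseline prompt and a revised prompt can then be compared on both accuracy and aggregate flip rate.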
How This Works
This is a pair exercise. The agent's job is to be a thinking partner: ask questions, provide scaffolding, help debug, and challenge assumptions — not to write the solution for you.