usability-testing
Usability Testing
Coverage
Usability testing covers the evaluative research practice of watching people attempt realistic tasks on a prototype or product, then identifying the obstacles they encounter. The dominant method is the think-aloud protocol (Ericsson & Simon), where participants narrate their thoughts as they work, surfacing the mental model they are using and the points where it diverges from the design. Sessions are organized around task scenarios — short narratives that frame a goal without prescribing the steps ("you want to find out how much you owe in taxes this quarter") — and a moderator who maintains neutrality, resists answering questions, and prompts only with open-ended interventions like "what are you thinking now?" or "what did you expect to happen?".
The skill covers sample sizing. The widely cited Nielsen/Landauer "5-user rule" estimates that 5 users surface roughly 85% of the usability problems for a homogeneous user group on a discrete task, with steeply diminishing returns afterward. The rule has important limits: it applies per distinct user segment, per discrete task scope, and only to formative (iterative diagnostic) testing. It does not apply to summative (benchmark) studies, which require much larger samples for valid statistical comparison. Misapplying the 5-user rule to summative claims is a common error.
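The arithmetic behind the 5-user figure can be sketched as follows. This is a minimal illustration, assuming the commonly quoted problem-discovery model 1 - (1 - p)^n and an average per-user discovery probability of p = 0.31; that value is an average reported across projects, not a constant, and the function names here are hypothetical.

```python
import math


def discovery_rate(n_users: int, p: float = 0.31) -> float:
    """Expected share of problems seen at least once after n_users sessions.

    p is the assumed probability that a single user hits a given problem.
    The default 0.31 is an often-quoted average, not a property of any
    particular product or user segment.
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return 1 - (1 - p) ** n_users


def users_needed(target: float, p: float = 0.31) -> int:
    """Smallest number of users whose expected discovery share reaches target."""
    if not 0 < target < 1 or not 0 < p < 1:
        raise ValueError("target and p must be strictly between 0 and 1")
    return math.ceil(math.log(1 - target) / math.log(1 - p))


if __name__ == "__main__":
    # Diminishing returns: each added user contributes less new discovery.
    for n in (1, 3, 5, 10, 15):
        print(f"{n:>2} users -> {discovery_rate(n):.0%} of problems (p = 0.31)")
```

The result is sensitive to p: with a harder-to-trigger problem set (say p = 0.10), five users would be expected to surface only about 41% of problems, which is one reason the rule should be applied per segment and per task scope rather than as a universal number.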
Findings are organized by severity rating (Nielsen's 0–4 scale: 0 = not a problem, 1 = cosmetic, 2 = minor, 3 = major, 4 = catastrophic) so the team can triage. Task success rate, time on task, and standardized instruments like SUS (System Usability Scale, Brooke 1996) provide quantitative complements when needed. The practice distinguishes moderated sessions (richer data, higher cost, requires scheduling) from unmoderated tools (lower cost, scales to dozens of sessions, sacrifices the moderator's ability to follow up on surprises).
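For the quantitative side, the standard SUS scoring procedure and a task success rate are simple to compute. The sketch below assumes the usual 10-item SUS questionnaire answered on a 1 to 5 scale, with items given in questionnaire order; the function names and the sample data are hypothetical.

```python
def sus_score(responses: list[int]) -> float:
    """Score one participant's SUS questionnaire on the 0-100 scale.

    responses holds the 10 answers in questionnaire order, each from
    1 (strongly disagree) to 5 (strongly agree). Odd-numbered items are
    positively worded and contribute (answer - 1); even-numbered items
    are negatively worded and contribute (5 - answer). The sum of the
    contributions (0-40) is multiplied by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")
    if any(r not in range(1, 6) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, answer in enumerate(responses, start=1):
        total += (answer - 1) if item_number % 2 == 1 else (5 - answer)
    return total * 2.5


def task_success_rate(outcomes: list[bool]) -> float:
    """Share of attempts in which the participant completed the task."""
    if not outcomes:
        raise ValueError("at least one task attempt is required")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Made-up data for one participant and one task, for illustration only.
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
    print(task_success_rate([True, True, False, True, False]))
```

A SUS score is not a percentage, and with formative sample sizes both numbers are best read as rough signals alongside the observed obstacles rather than as benchmark results.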
The skill also covers what NOT to do in a session: leading prompts, defending the design, explaining how the design "is supposed to work" when the participant gets stuck, and over-fitting interpretations to a single dramatic finding from one participant.
Philosophy
Usability testing is built on a humbling claim: designers and engineers cannot reliably predict where users will struggle. The mental models that make a design feel obvious to its creators are exactly the models a fresh user lacks, and only direct observation closes that gap. The discipline rejects "I think users will understand this" in favor of "we watched users; here is what happened." Each session that confirms the design entirely is mildly suspicious — either the tasks were too easy or the moderator was unintentionally helping.
The practice is opinionated about moderator behavior. The moderator's job is to be uninteresting — to let the silence sit, to let the participant struggle long enough for the obstacle to become visible, to not rescue. This is hard because the social instinct is to help, and the design instinct is to defend. A moderator who explains the design after a participant gets stuck has destroyed the evidence; the obstacle the participant just encountered is the finding, and it cannot be re-observed in that session.