post-mortem
Post-Mortem (Blameless Incident Review)
Overview
A post-mortem is a structured, blameless review held after an incident, outage, regression, missed launch, or failed experiment. The goal is not to assign fault but to learn how the system (people, process, code, and organization) produced the outcome, and to commit to durable changes that reduce the chance of recurrence.
This skill operationalizes the Google SRE blameless post-mortem template, the Etsy "morgue" tradition, John Allspaw's writing on blameless post-mortems, Richard Cook's "How Complex Systems Fail", Charles Perrow's Normal Accident Theory, and Sidney Dekker's The Field Guide to Understanding "Human Error". Where the companion discovery/pre-mortem/ skill imagines failure before it happens, post-mortem learns from failure that has already happened.
When to Use
- Severity 1 / Severity 2 incident — customer-facing outage, data loss, security event, payments failure, or significant regression.
- Sev 3 with novelty — even a small incident is worth a post-mortem if it surfaced a class of failure the team has not seen before.
- Missed launch or rolled-back release — the launch itself is the incident.
- Failed experiment with a negative business outcome — a pricing test that depressed revenue, an onboarding change that hurt activation.
- Customer escalation — an executive customer call where the product was the proximate cause.
- Near miss — the deploy that "almost" took the site down. Post-mortems on near misses are some of the highest-leverage learning a team gets.
If the incident is below the team's severity threshold and the team has seen the same class of failure recently, document the recurrence in the existing post-mortem rather than producing a new one.
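The criteria above can be read as a small decision rule. The sketch below is one possible encoding, not a prescribed implementation: the names `Incident`, `Decision`, and `decide`, the default threshold of Sev 2, and the `TEAM_DISCRETION` fallback for cases the criteria do not cover are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    NEW_POST_MORTEM = "write a new post-mortem"
    APPEND_TO_EXISTING = "document the recurrence in the existing post-mortem"
    TEAM_DISCRETION = "not covered by the criteria; the team decides"


@dataclass
class Incident:
    # Hypothetical fields mirroring the "When to Use" bullets.
    severity: Optional[int] = None  # 1 is most severe; None if unrated
    novel_failure_class: bool = False
    missed_or_rolled_back_launch: bool = False
    failed_experiment_negative_outcome: bool = False
    customer_escalation: bool = False
    near_miss: bool = False
    same_class_seen_recently: bool = False


def decide(incident: Incident, severity_threshold: int = 2) -> Decision:
    """Map an incident to a review decision.

    Assumption: severity_threshold=2 means Sev 1 and Sev 2 always qualify.
    Assumption: explicit triggers take precedence over the recurrence rule.
    """
    meets_threshold = (
        incident.severity is not None
        and incident.severity <= severity_threshold
    )
    explicit_trigger = (
        incident.novel_failure_class
        or incident.missed_or_rolled_back_launch
        or incident.failed_experiment_negative_outcome
        or incident.customer_escalation
        or incident.near_miss
    )
    if meets_threshold or explicit_trigger:
        return Decision.NEW_POST_MORTEM
    if incident.same_class_seen_recently:
        # Below threshold and a known, recent failure class.
        return Decision.APPEND_TO_EXISTING
    # The criteria say nothing about this case, so no answer is invented.
    return Decision.TEAM_DISCRETION
```

For example, `decide(Incident(severity=3, same_class_seen_recently=True))` yields `Decision.APPEND_TO_EXISTING`, while setting `novel_failure_class=True` on the same incident yields `Decision.NEW_POST_MORTEM`. Teams with a different severity scale would adjust `severity_threshold` accordingly.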