production-error-handling

Production Error Handling

This skill encodes battle-tested error handling patterns for systems that must stay up when everything around them is falling apart. Every pattern here comes from real production incidents: the retry storm that turned a blip into a 4-hour outage, the bare except: pass that silently ate data for three weeks, the missing timeout that let a dead service hold open 200 connections until the pool starved. Follow this guide to keep those failures from recurring on your watch.


1. Error Taxonomy

Classify every error BEFORE writing handling code. Different errors demand different responses. Treating them the same is how you turn a recoverable hiccup into a cascading outage.

The Four Categories

| Category | Response | Retry? | Alert? | Examples |
|---|---|---|---|---|
| Transient | Retry with backoff | Yes | After N failures | Network timeout, 503, connection reset, rate limited (429) |
| Permanent | Fail immediately | Never | On unexpected frequency | 400, 401, 404, validation error, malformed input |
| Partial | Degrade gracefully | Optional | Low priority | Cache miss, analytics down, email service down |
| Fatal | Crash fast | Never | Immediate (PagerDuty) | Missing config, corrupt state, OOM, disk full |
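
One way to make the classification explicit in code is a small classifier that runs before any handling logic. The sketch below is illustrative only: `ErrorCategory`, `classify_http_status`, `classify_exception`, and the `optional_dependency` flag are hypothetical names, the status codes and exception types are taken from the Examples column, and treating unlisted errors as permanent is an assumption that each service should revisit.

```python
import errno
from enum import Enum


class ErrorCategory(Enum):
    TRANSIENT = "transient"  # retry with backoff
    PERMANENT = "permanent"  # fail immediately, never retry
    PARTIAL = "partial"      # degrade gracefully
    FATAL = "fatal"          # crash fast, alert immediately


# Status codes copied from the table's examples; extend per service.
TRANSIENT_STATUS = {429, 503}
PERMANENT_STATUS = {400, 401, 404}


def classify_http_status(status: int) -> ErrorCategory:
    """Map an HTTP status code to a category."""
    if status in TRANSIENT_STATUS:
        return ErrorCategory.TRANSIENT
    if status in PERMANENT_STATUS:
        return ErrorCategory.PERMANENT
    # Assumption: unlisted codes are never retried blindly.
    return ErrorCategory.PERMANENT


def classify_exception(
    exc: BaseException, optional_dependency: bool = False
) -> ErrorCategory:
    """Map an exception to a category.

    optional_dependency is a hypothetical flag the caller sets when the
    failing call is non-critical (cache, analytics, email).
    """
    # Fatal conditions win over everything else: out of memory, disk full.
    if isinstance(exc, MemoryError):
        return ErrorCategory.FATAL
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return ErrorCategory.FATAL
    # Any other failure in a non-critical dependency is a partial failure.
    if optional_dependency:
        return ErrorCategory.PARTIAL
    # Network timeouts and connection resets are worth retrying.
    if isinstance(exc, (TimeoutError, ConnectionResetError)):
        return ErrorCategory.TRANSIENT
    # Validation errors and malformed input will fail again on retry.
    if isinstance(exc, (ValueError, TypeError)):
        return ErrorCategory.PERMANENT
    # Assumption: unknown exceptions are treated as permanent.
    return ErrorCategory.PERMANENT
```

A caller would branch on the returned category rather than on the raw exception, so retry, degrade, and crash decisions live in one place and the classification rules stay easy to audit.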