production-error-handling
Production Error Handling
This skill encodes battle-tested error handling patterns for systems that must stay up when everything around them is falling apart. Every pattern here comes from a real production incident: the retry storm that turned a blip into a 4-hour outage, the bare except: pass that silently ate data for three weeks, the missing timeout that let a dead service hold open 200 connections until the pool starved. Follow this guide to keep those failure modes out of your own systems.
1. Error Taxonomy
Classify every error BEFORE writing handling code. Different errors demand different responses. Treating them the same is how you turn a recoverable hiccup into a cascading outage.
The Four Categories
| Category | Response | Retry? | Alert? | Examples |
|---|---|---|---|---|
| Transient | Retry with backoff | Yes | After N failures | Network timeout, 503, connection reset, rate limited (429) |
| Permanent | Fail immediately | Never | When unusually frequent | 400, 401, 404, validation error, malformed input |
| Partial | Degrade gracefully | Optional | Low priority | Cache miss, analytics down, email service down |
| Fatal | Crash fast | Never | Immediate (PagerDuty) | Missing config, corrupt state, OOM, disk full |
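The taxonomy above can be sketched as a small classifier that runs before any handling logic. This is a minimal illustration, not a definitive implementation: the names `ErrorCategory`, `classify_http_status`, `classify_exception`, and the `dependency_optional` flag are hypothetical, the status codes and exception types are limited to examples from the table, and treating unrecognized exceptions as Permanent (so they are never retried blindly) is an assumption of this sketch.

```python
from enum import Enum
from typing import Optional


class ErrorCategory(Enum):
    TRANSIENT = "transient"  # retry with backoff
    PERMANENT = "permanent"  # fail immediately, never retry
    PARTIAL = "partial"      # degrade gracefully
    FATAL = "fatal"          # crash fast, alert immediately


# Only the status codes listed in the Examples column of the table.
STATUS_CATEGORY = {
    429: ErrorCategory.TRANSIENT,
    503: ErrorCategory.TRANSIENT,
    400: ErrorCategory.PERMANENT,
    401: ErrorCategory.PERMANENT,
    404: ErrorCategory.PERMANENT,
}


def classify_http_status(status: int) -> Optional[ErrorCategory]:
    """Return the category for a listed status code, or None if unlisted."""
    return STATUS_CATEGORY.get(status)


def classify_exception(
    exc: BaseException, dependency_optional: bool = False
) -> ErrorCategory:
    """Map an exception to a category.

    dependency_optional is a hypothetical flag: whether the failing call
    belongs to a non-critical dependency (cache, analytics, email).
    """
    # Fatal conditions win regardless of which dependency raised them.
    if isinstance(exc, MemoryError):
        return ErrorCategory.FATAL
    # A failure in an optional dependency means degrading, not failing.
    if dependency_optional:
        return ErrorCategory.PARTIAL
    if isinstance(exc, (TimeoutError, ConnectionResetError)):
        return ErrorCategory.TRANSIENT
    if isinstance(exc, ValueError):
        return ErrorCategory.PERMANENT
    # Assumption of this sketch: unknown errors are not retried.
    return ErrorCategory.PERMANENT


def should_retry(category: ErrorCategory) -> bool:
    """Only transient errors are retried in this sketch."""
    return category is ErrorCategory.TRANSIENT
```

Note that the Partial category is decided by which dependency failed rather than by the exception type, which is why the sketch takes it as an argument from the call site. The table marks retries for Partial errors as optional; this sketch simply does not retry them.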