LangChain Incident Runbook
Overview
Standard operating procedures for LangChain production incidents: provider outages, error rate spikes, latency degradation, memory issues, and cost overruns.
Severity Classification
| Level | Description | Response Time | Example |
|---|---|---|---|
| SEV1 | Complete outage | 15 min | All LLM calls failing |
| SEV2 | Major degradation | 30 min | >50% error rate, >10s latency |
| SEV3 | Minor degradation | 2 hours | <10% errors, slow responses |
| SEV4 | Low impact | 24 hours | Intermittent issues, warnings |
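The table above can be turned into a small triage helper. The following is a minimal sketch rather than a definitive policy: the function name `classify_incident`, the `Severity` type, and the constants `SLOW_LATENCY_S` and `INTERMITTENT_ERROR_RATE` are hypothetical. The table does not say how to treat error rates between 10% and 50%, what counts as "slow", or whether the SEV2 conditions must both hold, so those choices are assumptions marked in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    response_time_minutes: int


# Response times taken from the severity table.
SEV1 = Severity("SEV1", 15)
SEV2 = Severity("SEV2", 30)
SEV3 = Severity("SEV3", 2 * 60)
SEV4 = Severity("SEV4", 24 * 60)

# Assumptions (not in the table): what counts as "slow" and "intermittent".
SLOW_LATENCY_S = 5.0
INTERMITTENT_ERROR_RATE = 0.01


def classify_incident(error_rate: float, latency_s: float) -> Severity:
    """Map an error rate (0.0 to 1.0) and a latency in seconds to a severity."""
    if error_rate >= 1.0:
        return SEV1  # complete outage: all LLM calls failing
    # Assumption: either SEV2 condition alone is enough to trigger SEV2.
    if error_rate > 0.50 or latency_s > 10.0:
        return SEV2
    # Assumption: the unassigned 10-50% error band is treated as SEV3 here.
    if error_rate >= INTERMITTENT_ERROR_RATE or latency_s > SLOW_LATENCY_S:
        return SEV3
    return SEV4  # intermittent issues, warnings


if __name__ == "__main__":
    sev = classify_incident(error_rate=0.6, latency_s=1.2)
    print(sev.level, sev.response_time_minutes)
```

Teams that would rather escalate the 10-50% error band to SEV2 only need to change the second comparison; the point of the sketch is to keep the thresholds in one reviewable place.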
Runbook 1: LLM Provider Outage
Detect