LangChain Incident Runbook

Overview

Standard operating procedures for LangChain production incidents: provider outages, error rate spikes, latency degradation, memory issues, and cost overruns.

Severity Classification

| Level | Description | Response Time | Example |
|-------|-------------|---------------|---------|
| SEV1 | Complete outage | 15 min | All LLM calls failing |
| SEV2 | Major degradation | 30 min | >50% error rate, >10s latency |
| SEV3 | Minor degradation | 2 hours | <10% errors, slow responses |
| SEV4 | Low impact | 24 hours | Intermittent issues, warnings |
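
The severity levels above can be turned into a small triage helper so that alerts are classified consistently. The following is a minimal sketch, not part of the original runbook. The names `Severity` and `classify` are hypothetical. The table does not say whether the SEV2 conditions are joined by "and" or "or", what counts as "slow", or how to treat error rates between 10% and 50%. The sketch assumes that either SEV2 condition is enough, uses placeholder thresholds (`SLOW_LATENCY_S`, `MINOR_ERROR_RATE`) for the undefined cases, and treats the 10% to 50% range as SEV3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    level: str
    response_minutes: int  # target response time from the severity table


SEV1 = Severity("SEV1", 15)
SEV2 = Severity("SEV2", 30)
SEV3 = Severity("SEV3", 120)
SEV4 = Severity("SEV4", 24 * 60)

# Assumptions: the table does not define these values; tune them per service.
SLOW_LATENCY_S = 3.0     # placeholder for "slow responses"
MINOR_ERROR_RATE = 0.01  # placeholder floor separating SEV3 from SEV4


def classify(error_rate: float, p95_latency_s: float) -> Severity:
    """Map an error rate (0.0 to 1.0) and p95 latency in seconds to a severity."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0.0 and 1.0")
    if p95_latency_s < 0.0:
        raise ValueError("p95_latency_s must be non-negative")

    # SEV1: all LLM calls failing.
    if error_rate >= 1.0:
        return SEV1
    # SEV2: >50% error rate or >10s latency (either one, by assumption).
    if error_rate > 0.50 or p95_latency_s > 10.0:
        return SEV2
    # SEV3: noticeable errors or slow responses. Error rates between 10% and
    # 50% also land here, because the table leaves that range undefined.
    if error_rate >= MINOR_ERROR_RATE or p95_latency_s > SLOW_LATENCY_S:
        return SEV3
    # SEV4: intermittent issues and warnings only.
    return SEV4


if __name__ == "__main__":
    sev = classify(error_rate=0.62, p95_latency_s=4.2)
    print(f"{sev.level}: respond within {sev.response_minutes} min")
```

Keeping the thresholds as named constants makes it easy to adjust them once the team agrees on what "slow" and "intermittent" mean for a given service.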

Runbook 1: LLM Provider Outage

Detect

First seen: Feb 18, 2026