databricks-incident-runbook
Databricks Incident Runbook
Overview
Rapid incident response for Databricks: triage script, decision tree, immediate actions by error type, communication templates, evidence collection, and postmortem template. Designed for on-call engineers to follow during live incidents.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Production pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures, stale data |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, non-critical job delays |
| P4 | No user impact | Next business day | Monitoring gaps, cleanup needed |
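When severity is assigned from a script (for example, to stamp an alert with its response target), the table above can be encoded as a small lookup. This is a minimal sketch; the function name `response_time` is hypothetical and not part of the original runbook.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: map a severity level to the response time
# listed in the severity table. Accepts upper- or lower-case levels.
response_time() {
  case "${1:-}" in
    P1|p1) echo "< 15 min" ;;
    P2|p2) echo "< 1 hour" ;;
    P3|p3) echo "< 4 hours" ;;
    P4|p4) echo "next business day" ;;
    *)
      # Unknown levels are an error so a typo cannot silently downgrade an incident.
      echo "unknown severity: ${1:-<empty>}" >&2
      return 1
      ;;
  esac
}

# Example: print the response target for a P1 incident.
response_time P1
```

Failing on an unrecognized level is a deliberate choice here: during a live incident it is safer for the script to stop than to guess a severity.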
Instructions
Step 1: Quick Triage (Run First)
#!/bin/bash
set -euo pipefail
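The triage script above is cut off after its first two lines in this copy, so its actual checks are unknown. As a rough sketch of what a first-pass triage could look like, the script below verifies prerequisites before touching the workspace. It assumes token-based authentication through the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables (profile or OAuth setups would need different checks) and assumes the `databricks` CLI is on the `PATH`. The helper names `check_env` and `run_triage` are hypothetical.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper: report whether a required environment variable is set.
# The value is never printed, so tokens do not leak into incident logs.
check_env() {
  local name="$1"
  if [[ -n "${!name:-}" ]]; then
    echo "ok: ${name} is set"
  else
    echo "missing: ${name}"
    return 1
  fi
}

# Hypothetical helper: run the first-pass checks and count failures.
run_triage() {
  local failures=0
  echo "triage started: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

  check_env DATABRICKS_HOST || failures=$((failures + 1))
  check_env DATABRICKS_TOKEN || failures=$((failures + 1))

  if command -v databricks >/dev/null 2>&1; then
    echo "ok: databricks CLI found"
    # Only call the workspace when credentials are present. Listing clusters
    # is used here as a cheap probe that the API answers at all.
    if [[ "${failures}" -eq 0 ]]; then
      if databricks clusters list >/dev/null 2>&1; then
        echo "ok: workspace API reachable"
      else
        echo "failed: databricks clusters list"
        failures=$((failures + 1))
      fi
    fi
  else
    echo "missing: databricks CLI"
    failures=$((failures + 1))
  fi

  echo "failed checks: ${failures}"
  [[ "${failures}" -eq 0 ]]
}

# Keep going after a failed check so the full report is always printed.
run_triage || echo "triage found problems; see output above"
```

Counting failures instead of exiting on the first one means a single run shows every missing prerequisite, which saves round trips while the incident clock is running.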