Databricks Incident Runbook

Overview

Rapid incident response for Databricks: triage script, decision tree, immediate actions by error type, communication templates, evidence collection, and postmortem template. Designed for on-call engineers to follow during live incidents.

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Production pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures, stale data |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, non-critical job delays |
| P4 | No user impact | Next business day | Monitoring gaps, cleanup needed |
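
The severity levels above can be encoded as a small lookup so that paging or alerting scripts print the matching response target. This is a minimal sketch: the function name `response_target` is hypothetical and not part of the runbook, and only the values come from the table.

```shell
#!/bin/bash
# Hypothetical helper (not from the runbook): map a severity level
# to the response-time target listed in the severity table.
response_target() {
  case "$1" in
    P1) echo "< 15 min" ;;
    P2) echo "< 1 hour" ;;
    P3) echo "< 4 hours" ;;
    P4) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target P2` prints the P2 target, and an unrecognized level returns a non-zero status so callers can detect typos.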

Instructions

Step 1: Quick Triage (Run First)

#!/bin/bash
set -euo pipefail
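
As an illustration of what a first-pass triage could check, the sketch below confirms credentials are present and then pulls cluster state and recent job runs from the Databricks REST API. It is not the runbook's own script: the function names `check_env`, `api_get`, and `triage` are hypothetical, and it assumes `curl` is installed and that `DATABRICKS_HOST` (including the `https://` scheme) and `DATABRICKS_TOKEN` are exported.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not the runbook's script.

# Fail early if the workspace URL or access token is missing.
check_env() {
  local var missing=0
  for var in DATABRICKS_HOST DATABRICKS_TOKEN; do
    if [[ -z "${!var:-}" ]]; then
      echo "missing environment variable: ${var}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# GET a REST path with bearer-token auth and print the JSON body.
# --fail makes curl return non-zero on HTTP errors such as 401 or 403.
api_get() {
  curl --silent --show-error --fail --max-time 15 \
    --header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
    "${DATABRICKS_HOST%/}$1"
}

# Print cluster state and the most recent job runs for a quick overview.
triage() {
  check_env || return 1
  echo "== Clusters =="
  api_get "/api/2.0/clusters/list"
  echo
  echo "== Recent job runs =="
  api_get "/api/2.1/jobs/runs/list?limit=5"
  echo
}
```

Source the file and run `triage`. The functions are kept free of `set -e` so that sourcing them does not change the behavior of the calling shell. The output is raw JSON and can be saved as incident evidence.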
First seen: Feb 14, 2026