sre


Cluster access (--context patterns) and internal service URLs are in the k8s skill.

Debugging Kubernetes Incidents

Core Principles

  • 5 Whys Analysis — NEVER stop at symptoms. Ask "why" until you reach the root cause.
  • Multi-Source Correlation — Combine logs, events, and metrics for a complete picture.
  • Zero Alert Tolerance — Every firing alert must be addressed: fix the root cause or, as a last resort, create a declarative Silence CR with a written justification. Never ignore or defer an alert.
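
As an illustration of multi-source correlation, the sketch below merges records from logs, events, and metrics into one time-ordered list and then pulls out everything that happened near a moment of interest. The `Record` type, the helper names, and the sample data are all hypothetical and exist only for this example; real records would come from whatever log, event, and metric tooling the cluster uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Record:
    source: str          # "log", "event", or "metric" (labels chosen for this sketch)
    timestamp: datetime
    message: str


def build_timeline(*sources):
    """Merge records from every source into one list ordered by time."""
    merged = [record for source in sources for record in source]
    return sorted(merged, key=lambda r: r.timestamp)


def around(timeline, moment, window):
    """Return records within `window` of `moment`, regardless of source."""
    return [r for r in timeline if abs(r.timestamp - moment) <= window]


# Hypothetical sample data, not taken from a real incident.
t0 = datetime(2026, 1, 1, 12, 0, 0)
logs = [Record("log", t0 + timedelta(seconds=40), "readiness probe failed")]
events = [
    Record("event", t0 + timedelta(seconds=5), "FailedScheduling"),
    Record("event", t0 + timedelta(minutes=10), "Pulled image"),
]
metrics = [Record("metric", t0 + timedelta(seconds=20), "node memory above 95%")]

timeline = build_timeline(logs, events, metrics)
nearby = around(timeline, t0 + timedelta(seconds=30), timedelta(seconds=30))
for r in nearby:
    print(r.timestamp.isoformat(), r.source, r.message)
```

Reading the three sources side by side in time order makes it easier to see which signal came first, which is usually the next "why" to ask about.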

The 5 Whys Analysis (CRITICAL)

Apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.

Example:

Symptom: Helm install failed with "context deadline exceeded"

Why #1: Pods never became Ready
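
One way to keep a chain like this honest is to record each answer and refuse to close the investigation until the latest answer has been explicitly marked as the root cause. The sketch below is a minimal illustration of that idea; the `Why` and `Investigation` classes are hypothetical helpers, not part of this skill, and only the symptom and first answer from the example above are used.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    answer: str
    evidence: str = ""           # where the answer came from (log line, event, metric)
    is_root_cause: bool = False


@dataclass
class Investigation:
    symptom: str
    whys: list = field(default_factory=list)

    def ask_why(self, answer, evidence="", is_root_cause=False):
        """Record the answer to the next 'why' in the chain."""
        self.whys.append(Why(answer, evidence, is_root_cause))

    def can_conclude(self):
        """Only conclude once the latest answer is marked as the root cause."""
        return bool(self.whys) and self.whys[-1].is_root_cause

    def summary(self):
        lines = [f"Symptom: {self.symptom}"]
        lines += [f"Why #{i}: {w.answer}" for i, w in enumerate(self.whys, start=1)]
        return "\n".join(lines)


inv = Investigation('Helm install failed with "context deadline exceeded"')
inv.ask_why("Pods never became Ready")
print(inv.summary())
print("ready to conclude:", inv.can_conclude())
```

With only the first answer recorded, `can_conclude()` stays false: "Pods never became Ready" is still a symptom, so the next step is to ask why again.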
Repository
ionfury/homelab
First Seen
Feb 25, 2026