sre
Cluster access (--context patterns) and internal service URLs are in the k8s skill.
Debugging Kubernetes Incidents
Core Principles
- 5 Whys Analysis — NEVER stop at symptoms. Ask "why" until you reach the root cause.
- Multi-Source Correlation — Combine logs, events, metrics for a complete picture.
- Zero Alert Tolerance — Every firing alert must be addressed: fix the root cause, or as a last resort, create a declarative Silence CR with justification. Never ignore or defer.
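One way to picture Multi-Source Correlation is to merge logs, events, and metrics into a single timeline ordered by timestamp, so cause and effect can be read in sequence. The sketch below is a minimal illustration and not part of this skill: the `Observation` type, the `build_timeline` helper, and all sample entries are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Observation:
    timestamp: datetime
    source: str   # "log", "event", or "metric"
    message: str


def build_timeline(*streams):
    """Combine observations from every source into one time-ordered list."""
    return sorted(chain.from_iterable(streams), key=lambda o: o.timestamp)


# Hypothetical sample data, made up for illustration only.
logs = [Observation(datetime(2024, 1, 1, 12, 0, 8), "log", "sample log line")]
events = [Observation(datetime(2024, 1, 1, 12, 0, 10), "event", "sample event")]
metrics = [Observation(datetime(2024, 1, 1, 12, 0, 5), "metric", "sample metric change")]

timeline = build_timeline(logs, events, metrics)
for obs in timeline:
    print(obs.timestamp.isoformat(), obs.source, obs.message)
```

Reading the merged timeline top to bottom shows which signal appeared first, which a single source cannot show on its own.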
The 5 Whys Analysis (CRITICAL)
Apply 5 Whys before concluding any investigation. Stopping at symptoms leads to ineffective fixes.
Example:
Symptom: Helm install failed with "context deadline exceeded"
Why #1: Pods never became Ready
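The rule "apply 5 Whys before concluding" can be sketched as a small record that refuses to close until a root cause has been confirmed. This is an illustrative sketch only: the `Investigation` class and its methods are hypothetical, and the data is limited to the symptom and first "why" quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class Investigation:
    symptom: str
    whys: list = field(default_factory=list)
    root_cause_confirmed: bool = False

    def ask_why(self, answer, is_root_cause=False):
        """Record the answer to the next "why" and whether it is the root cause."""
        self.whys.append(answer)
        self.root_cause_confirmed = is_root_cause

    def conclude(self):
        """Return the root cause, or refuse if the chain still ends at a symptom."""
        if not self.root_cause_confirmed:
            raise ValueError("root cause not reached; keep asking why")
        return self.whys[-1]


inv = Investigation('Helm install failed with "context deadline exceeded"')
inv.ask_why("Pods never became Ready")

try:
    inv.conclude()
except ValueError as exc:
    print(exc)  # the chain stops at a symptom, so the investigation stays open
```

With only Why #1 recorded, `conclude()` raises, which mirrors the principle that an investigation stopping at "Pods never became Ready" is not finished.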