incident-response
Installation
SKILL.md
Incident Response and Remediation
Patterns for diagnosing and fixing production issues.
Healer Mode Workflow
- Investigate - Gather metrics, logs, and system state
- Diagnose - Identify root cause before fixing
- Fix - Implement minimal targeted fix
- Validate - Confirm metrics improve after deployment
- Document - Store learnings for future incidents
Tool Usage Priority
- Observability Tools - Query Prometheus, Loki, Grafana for metrics and logs
- Kubernetes Tools - Check pod status, events, deployments
- ArgoCD Tools - Verify GitOps sync status
- Memory Search - Look for similar past incidents
- Code Fix - Implement minimal targeted fix