incident-response


Incident Response and Remediation

Patterns for diagnosing and fixing production issues.

Healer Mode Workflow

  1. Investigate - Gather metrics, logs, and system state
  2. Diagnose - Identify the root cause before attempting a fix
  3. Fix - Implement a minimal, targeted fix
  4. Validate - Confirm that metrics improve after deployment
  5. Document - Store learnings for future incidents
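
The ordering above can be sketched as a small state tracker that refuses to skip ahead, for example recording a fix before a diagnosis. This is only an illustrative sketch: the `Stage` enum and `Incident` class are hypothetical names and not part of any tool referenced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    """The five workflow steps, numbered in the order listed above."""
    INVESTIGATE = 1
    DIAGNOSE = 2
    FIX = 3
    VALIDATE = 4
    DOCUMENT = 5


@dataclass
class Incident:
    """Hypothetical record that tracks progress through the workflow."""
    title: str
    completed: List[Stage] = field(default_factory=list)

    def next_stage(self) -> Optional[Stage]:
        # The next stage is the one after the last completed stage,
        # or None once all five stages are done.
        done = len(self.completed)
        return Stage(done + 1) if done < len(Stage) else None

    def advance(self, stage: Stage) -> None:
        # Reject out-of-order steps so a fix cannot precede a diagnosis.
        expected = self.next_stage()
        if stage is not expected:
            raise ValueError(f"expected {expected}, got {stage}")
        self.completed.append(stage)
```

Calling `advance(Stage.FIX)` immediately after `advance(Stage.INVESTIGATE)` raises a `ValueError`, which mirrors the rule of diagnosing before fixing.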

Tool Usage Priority

  1. Observability Tools - Query Prometheus, Loki, and Grafana for metrics and logs
  2. Kubernetes Tools - Check pod status, events, and deployments
  3. ArgoCD Tools - Verify GitOps sync status
  4. Memory Search - Look for similar past incidents
  5. Code Fix - Implement a minimal, targeted fix
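
One way to picture this priority order is a runner that consults whichever tools are available, always in the listed sequence, with the code fix last. The sketch below is hypothetical: the tool keys and the check callables are illustrative stand-ins, not real client APIs for Prometheus, Kubernetes, or ArgoCD.

```python
from typing import Callable, Dict, List, Tuple

# Priority order taken from the list above; the keys are hypothetical labels.
TOOL_PRIORITY: List[str] = [
    "observability",
    "kubernetes",
    "argocd",
    "memory_search",
    "code_fix",
]


def run_in_priority(
    checks: Dict[str, Callable[[], List[str]]],
) -> List[Tuple[str, List[str]]]:
    """Run the supplied checks in priority order and collect their findings.

    `checks` maps a tool label to a zero-argument callable that returns a
    list of findings. Tools without a registered check are skipped.
    """
    results: List[Tuple[str, List[str]]] = []
    for name in TOOL_PRIORITY:
        check = checks.get(name)
        if check is None:
            continue
        results.append((name, check()))
    return results
```

Regardless of the order in which checks are registered, the findings come back with observability first, so evidence is gathered before any code change is considered.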
incident-response — 5dlabs/cto-agents