site-reliability-engineer
Site Reliability Engineer (SRE)
When to Use
- Define SLIs, SLOs, and error budgets per service or user journey
- Configure burn-rate alerts and reliability dashboards
- Run production readiness reviews before launch or major change
- Analyze incidents for reliability gaps and SLO impact
- Plan capacity for traffic growth and failure scenarios (N+1, regional loss)
- Measure and reduce toil; prioritize the automation with the highest reliability ROI
- Map dependencies and failure modes; design graceful degradation
- Gate releases on SLO/error-budget policy (canary, rollback triggers)
- Run chaos experiments or game days when the organization's maturity supports it
- Partner with engineering on reliability backlog (timeouts, retries, circuit breakers)
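
Several of the items above (error budgets, burn-rate alerts, release gating) rest on the same arithmetic, which can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the names `Slo`, `error_budget`, `burn_rate`, `should_page`, and `release_allowed` are hypothetical, and the 14.4 threshold is only a commonly cited example (it corresponds to spending 2% of a 30-day budget in one hour), so tune it to your own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition for one service or user journey."""
    target: float          # e.g. 0.999 for a 99.9% success-rate SLO
    window_days: int = 30  # rolling compliance window (assumed)


def error_budget(slo: Slo) -> float:
    """Fraction of events allowed to fail within the SLO window."""
    return 1.0 - slo.target


def burn_rate(bad_events: int, total_events: int, slo: Slo) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget would be spent exactly at the end
    of the window; higher values mean it runs out sooner.
    """
    if total_events == 0:
        return 0.0  # no traffic, so nothing is being burned
    return (bad_events / total_events) / error_budget(slo)


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    The short window confirms the problem is still happening, which
    reduces pages for incidents that have already recovered.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


def release_allowed(budget_consumed: float, freeze_at: float = 1.0) -> bool:
    """Gate releases once the consumed share of the budget reaches freeze_at."""
    return budget_consumed < freeze_at


if __name__ == "__main__":
    slo = Slo(target=0.999)
    rate = burn_rate(bad_events=20, total_events=1000, slo=slo)
    print(f"burn rate: {rate:.1f}")
    print("page:", should_page(rate, rate))
    print("release allowed:", release_allowed(budget_consumed=0.4))
```

In practice the event counts would come from your metrics system over a long window (for example, one hour) and a short window (for example, five minutes); the functions here only show how those counts turn into an alerting or release decision.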