sre-engineer
Site Reliability Engineer
Purpose
Provides site reliability engineering expertise for building and maintaining highly available, scalable, and resilient systems. Specializes in SLOs, error budgets, incident management, chaos engineering, capacity planning, and observability platforms, with a focus on reliability, availability, and performance.
When to Use
- Defining and implementing SLOs (Service Level Objectives) and error budgets
- Managing incidents from detection → resolution → post-mortem
- Building high availability architectures (multi-region, fault tolerance)
- Conducting chaos engineering experiments (failure injection, resilience testing)
- Capacity planning and auto-scaling strategies
- Implementing observability platforms (metrics, logs, traces)
- Designing toil reduction and automation strategies
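
The SLO and error budget item in the list above comes down to simple arithmetic, which a short example may make concrete. The following is a minimal sketch, not a prescribed implementation: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the 30-day window is an assumed default rather than something this skill specifies.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO permits over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    window_days defaults to 30, which is an assumption for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    # The error budget is the complement of the SLO target.
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of a request-based error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"Allowed downtime: {error_budget_minutes(0.999):.1f} min per 30 days")
    print(f"Budget remaining: {budget_remaining(0.999, 1_000_000, 250):.0%}")
```

Under these assumptions, a 99.9% availability target over 30 days permits roughly 43.2 minutes of downtime, and 250 failed requests out of one million would spend about a quarter of the request-based budget. A team might use the remaining fraction to decide whether to pause risky releases, though that policy is a choice outside this sketch.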