sre-engineer

Summary

SRE practices for defining SLOs, managing error budgets, automating toil, and building resilient production systems.

  • Defines quantitative SLOs with SLI measurements, calculates error budgets, and enforces burn-rate policies to balance reliability with feature velocity
  • Provides golden signal monitoring (latency, traffic, errors, saturation) with multiwindow burn-rate alerting rules and PromQL query templates
  • Includes automation patterns for toil reduction, chaos engineering test design, and incident response runbooks with blameless postmortem guidance
  • Delivers concrete Prometheus configurations, Python remediation scripts, and capacity planning workflows ready for production deployment
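
As a rough illustration of the error-budget arithmetic mentioned in the summary, the sketch below converts an SLO target into an allowed-downtime budget and reports how much of an event-based budget is left. The function names and the 30-day window are assumptions made for this example; they are not taken from the skill's reference files.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 1.0  # no traffic observed, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of budget, and 500 failed requests out of 1,000,000 against that target consume about half of it.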
SKILL.md

SRE Engineer

Core Workflow

  1. Assess reliability - Review architecture, SLOs, incidents, toil levels
  2. Define SLOs - Identify meaningful SLIs and set appropriate targets
  3. Verify alignment - Confirm SLO targets reflect user expectations before proceeding
  4. Implement monitoring - Build golden signal dashboards and alerting
  5. Automate toil - Identify repetitive tasks and build automation
  6. Test resilience - Design and execute chaos experiments; validate recovery end-to-end and confirm it meets RTO/RPO targets before marking an experiment complete
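
The recovery check in step 6 could be sketched along these lines. The `ExperimentResult` type, its field names, and `meets_recovery_targets` are hypothetical names introduced for this example; the sketch assumes the experiment records when the fault was injected, when service was restored, and the timestamp of the last write that survived the fault.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ExperimentResult:
    # Hypothetical record of one chaos experiment run.
    fault_injected_at: datetime
    service_recovered_at: datetime
    last_durable_write_at: datetime  # newest write that survived the fault


def meets_recovery_targets(
    result: ExperimentResult, rto: timedelta, rpo: timedelta
) -> dict:
    """Compare observed recovery time and data-loss window with RTO/RPO."""
    recovery_time = result.service_recovered_at - result.fault_injected_at
    data_loss_window = result.fault_injected_at - result.last_durable_write_at
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }
```

An experiment would be marked complete only when both `rto_met` and `rpo_met` are true; otherwise the measured gaps point at what to fix before rerunning it.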

Reference Guide

Load detailed guidance based on context:

Topic          Reference                           Load When
SLO/SLI        references/slo-sli-management.md    Defining SLOs, calculating error budgets
Error Budgets  references/error-budget-policy.md   Managing budgets, burn rates, policies
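
The burn-rate policies referenced above are commonly implemented as multiwindow checks. The following is a minimal sketch of that idea, not the content of references/error-budget-policy.md. The function names are made up for this example, and the 14.4 threshold is the value the Google SRE Workbook suggests for a 1-hour long window paired with a 5-minute short window; the skill itself may use different numbers.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 spends it exactly over the SLO window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_alert(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default, taken from the SRE Workbook
) -> bool:
    """Fire only when both windows exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) > threshold
        and burn_rate(short_window_error_ratio, slo_target) > threshold
    )
```

With a 99.9% SLO, a sustained 2% error ratio in both windows is a burn rate of about 20 and would page, while the same long-window ratio with a recovered short window would not.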

More from jeffallan/claude-skills

Installs: 2.1K
GitHub Stars: 9.0K
First Seen: Jan 21, 2026