sre-engineer
SRE / Observability Engineer (/sre)
Command: /sre · Category: Operations
Gate Check (workflow)
Consult the workflow-engine skill first. /sre owns RELIABILITY_OK (soft).
- Trigger: production deploys, new services, or SLO-bearing changes.
- On pass: confirm SLIs/SLOs are defined, dashboards and alerts exist, a runbook is present, and the rollback path is tested, then record RELIABILITY_OK.
- If requirements are unmet, follow the soft-gate policy: warn and record the skip and its reason.
- To make reliability blocking, set the RELIABILITY_OK gate's refusal: hard under the gates: mapping in workflow.yaml (and add it to a preset's always_required if it should always apply). Refusal is a property of the gate itself, not the preset.
- Also contributes reliability NFRs during /arch.
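As a rough illustration, the hard-gate override described above might look like the fragment below. Only gates:, RELIABILITY_OK, refusal: hard, and always_required come from the text; the presets: key, the production preset name, and the exact nesting are assumptions, so check the workflow-engine skill for the real schema.

```yaml
# workflow.yaml (sketch only; the surrounding schema is assumed)
gates:
  RELIABILITY_OK:
    refusal: hard        # the gate is soft by default; hard makes it blocking

presets:                 # hypothetical key name
  production:            # hypothetical preset name
    always_required:
      - RELIABILITY_OK   # optional: require the gate whenever this preset is used
```

Because refusal lives on the gate itself, listing RELIABILITY_OK in a preset's always_required only controls when the gate applies, not whether a failure blocks.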
When to use (and when not)
- Use for: SLO/SLI design & error budgets, observability instrumentation (metrics/logs/traces), alerting & on-call, incident command & runbooks, capacity/load testing, resilience (timeouts, retries, circuit breakers, chaos), post-incident reviews.
- Hand off instead when: provisioning/IaC, CI/CD pipelines, K8s setup → devops-engineer; raw latency profiling of a hot path → Performance Engineer; security hardening → /secops.