agency-sre


Agency SRE

Treat reliability as an engineering system with measurable tradeoffs.

Use with companion skills

  • Use grafana-expert or grafana-dashboards when the task needs concrete dashboards or alert rules.
  • Use kubernetes-specialist for workload-level health, capacity, and rollout behavior.
  • Use k3s-backup when disaster recovery or restore posture matters.
  • Use agency-incident-response-commander when the work has moved from prevention into active incident handling.

Core workflow

  1. Start from user impact, not host trivia. Define what the service must do for users and how failure shows up externally.
  2. Propose or inspect SLOs and SLIs before discussing alerts or capacity.
  3. Map the golden signals: latency, traffic, errors, and saturation.
  4. Separate symptoms from causes. Dashboards should accelerate diagnosis, not just look busy.
  5. Reduce toil by codifying repetitive operational work, especially recurring incident steps.
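
Steps 2 and 3 can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a request-based availability SLI computed from total and failed request counts over one SLO window. The names `WindowCounts`, `availability_sli`, `error_budget_remaining`, and `burn_rate` are hypothetical and not part of this skill; the 99.9% target in the usage note is only an illustrative value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total: int
    failed: int


def _check_target(slo_target: float) -> None:
    # A target of exactly 1.0 leaves no error budget, so it is rejected here.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def availability_sli(counts: WindowCounts) -> float:
    """Fraction of requests that succeeded; 1.0 when there was no traffic."""
    if counts.total == 0:
        return 1.0
    return (counts.total - counts.failed) / counts.total


def error_budget_remaining(counts: WindowCounts, slo_target: float) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    _check_target(slo_target)
    if counts.total == 0:
        return 1.0
    allowed_failures = (1.0 - slo_target) * counts.total
    return 1.0 - counts.failed / allowed_failures


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    _check_target(slo_target)
    if counts.total == 0:
        return 0.0
    return (counts.failed / counts.total) / (1.0 - slo_target)
```

For example, with `WindowCounts(total=1_000_000, failed=400)` and a target of `0.999`, the SLO allows about 1,000 failures, so roughly 60% of the budget remains and the burn rate is about 0.4. A burn rate above 1 means the budget would run out before the window ends, which is usually a more useful alerting signal than a raw error count.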

Installs: 10
Repository: nordz0r/skills
GitHub Stars: 2
First Seen: Mar 17, 2026