observability-designer


Observability Designer

The agent designs production-ready observability strategies that combine the three pillars (metrics, logs, traces) with SLI/SLO frameworks, golden signals monitoring, and alert optimization.

Workflow

  1. Catalogue services -- List every service in scope with its type (request-driven, pipeline, storage), criticality tier (T1-T3), and owning team. Validate that at least one T1 service exists before proceeding.
  2. Define SLIs per service -- For each service, select SLIs from the Golden Signals table. Map each SLI to a concrete Prometheus/InfluxDB metric expression.
  3. Set SLO targets -- Assign SLO targets based on criticality tier and user expectations. Calculate the corresponding error budget (e.g., a 99.9% target leaves about 43.8 min of allowed downtime in an average 30.44-day month).
  4. Design burn-rate alerts -- Create multi-window burn-rate alert rules for each SLO. Validate that every alert has a clear runbook link and response action.
  5. Build dashboards -- Generate dashboard specs following the hierarchy: Overview > Service > Component > Instance. Cap each screen at 7 panels. Include SLO target reference lines.
  6. Configure log aggregation -- Define structured log format, set log levels, assign correlation IDs, and configure retention policies per tier.
  7. Instrument traces -- Set up distributed tracing with sampling strategy (head-based for dev, tail-based for production). Define span boundaries at service and database call points.
  8. Validate coverage -- Confirm every T1 service has metrics, logs, and traces. Confirm every alert has a runbook. Confirm dashboard load time is under 2 seconds.
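
Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the names `error_budget_minutes`, `burn_rate`, `BurnRateRule`, and `firing_rules` are hypothetical, and the window/threshold pairs are assumed values borrowed from the common multi-window pattern for a 30-day SLO; tune them to the actual SLO period.

```python
from dataclasses import dataclass

# Average month length in minutes (365.25 days / 12), i.e. 43,830 minutes.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def error_budget_minutes(slo_target: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Allowed downtime for an SLO target given as a fraction (0.999 for 99.9%)."""
    return (1.0 - slo_target) * period_minutes


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1.0 - slo_target)


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # window that establishes significance
    short_window: str  # window that confirms the problem is still ongoing
    threshold: float   # burn rate both windows must reach
    severity: str


# Assumed example rules for a 30-day SLO; not taken from the source document.
RULES = (
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
)


def firing_rules(error_ratios: dict, slo_target: float) -> list:
    """Return the rules whose long AND short windows both reach the threshold.

    error_ratios maps a window label (e.g. "1h") to the observed error ratio
    over that window.
    """
    return [
        rule for rule in RULES
        if burn_rate(error_ratios[rule.long_window], slo_target) >= rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo_target) >= rule.threshold
    ]
```

For example, `error_budget_minutes(0.999)` comes out at roughly 43.8 minutes, matching the figure in step 3. Requiring both windows to exceed the threshold is what keeps a short, already-recovered spike from paging anyone.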

SLI/SLO Quick Reference

| SLI Type | Metric Expression (Prometheus) | Typical SLO |
|---|---|---|
| Availability | 1 - (sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) | 99.9% |
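
The availability expression is one minus the ratio of the 5xx request rate to the total request rate. The same arithmetic, as a small Python sketch for checking values by hand; the function name `availability` and the choice to return `None` when there is no traffic are assumptions made for illustration.

```python
from typing import Optional


def availability(total_rate: float, error_rate: float) -> Optional[float]:
    """Availability SLI: 1 - (5xx request rate / total request rate).

    Returns None when there is no traffic, since the ratio is undefined
    and should be reported as "no data" rather than as 0% or 100%.
    """
    if total_rate <= 0:
        return None
    return 1.0 - error_rate / total_rate
```

With 2000 requests per second overall and 1 per second returning a 5xx code, the result is approximately 0.9995, which sits above a 99.9% target.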
First Seen
Feb 28, 2026