Observability Designer
The agent designs production-ready observability strategies that combine the three pillars (metrics, logs, traces) with SLI/SLO frameworks, golden signals monitoring, and alert optimization.
Workflow
- Catalogue services -- List every service in scope with its type (request-driven, pipeline, storage), criticality tier (T1-T3), and owning team. Validate that at least one T1 service exists before proceeding.
- Define SLIs per service -- For each service, select SLIs from the Golden Signals table. Map each SLI to a concrete Prometheus/InfluxDB metric expression.
- Set SLO targets -- Assign SLO targets based on criticality tier and user expectations. Calculate the corresponding error budget (e.g., a 99.9% target allows about 43.8 minutes of unavailability in an average month).
- Design burn-rate alerts -- Create multi-window burn-rate alert rules for each SLO. Validate that every alert has a clear runbook link and response action.
- Build dashboards -- Generate dashboard specs following the hierarchy: Overview > Service > Component > Instance. Cap each screen at 7 panels. Include SLO target reference lines.
- Configure log aggregation -- Define structured log format, set log levels, assign correlation IDs, and configure retention policies per tier.
- Instrument traces -- Set up distributed tracing with sampling strategy (head-based for dev, tail-based for production). Define span boundaries at service and database call points.
- Validate coverage -- Confirm every T1 service has metrics, logs, and traces. Confirm every alert has a runbook. Confirm dashboard load time is under 2 seconds.
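The error-budget and burn-rate steps in the workflow above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's own implementation: the function names, the `BurnRateRule` type, and the window/threshold pairs (14.4 over 1h/5m, 6 over 6h/30m) are assumptions chosen as commonly cited starting points, and the error ratios are expected to come from whatever metrics backend is in use.

```python
from dataclasses import dataclass

# Average month length in minutes (365.25 days / 12); an assumption of this sketch.
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def error_budget_minutes(slo_target: float,
                         period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Minutes of full unavailability that a given SLO target allows per period."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * period_minutes


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Ratio of the observed error rate to the rate that would exactly spend the budget."""
    return error_ratio / (1.0 - slo_target)


@dataclass(frozen=True)
class BurnRateRule:
    """Hypothetical rule shape: one threshold checked over a long and a short window."""
    name: str
    threshold: float
    long_window: str
    short_window: str


# Illustrative thresholds and windows; tune them per service and tier.
RULES = (
    BurnRateRule("fast_burn", 14.4, "1h", "5m"),
    BurnRateRule("slow_burn", 6.0, "6h", "30m"),
)


def firing_rules(error_ratios: dict, slo_target: float) -> list:
    """Return names of rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule.name)
    return fired
```

Requiring both windows to exceed the threshold is what keeps a multi-window alert from firing on a brief spike (short window only) or from staying active long after recovery (long window only).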
SLI/SLO Quick Reference
| SLI Type | Metric Expression (Prometheus) | Typical SLO |
|---|---|---|
| Availability | `1 - (sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])))` | 99.9% |
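As a rough cross-check of the availability expression in the table, the same ratio can be computed from raw request counts in Python. The function names and the handling of zero traffic below are assumptions made for this sketch, not part of the skill itself.

```python
from typing import Optional


def availability(errors_5xx: float, total_requests: float) -> Optional[float]:
    """Return 1 - (5xx requests / all requests), or None when there was no traffic."""
    if total_requests <= 0:
        # With no requests the ratio is undefined; report "no data" rather than guess.
        return None
    return 1.0 - errors_5xx / total_requests


def meets_slo(sli_value: float, slo_target: float = 0.999) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_value >= slo_target
```

For example, 5 failed requests out of 10,000 gives an availability of 0.9995, which satisfies a 99.9% target, while 20 failures out of 10,000 (0.998) does not.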