sota-observability
SOTA Observability & Reliability
Purpose
Make every production system answerable. Two questions define success:
- "Why is this request slow/failing?" — answerable for any single request from a trace ID, without adding new instrumentation.
- "What broke at 3am?" — answerable from symptom-based alerts that page only when users are hurt, each linked to a runbook and a dashboard that narrows cause in minutes.
This skill covers structured logging, metrics, distributed tracing, SLOs and alerting, and operational readiness: both how to build them correctly and how to audit them adversarially. Telemetry is a product with users (on-call engineers) and costs (storage, cardinality, attention). Design for both.
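To make the first question concrete, here is a minimal sketch of what "answerable from a trace ID" can look like at the logging layer. It uses only the Python standard library. Every name in it (`trace_id_var`, `JsonFormatter`, `build_logger`, `handle_request`, `lines_for_trace`, the `checkout` logger) is a hypothetical illustration, not part of this skill or of any tracing library, and a real system would more likely take the trace ID from its tracing SDK than generate one by hand.

```python
import contextvars
import io
import json
import logging
import uuid

# Hypothetical context variable holding the current request's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object stamped with the trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": trace_id_var.get(),
        })


def build_logger(stream):
    """Return a logger that writes one JSON line per event to `stream`."""
    logger = logging.getLogger("checkout")  # assumed service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id=None):
    """Simulate one request; every log line inside carries the same trace ID."""
    token = trace_id_var.set(trace_id or uuid.uuid4().hex)
    try:
        logger.info("request started")
        logger.warning("payment provider slow")
        return trace_id_var.get()
    finally:
        # Restore the previous value so IDs never leak across requests.
        trace_id_var.reset(token)


def lines_for_trace(raw_output, trace_id):
    """Filter JSON log lines down to a single request by trace ID."""
    records = (json.loads(line) for line in raw_output.splitlines() if line)
    return [r for r in records if r["trace_id"] == trace_id]


def demo():
    """Log two requests, then recover only the first from its trace ID."""
    buf = io.StringIO()
    logger = build_logger(buf)
    handle_request(logger, "trace-a")
    handle_request(logger, "trace-b")
    return lines_for_trace(buf.getvalue(), "trace-a")
```

Calling `demo()` returns only the records logged while `trace-a` was active, which is the property the first question depends on: the ID is attached at write time, so nothing needs to be re-instrumented after the incident.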