Production Monitoring and Observability

This skill encodes battle-tested observability patterns for production services. Every recommendation comes from real incidents — the ones where you stared at a dashboard that showed nothing useful while users were screaming. Observability is not a feature you bolt on after launch. It is the foundation you build on from day one.

1. The Three Pillars of Observability

Observability is not "having logs." It is the ability to ask arbitrary questions about your system's behavior without deploying new code. The three pillars work together — none is sufficient alone.

Pillar	What It Tells You	Example
Logs	What happened — discrete events with context	"User X login failed: expired token"
Metrics	How the system behaves now — aggregated numbers over time	"p99 latency is 450ms and rising"
Traces	Why something is slow — a request's journey across services	"Postgres query in user-service took 2.3s"

How they connect: An alert fires on a metric (error rate > 1%). You filter logs to the alert's time window to see which errors occurred. You take a trace ID from one of those log entries and follow the trace to the slow service. You fix the cause and verify that the metric recovers. Without all three, you are flying blind.
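The workflow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production setup: the in-memory `LOGS`, `COUNTERS`, and `SPANS` stores, the `handle_request` helper, and the metric and service names are all hypothetical stand-ins for a real log backend, metrics backend, and tracing backend. The point it tries to show is that a shared trace ID is what lets you move from a metric to a log entry to a trace.

```python
import time
import uuid
from collections import defaultdict

# Hypothetical in-memory stand-ins for the three backends (assumptions).
LOGS = []                     # structured log events (dicts)
COUNTERS = defaultdict(int)   # metric name -> count
SPANS = defaultdict(list)     # trace_id -> [(service, operation, seconds)]


def handle_request(user, fail=False, db_seconds=0.01):
    """Simulate one request and record it in all three pillars."""
    trace_id = uuid.uuid4().hex
    COUNTERS["requests_total"] += 1
    # Trace: one span per hop the request makes.
    SPANS[trace_id].append(("api-gateway", "route", 0.002))
    SPANS[trace_id].append(("user-service", "postgres_query", db_seconds))
    if fail:
        COUNTERS["errors_total"] += 1
        # Log: a discrete event that carries the trace ID for correlation.
        LOGS.append({
            "ts": time.time(),
            "level": "ERROR",
            "msg": f"request for {user} timed out",
            "trace_id": trace_id,
        })
    return trace_id


def error_rate():
    """Metric: fraction of requests that failed."""
    total = COUNTERS["requests_total"]
    return COUNTERS["errors_total"] / total if total else 0.0


def errors_between(start, end):
    """Logs: error events inside a time window."""
    return [e for e in LOGS if e["level"] == "ERROR" and start <= e["ts"] <= end]


def slowest_span(trace_id):
    """Trace: the hop that took the longest for this request."""
    return max(SPANS[trace_id], key=lambda span: span[2])


# Simulated traffic: 98 healthy requests and 2 slow, failing ones.
window_start = time.time() - 1
for i in range(98):
    handle_request(f"user{i}")
handle_request("alice", fail=True, db_seconds=2.3)
handle_request("bob", fail=True, db_seconds=2.3)
window_end = time.time() + 1

if error_rate() > 0.01:                                   # 1. alert fires on a metric
    events = errors_between(window_start, window_end)     # 2. filter logs by window
    trace_id = events[0]["trace_id"]                      # 3. take a trace ID from a log
    service, operation, seconds = slowest_span(trace_id)  # 4. follow the trace
    print(f"{operation} in {service} took {seconds}s")
```

In a real system each store would be a separate backend, and the trace ID would typically be propagated across services in request headers rather than generated inside a single function.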
