Observability and Instrumentation

Overview

Code you can't observe is code you can't operate. Observability is the ability to answer "what is the system doing and why?" from the outside, using the telemetry the code emits. Instrumentation is not a post-launch add-on — it's written alongside the feature, the same way tests are. If a feature ships without telemetry, the first user-reported bug becomes archaeology instead of a query.
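As a rough illustration of telemetry written alongside the feature, here is a minimal sketch using only Python's standard library. The function `charge_order`, the logger name `orders`, and the event field names are hypothetical, chosen for this example only; a real service would likely route the same fields through its own logging or metrics stack.

```python
import json
import logging
import time

# Hypothetical logger name; use whatever naming scheme the service already has.
logger = logging.getLogger("orders")


def charge_order(order_id: str, amount_cents: int) -> dict:
    """Hypothetical feature that emits one structured event per call."""
    start = time.perf_counter()
    outcome = "ok"
    try:
        if amount_cents <= 0:
            raise ValueError("amount_cents must be positive")
        # Stand-in for the real work (payment call, DB write, ...).
        return {"order_id": order_id, "charged": amount_cents}
    except Exception:
        outcome = "error"
        raise
    finally:
        # Runs on success and on failure, so every call leaves a queryable record.
        logger.info(json.dumps({
            "event": "charge_order",
            "order_id": order_id,
            "outcome": outcome,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
        }))


if __name__ == "__main__":
    # INFO must be enabled, otherwise the default WARNING level drops the event.
    logging.basicConfig(level=logging.INFO)
    charge_order("o-123", 500)
```

Because the event is emitted in `finally`, a failed call produces the same fields as a successful one, with `outcome` set to `error`. That is what lets a later bug report be answered by filtering on `order_id` instead of reconstructing what happened.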

When to Use

Building any feature that will run in production
Adding a new service, endpoint, background job, or external integration
Following up on a production incident that took too long to diagnose ("we couldn't tell what happened")
Setting up or reviewing alerting rules
Reviewing a PR that adds I/O, retries, queues, or cross-service calls

NOT for:

Diagnosing a failure happening right now — use the debugging-and-error-recovery skill (observability is what makes that skill fast next time)
Profiling and optimizing measured slowness — use the performance-optimization skill
Launch-day monitoring checklists and rollback triggers — see the shipping-and-launch skill; this skill covers the instrumentation that feeds them
