investigating-metric-anomalies
Investigating metric anomalies
The job: go from a metric symptom ("ingestion lag is rising") to a probable cause with evidence, fast. The metric tells you what and when; logs and traces tell you why. Follow the loop below — it front-loads the cheap, high-information calls and only fans out when the blast radius is unclear.
The loop
1. Pin down the metric
If you have the exact metric name, skip ahead. Otherwise call metric-names-list with a substring from the symptom (lag, error, latency, queue). The returned metric_type decides the lens: counters (sum) are only meaningful as rate/increase, gauges as avg, histograms as histogram_quantile.
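The type-to-lens rule can be sketched as a small lookup. This is a minimal illustration, not the tool's actual behavior: the helper name query_for, the metric_type strings ("sum", "gauge", "histogram"), the 0.95 quantile, and the _bucket suffix (a Prometheus naming convention) are all assumptions, and the output is a PromQL-style expression string.

```python
def query_for(metric_name: str, metric_type: str, window: str = "5m") -> str:
    """Return a PromQL-style expression suited to the metric type.

    Hypothetical helper: the metric_type strings are assumed values,
    not a documented enum of metric-names-list.
    """
    if metric_type == "sum":
        # Counters only grow, so the raw value says little; look at the rate.
        return f"sum(rate({metric_name}[{window}]))"
    if metric_type == "gauge":
        # Gauges are point-in-time values; averaging across series is meaningful.
        return f"avg({metric_name})"
    if metric_type == "histogram":
        # Assumes Prometheus-style bucket series named <metric>_bucket.
        return (
            "histogram_quantile(0.95, "
            f"sum by (le) (rate({metric_name}_bucket[{window}])))"
        )
    raise ValueError(f"unknown metric_type: {metric_type!r}")


if __name__ == "__main__":
    print(query_for("ingest_requests_total", "sum"))
    print(query_for("ingest_lag_seconds", "gauge"))
    print(query_for("ingest_latency_seconds", "histogram", window="10m"))
```

The metric names in the usage lines are placeholders; substitute whatever metric-names-list actually returns.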
2. Characterize first — one call, three answers
Call characterize-metric-anomaly with the metric name and anomalyFrom (the alert fire time, or when the user says it started looking wrong; subtract some margin if unsure). It compares against the preceding window by default and answers:
- How bad: direction, change_ratio, anomaly_peak vs baseline_mean. If direction is flat, your window or metric is wrong: widen the window, or compare against the same window yesterday via baselineFrom/baselineTo (daily-pattern metrics often look "anomalous" against the immediately preceding hours).
- When: onset_time. Treat this timestamp as the pivot for everything that follows.
- Where: top_movers, the label values whose behavior changed. One mover (a single pod, shard, or endpoint) means a localized culprit; everything moving together means a shared cause (an upstream dependency, a deploy, infra).
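As a rough sketch of how these three answers might drive the next step, the snippet below builds a same-window-yesterday baseline and classifies the blast radius. Everything about the shapes is an assumption: the result is treated as a plain dict with direction and top_movers keys, timestamps are sent as ISO 8601 strings, and series_count (the total number of label values for the metric) is a hypothetical input that the tool may not expose.

```python
from datetime import datetime, timedelta, timezone


def yesterday_baseline(anomaly_from: datetime, anomaly_to: datetime) -> dict:
    """Shift the anomaly window back 24 hours for baselineFrom/baselineTo.

    Assumes the tool accepts ISO 8601 timestamps; check the real schema.
    """
    day = timedelta(days=1)
    return {
        "baselineFrom": (anomaly_from - day).isoformat(),
        "baselineTo": (anomaly_to - day).isoformat(),
    }


def classify(result: dict, series_count: int) -> str:
    """Turn a characterize result into a coarse next-step label.

    `result` is an assumed dict shape; `series_count` is hypothetical.
    """
    if result["direction"] == "flat":
        # Nothing moved: the window or the metric is probably wrong.
        return "recheck-window"
    movers = result.get("top_movers", [])
    if len(movers) == 1:
        # A single pod, shard, or endpoint changed behavior.
        return "localized"
    if movers and len(movers) >= series_count:
        # Every label value moved together: look for a shared cause.
        return "shared"
    return "unclear"


if __name__ == "__main__":
    start = datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc)
    print(yesterday_baseline(start, start + timedelta(hours=1)))
    print(classify({"direction": "up", "top_movers": ["pod-7"]}, series_count=12))
```

The "unclear" branch is deliberate: a handful of movers out of many fits neither the localized nor the shared pattern, so it is better surfaced than forced into one of them.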