Investigating metric anomalies

The job: go from a metric symptom ("ingestion lag is rising") to a probable cause with evidence, fast. The metric tells you what and when; logs and traces tell you why. Follow the loop below: it front-loads the cheap, high-information calls and fans out only when the blast radius is unclear.

The loop

1. Pin down the metric

If you have the exact metric name, skip ahead. Otherwise call metric-names-list with a substring from the symptom (lag, error, latency, queue). The returned metric_type decides the lens: counters (sum) are only meaningful as rate/increase, gauges as avg, histograms as histogram_quantile.
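
As an illustration of how metric_type picks the lens, here is a minimal sketch that maps a type to a query expression. It assumes a PromQL-compatible backend, that metric_type comes back as the strings "sum", "gauge", or "histogram", and that histograms expose a `_bucket` series with an `le` label. The helper name `query_for`, the 5m window, and the 0.95 quantile are illustrative choices, not part of the tool.

```python
def query_for(metric_name: str, metric_type: str, window: str = "5m") -> str:
    """Return a PromQL-style expression suited to the metric's type.

    Hypothetical helper: the metric_type strings are assumed values.
    """
    if metric_type == "sum":
        # Counter: the raw value only grows, so look at its rate of change.
        return f"sum(rate({metric_name}[{window}]))"
    if metric_type == "gauge":
        # Gauge: the current value is meaningful, so average across series.
        return f"avg({metric_name})"
    if metric_type == "histogram":
        # Histogram: estimate a quantile from per-bucket rates
        # (assumes classic Prometheus-style _bucket series).
        return (
            f"histogram_quantile(0.95, "
            f"sum by (le) (rate({metric_name}_bucket[{window}])))"
        )
    raise ValueError(f"unknown metric_type: {metric_type!r}")
```

For example, `query_for("http_requests_total", "sum")` yields `sum(rate(http_requests_total[5m]))`, while the same call with "gauge" would average the raw series instead.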

2. Characterize first — one call, three answers

Call characterize-metric-anomaly with the metric name and anomalyFrom (the alert fire time, or when the user says it started looking wrong; subtract some margin if unsure). It compares against the preceding window by default and answers:

How bad: direction, change_ratio, anomaly_peak vs baseline_mean. If direction is flat, the window or the metric is probably wrong. Widen the window, or compare against the same window yesterday via baselineFrom/baselineTo (metrics with a daily pattern often look "anomalous" against the immediately preceding hours).
When: onset_time — treat this timestamp as the pivot for everything that follows.
Where: top_movers — label values whose behavior changed. One mover (a single pod, shard, or endpoint) means a localized culprit; everything moving together means a shared cause (an upstream dependency, a deploy, infra).
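
The two decisions above (which baseline to use, and whether the movers point to one culprit) can be sketched as follows. This assumes the result arrives as a dict keyed by the field names mentioned here (direction, top_movers) and that timestamps are passed as ISO 8601 strings; the real response shape and timestamp format may differ. The function names, the `anomaly_to` parameter, and the return labels are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def yesterday_baseline(anomaly_from: datetime, anomaly_to: datetime) -> dict:
    """Shift the anomaly window back 24 hours for a same-time-yesterday baseline."""
    day = timedelta(days=1)
    return {
        "baselineFrom": (anomaly_from - day).isoformat(),
        "baselineTo": (anomaly_to - day).isoformat(),
    }


def triage(result: dict) -> str:
    """Classify a characterization result (assumed to be a plain dict)."""
    if result.get("direction") == "flat":
        # Nothing moved: revisit the window or the metric before going further.
        return "recheck-window"
    movers = result.get("top_movers", [])
    if len(movers) == 1:
        # A single pod, shard, or endpoint changed: localized culprit.
        return "localized"
    # Several (or no distinct) movers: suspect a shared cause.
    return "broad"


window = yesterday_baseline(
    datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc),
    datetime(2024, 5, 2, 11, 0, tzinfo=timezone.utc),
)
```

A "broad" result is only a hint toward a shared cause such as an upstream dependency, a deploy, or infrastructure; it still needs confirming against logs and traces around onset_time.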

3. Sharpen with targeted metric queries