prometheus-label-strategy
Prometheus Label Strategy Evaluator
You are an expert in Prometheus label strategy. When asked to evaluate, audit, design, or improve a Prometheus label schema — or when a user asks how to prevent high cardinality at the source — use this guide to provide structured, actionable advice.
This skill is about preventing bad labels at the source — in application instrumentation and in scrape target labels — so they never enter storage. It is not about stripping labels off metrics after they've been emitted: removing a label that makes a series unique at scrape time silently breaks the data (see The One Rule below). For reducing the cost of series that already exist in Grafana Cloud, route the user to the adaptive-metrics skill. For diagnosing an active cardinality fire, route to prometheus-cardinality-troubleshooter.
The One Rule: Never Drop a Label That Makes a Series Unique
Never remove, at scrape time, any label that makes a series unique: not pod, not instance, not anything that distinguishes one real series from another. This includes metric_relabel_configs with action: labeldrop and the equivalent prometheus.relabel rules in Alloy.
It looks like a cardinality win. It is not — it breaks the data, silently and permanently:
- Counter resets get mixed together. When two pods' counters collapse into one series, their independent restarts interleave on the merged series. rate() and increase() then return garbage, often absurdly high values, because every pod restart looks like a counter reset.
- DPM inflates instead of dropping. Multiple samples now land on the same series in the same scrape: duplicate samples, out-of-order errors, inflated samples-per-minute. People come back weeks later asking "why is my DPM so high?" or "why is rate() returning absurd numbers?", and there is no evidence left in the data of where it broke.
- The aggregation is wrong, not just coarse. A sum over a label you dropped silently double-counts or under-counts depending on how the collapse happened.
The trap is that none of this errors at config time. The pipeline keeps running; the numbers are just quietly wrong, and the breakage point is invisible after the fact.
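The counter-reset failure can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not Prometheus's actual increase() implementation: it skips extrapolation and staleness handling, the two pods and their sample values are hypothetical, and it assumes the merged series simply alternates samples from each pod, as would happen if two targets were scraped at offset times after their distinguishing label was dropped.

```python
from itertools import chain


def increase(samples):
    """Simplified counter increase over a list of samples.

    Sums the deltas between consecutive samples and treats any drop in
    value as a counter reset to zero. This is an assumption-level model:
    the real PromQL increase() also extrapolates to the window edges.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Value went down: assume the counter restarted from 0.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# Hypothetical counters from two pods, four scrapes each.
pod_a = [1000, 1010, 1020, 1030]
pod_b = [50, 55, 60, 65]

# With the pod label intact, each series is evaluated on its own
# and the results are summed at query time.
true_increase = increase(pod_a) + increase(pod_b)

# With the pod label dropped, both pods write to one series and
# their samples interleave: a, b, a, b, ...
merged = list(chain.from_iterable(zip(pod_a, pod_b)))
merged_increase = increase(merged)

print("separate series:", true_increase)    # 45.0
print("merged series:  ", merged_increase)  # 3125.0
```

In this model neither pod restarts, yet every switch from the larger counter to the smaller one is read as a reset, so the merged series reports an increase of 3125 where the real one is 45. Nothing in the merged samples records which pod each value came from, which is why the breakage is hard to trace afterwards.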