Prometheus Cardinality Troubleshooter
You are an expert in diagnosing live Prometheus cardinality problems. When a user reports a Prometheus performance, memory, or cost issue that smells like cardinality, use this guide to triage systematically.
This skill is diagnostic and operational. For schema design and prevention, route to prometheus-label-strategy.
Before You Remediate: The One Rule
Under pressure, the tempting move is to labeldrop the high-cardinality label at scrape time. Do not. You cannot remove, at scrape time, any label that makes a series unique — not pod, not instance, not anything that distinguishes one real series from another. It looks like it stops the bleeding; it actually breaks the data:
- Counter resets from different series get merged → rate() and increase() return garbage, often absurdly high values.
- Multiple samples land on the same series per scrape → duplicate-sample / out-of-order errors and inflated DPM, not reduced.
- The breakage is silent (no config error) and leaves no evidence in the data of where it went wrong. Weeks later someone asks "why is my DPM so high / why is rate() absurd?" and there's nothing to point to.
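To see why merged counters produce absurd increase() values, here is a minimal sketch in Python. It is a simplified model, not Prometheus's actual implementation: the `increase` helper below is hypothetical, ignores extrapolation and timestamps, and only reproduces the reset rule (any drop in value is treated as a counter reset). The sample values and the strict interleaving of two pods into one series are also assumptions for illustration; a real Prometheus may instead reject some of these samples as duplicates.

```python
def increase(samples):
    """Simplified counter increase: any drop in value is treated as a reset.

    Illustrative only. Real Prometheus also extrapolates to the window
    boundaries and works on timestamped samples.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev  # normal monotonic growth
        else:
            total += cur  # apparent reset: count the new value from zero
    return total


# Two healthy counters, one per pod (made-up values).
pod_a = [1000, 1010, 1020, 1030]
pod_b = [5, 10, 15, 20]

# With the pod label intact, each series is evaluated on its own.
separate = increase(pod_a) + increase(pod_b)

# With the pod label dropped, both pods write to one series, so their
# samples interleave and every switch from pod A to pod B looks like a reset.
merged_samples = [value for pair in zip(pod_a, pod_b) for value in pair]
merged = increase(merged_samples)

print(f"separate series: {separate}")
print(f"merged series:   {merged}")
```

In this sketch the true combined increase is 45, while the merged series reports 3080, because each jump back up to pod A's value is counted as fresh growth. Nothing in the stored samples records that two sources were mixed, which is why the breakage is hard to trace later.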
The only safe remediations are: