prometheus-cardinality-troubleshooter

Prometheus Cardinality Troubleshooter

You are an expert in diagnosing live Prometheus cardinality problems. When a user reports a Prometheus performance, memory, or cost issue that smells like cardinality, use this guide to triage systematically.

This skill is diagnostic and operational. For schema design and prevention, route to prometheus-label-strategy.


Before You Remediate: The One Rule

Under pressure, the tempting move is to labeldrop the high-cardinality label at scrape time. Do not. Never remove, at scrape time, a label that makes a series unique: not pod, not instance, not anything that distinguishes one real series from another. It looks like it stops the bleeding, but it actually corrupts the data:

  • Counter resets from different series get merged → rate() and increase() return garbage, often absurdly high values.
  • Multiple samples land on the same series per scrape → duplicate-sample / out-of-order errors, and DPM (data points per minute) goes up, not down.
  • The breakage is silent (no config error) and leaves no evidence in the data of where it went wrong. Weeks later someone asks "why is my DPM so high / why is rate() absurd?" and there's nothing to point to.
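
To make the first failure mode concrete, here is a minimal Python sketch of what happens to an increase() calculation when two counters collapse into one series. It is an illustration under stated assumptions, not Prometheus's actual implementation: the `increase` helper is a simplified stand-in that only models counter-reset handling (no extrapolation), the two pods and their sample values are hypothetical, and it assumes samples from both sources are accepted in interleaved order.

```python
from typing import List


def increase(samples: List[float]) -> float:
    """Simplified counter increase: any drop in value is treated as a reset.

    This mirrors the reset-handling idea behind rate()/increase(), but it is
    an assumption-laden sketch and skips extrapolation entirely.
    """
    total = 0.0
    prev = samples[0]
    for cur in samples[1:]:
        if cur < prev:
            # Value went down: assume the counter restarted from zero,
            # so the whole current value counts as new increase.
            total += cur
        else:
            total += cur - prev
        prev = cur
    return total


# Two hypothetical pods, each a well-behaved counter, scraped four times.
pod_a = [1000.0, 1010.0, 1020.0, 1030.0]  # real increase: 30
pod_b = [50.0, 55.0, 60.0, 65.0]          # real increase: 15

# With the distinguishing label dropped, both pods share one series
# identity, so their samples interleave on a single timeline.
merged = [value for pair in zip(pod_a, pod_b) for value in pair]

true_increase = increase(pod_a) + increase(pod_b)
merged_increase = increase(merged)

print(f"separate series: {true_increase}")   # 45.0
print(f"merged series:   {merged_increase}")  # 3125.0
```

Every switch from the high counter to the low one looks like a reset, and every switch back looks like a huge jump, so the merged result is roughly 70 times the real increase in this toy case. Nothing in the stored samples records that two sources were merged, which is why the problem is so hard to trace afterwards.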

The only safe remediations are:

Repository
grafana/skills
First Seen
May 28, 2026