loki-best-practices

Production-grade Grafana Loki operations: diagnosis, performance, storage, HA, and observability. Assume the user is a senior engineer running Loki in Kubernetes: skip introductions to LogQL and chunks and go straight to an evidence-driven SRE workflow.

When to use

Trigger on operational Loki tasks:

  • Ingest incidents: HTTP 429/4xx from distributor, loki_discarded_samples_total rising, push retries from Promtail/Alloy/OTel, "rate limit exceeded", "stream limit", "out of order", "too far behind", "line too long", "invalid labels".
  • WAL & flush failures: ingester CrashLoopBackOff during replay, loki_ingester_wal_replay_active stuck at 1, loki_ingester_chunks_flushed_total flat while ingest is hot, OOMKilled on startup, full WAL PVC.
  • Deployment & config: Helm chart singleBinary vs simple-scalable mode collisions, "schema period must be 24h", TSDB migration that hides pre-migration data, missing object-storage backend on scalable targets, replication-factor vs ingester-count mismatch.
  • Integration breakage: Grafana datasource "No data" with active streams, multi-tenant X-Scope-OrgID returning partial results, Promtail positions lost on restart, OTel structured metadata rejected, OTel endpoint misconfigured (/loki/api/v1/push/v1/logs 404), memberlist DNS failures, NetworkPolicies blocking gossip/HTTP/gRPC.
  • Query performance: "too many outstanding requests", max_query_length exceeded, max_query_series reached, high-cardinality label explosion, expensive |~ regex without prior |= line filter, sharding/parallelism tuning.
  • Storage: S3 WebIdentityErr / IRSA failures, S3 lifecycle deleting active chunks, compactor stuck or running as two replicas, context deadline exceeded on object storage, index-gateway slow chunk reads.
  • HA: UNHEALTHY ring entries after node loss, ungraceful shutdown leaving stranded chunks, schema bump during rolling upgrade, two compactors running simultaneously, ruler firing in duplicate or not at all, memberlist IP reuse / ghost members.
  • Observability of Loki itself: which metrics to alert on, what to scrape, what cache hit ratios to target.
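One of the query-performance triggers above, a `|~` regex with no `|=` line filter ahead of it, can be sketched as a small lint. This is a rough heuristic rather than a LogQL parser: the function name `regex_before_line_filter` is hypothetical, and it only blanks out quoted strings and the first stream selector before comparing operator positions.

```python
import re


def regex_before_line_filter(query: str) -> bool:
    """Return True if a |~ regex filter appears with no |= filter before it.

    Hypothetical helper for illustration; string-based, not a real parser.
    """
    # Blank out quoted and backtick strings so operators inside patterns
    # (for example the "|" in "timeout|refused") are ignored.
    stripped = re.sub(r'"(?:\\.|[^"\\])*"|`[^`]*`', '""', query)
    # Blank out the first stream selector: its matchers (=~, !~, !=) are
    # label matchers, not line filters.
    stripped = re.sub(r'\{[^}]*\}', '{}', stripped, count=1)

    regex_pos = stripped.find('|~')
    if regex_pos == -1:
        return False  # no regex line filter at all
    contains_pos = stripped.find('|=')
    # Flag the query when no |= exists or the first |= comes after the |~.
    return contains_pos == -1 or contains_pos > regex_pos


if __name__ == "__main__":
    print(regex_before_line_filter('{app="api"} |~ "timeout|refused"'))
    print(regex_before_line_filter('{app="api"} |= "error" |~ "timeout|refused"'))
```

A query flagged by this check is only a candidate for review; whether adding a `|=` filter actually helps depends on how selective that literal string is for the streams involved.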

Do NOT trigger for:

First Seen
May 22, 2026
loki-best-practices — andreab67/agent-skills