loki-best-practices
Production-grade Grafana Loki operations: diagnosis, performance, storage, HA, and observability. Assume the user is a senior engineer running Loki in Kubernetes — skip introductions to LogQL and chunks and go straight to evidence-driven SRE workflow.
When to use
Trigger on operational Loki tasks:
- Ingest incidents: HTTP 429/4xx from distributor, `loki_discarded_samples_total` rising, push retries from Promtail/Alloy/OTel, "rate limit exceeded", "stream limit", "out of order", "too far behind", "line too long", "invalid labels".
- WAL & flush failures: ingester CrashLoopBackOff during replay, `loki_ingester_wal_replay_active` stuck at 1, `loki_ingester_chunks_flushed_total` flat while ingest is hot, OOMKilled on startup, full WAL PVC.
- Deployment & config: Helm chart `singleBinary` vs `simple-scalable` mode collisions, "schema period must be 24h", TSDB migration that hides pre-migration data, missing object-storage backend on scalable targets, replication-factor vs ingester-count mismatch.
- Integration breakage: Grafana datasource "No data" with active streams, multi-tenant `X-Scope-OrgID` returning partial results, Promtail positions lost on restart, OTel structured metadata rejected, OTel `endpoint` misconfigured (`/loki/api/v1/push/v1/logs` 404), memberlist DNS failures, NetworkPolicies blocking gossip/HTTP/gRPC.
- Query performance: "too many outstanding requests", `max_query_length` exceeded, `max_query_series` reached, high-cardinality label explosion, expensive `|~` regex without prior `|=` line filter, sharding/parallelism tuning.
- Storage: S3 `WebIdentityErr` / IRSA failures, S3 lifecycle deleting active chunks, compactor stuck or running as two replicas, `context deadline exceeded` on object storage, index-gateway slow chunk reads.
- HA: UNHEALTHY ring entries after node loss, ungraceful shutdown leaving stranded chunks, schema bump during rolling upgrade, two compactors running simultaneously, ruler firing in duplicate or not at all, memberlist IP reuse / ghost members.
- Observability of Loki itself: which metrics to alert on, what to scrape, what cache hit ratios to target.
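As a rough illustration of the "expensive `|~` regex without prior `|=` line filter" trigger in the Query performance item, the sketch below flags queries whose first line filter is a regex. The function name `regex_without_prior_line_filter` is hypothetical, and the check is a string heuristic under simplifying assumptions (a single stream selector, no `}` inside label values), not a LogQL parser.

```python
import re

# Heuristic only: matches the two line-filter operators named above.
# It does not parse LogQL and can be fooled by unusual quoting.
_LINE_FILTER = re.compile(r"\|[=~]")


def regex_without_prior_line_filter(query: str) -> bool:
    """Return True if the first line filter after the selector is `|~`."""
    # Skip the stream selector so label matchers such as `=~` are ignored.
    # Assumption: the first `}` closes the selector.
    end = query.find("}")
    pipeline = query[end + 1:] if end != -1 else query
    match = _LINE_FILTER.search(pipeline)
    return match is not None and match.group(0) == "|~"


if __name__ == "__main__":
    # Regex applied to every line of the stream: flagged.
    print(regex_without_prior_line_filter('{app="api"} |~ "timeout after \\d+ms"'))
    # Substring filter narrows the lines first: not flagged.
    print(regex_without_prior_line_filter('{app="api"} |= "timeout" |~ "timeout after \\d+ms"'))
```

A flagged query is a candidate for review rather than proof of a problem: whether the regex is actually expensive depends on the volume of lines the selector matches.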
Do NOT trigger for: