kubespray-monitoring
Kubespray Cluster Monitoring
Overview
Production monitoring for Kubespray-deployed clusters uses a three-layer stack: NFS provisioner for persistent storage, kube-prometheus-stack for metrics collection and alerting, and etcd metrics exposure for cluster health visibility.
Core principle: Monitoring must cover every cluster layer, namely infrastructure (node-exporter), Kubernetes components (API server, scheduler, controller-manager), etcd (leader status, WAL, peer latency), and workloads (pod state metrics via kube-state-metrics).
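As a rough illustration of the layer coverage described above, the sketch below pairs each layer with one PromQL query and builds the matching Prometheus HTTP API instant-query URL. The metric names are the ones the upstream exporters usually expose. The `job` label value, the choice of one query per layer, the `LAYER_QUERIES` and `instant_query_url` names, and the `http://localhost:9090` address (for example, from a port-forward) are assumptions for illustration, not taken from this document, so adjust them to match your install.

```python
from urllib.parse import urlencode

# Hypothetical mapping: one representative PromQL query per monitoring layer.
# Label values such as job="node-exporter" are assumptions and vary by install.
LAYER_QUERIES = {
    # Infrastructure: is each node-exporter target being scraped?
    "infrastructure": 'up{job="node-exporter"}',
    # Kubernetes components: rate of API server 5xx responses.
    "kubernetes": 'sum(rate(apiserver_request_total{code=~"5.."}[5m]))',
    # etcd: 1 if the member currently sees a leader, 0 otherwise.
    "etcd": "etcd_server_has_leader",
    # Workloads: pending pods per namespace, from kube-state-metrics.
    "workloads": 'sum by (namespace) (kube_pod_status_phase{phase="Pending"})',
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (GET /api/v1/query)."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


if __name__ == "__main__":
    # Assumed address; replace with wherever Prometheus is reachable.
    base = "http://localhost:9090"
    for layer, promql in LAYER_QUERIES.items():
        print(f"{layer}: {instant_query_url(base, promql)}")
```

The URLs can be fetched with any HTTP client. An empty result for a layer usually means that layer's targets are not being scraped, or that the label values differ from the ones assumed here.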
When to Use
- Setting up Prometheus, Grafana, and Alertmanager on Kubespray clusters
- Deploying NFS storage provisioner for monitoring persistence
- Importing Grafana dashboards (community or custom)
- Enabling etcd metrics collection
- Writing PromQL queries for cluster health
Not for: Initial cluster deployment (use kubespray-deployment), HA configuration (use kubespray-ha-configuration), cluster upgrades (use kubespray-operations), troubleshooting failures (use kubespray-troubleshooting)
NFS Subdir External Provisioner