prometheus

Purpose

Prometheus monitors and alerts on metrics from configured targets. It scrapes time-series data over HTTP, stores it in a local time-series database, and evaluates PromQL queries against that data to drive dashboards and alerts.

When to Use

Use this skill when monitoring infrastructure, applications, or services in a DevOps/SRE environment. Apply it for real-time metrics collection, anomaly detection, or scaling decisions, such as tracking server health in Kubernetes clusters or alerting on high error rates in microservices.

Key Capabilities

Metrics Collection: Scrapes HTTP endpoints using configurable jobs; specify targets under scrape_configs in the YAML config, e.g., a job with job_name: 'node' whose static_configs list the target 'localhost:9100'.
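Querying: Use PromQL for data retrieval; for example, rate(node_cpu_seconds_total{mode="idle"}[5m]) returns the fraction of each second a CPU spends idle, so CPU usage per instance is 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])).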
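Alerting: Define rules in YAML files to fire alerts; e.g., in a group named example, a rule with alert: HighCPU, expr: avg by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m])) > 0.8, and for: 1m. The condition goes under the expr key, and for sets how long it must hold before the alert fires.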
Storage and Retention: Stores time-series data in a local TSDB with configurable retention; set it with the --storage.tsdb.retention.time flag, e.g., --storage.tsdb.retention.time=15d (which is also the default).
Federation: Aggregate metrics from multiple Prometheus instances for larger setups; a central server scrapes selected series from the other servers through their /federate endpoint.
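
The inline snippets above are easier to read as full files. The following is a minimal sketch of a prometheus.yml that combines the node scrape job with a federation job, not a definitive configuration. The interval values, the rule file name alert_rules.yml, and the federation target prometheus-a.example:9090 are assumptions chosen for illustration.

```yaml
# prometheus.yml (illustrative sketch)
global:
  scrape_interval: 15s       # assumed value; tune per environment
  evaluation_interval: 15s   # assumed value; how often rules are evaluated

rule_files:
  - "alert_rules.yml"        # hypothetical file name, shown below

scrape_configs:
  # Scrape a node exporter on the local host.
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

  # Federation: pull selected series from another Prometheus server.
  - job_name: "federate"
    honor_labels: true        # keep the labels set by the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'      # only federate series from the node job
    static_configs:
      - targets: ["prometheus-a.example:9090"]  # hypothetical upstream server
```

The alerting rule can be sketched in its own file in the same way. The severity label and the summary annotation are assumptions added to show where such metadata would go.

```yaml
# alert_rules.yml (illustrative sketch)
groups:
  - name: example
    rules:
      - alert: HighCPU
        # Fires when average system-mode CPU per instance exceeds 80%.
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m])) > 0.8
        for: 1m               # condition must hold for one minute
        labels:
          severity: warning   # assumed label
        annotations:
          summary: "High system CPU on {{ $labels.instance }}"
```

Both files can be validated before a reload with promtool check config prometheus.yml and promtool check rules alert_rules.yml.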

Usage Patterns

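To monitor a target, create a YAML config file (e.g., prometheus.yml) with scrape jobs, then start the Prometheus server with --config.file pointing at it. For querying, use the built-in HTTP API (e.g., /api/v1/query) or integrate with tools like Grafana. Set up alerting rules early so that gaps in coverage show up before an incident does. If running in a container, mount the config file as a volume and expose the web port (default 9090). For production, enable authentication with a web configuration file passed via --web.config.file; its basic_auth_users section maps usernames to bcrypt-hashed passwords. The --web.external-url flag only sets the externally reachable URL used in generated links and does not enable authentication. Prometheus also does not read credentials from environment variables such as $PROMETHEUS_AUTH_USER and $PROMETHEUS_AUTH_PASS, so if credentials are kept in such variables, hash the password and render it into the web config file before startup.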
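
The container setup can be sketched with a Docker Compose file. This is one possible layout under stated assumptions, not the only way to run it: the service name, the host file paths, and the choice of Compose are illustrative, while the image name prom/prometheus and the flags are the standard ones. Prometheus reads basic-auth credentials from a web configuration file passed with --web.config.file, so that file is mounted alongside the main config.

```yaml
# docker-compose.yml (illustrative sketch)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"           # expose the default web port
    volumes:
      # Host paths are assumptions; adjust to where the files live.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./web.yml:/etc/prometheus/web.yml:ro
    command:
      # Overriding command replaces the image's default flags,
      # so the config and storage paths are restated here.
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--web.config.file=/etc/prometheus/web.yml"
```

The web configuration file itself only needs the user map. The username admin is an assumption, and the hash is left as a placeholder to be replaced with a real bcrypt hash.

```yaml
# web.yml (illustrative sketch)
basic_auth_users:
  # username: bcrypt hash of the password (placeholder, not a real hash)
  admin: "<bcrypt-hash-of-password>"
```

A bcrypt hash can be generated with a tool such as htpasswd -B. Once basic auth is on, clients such as Grafana need the same credentials to query the server.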