cloud-monitoring
Cloud Monitoring
This skill enables the agent to design and configure monitoring and observability solutions for cloud infrastructure and applications. The agent understands the three pillars of observability (metrics, logs, and traces) and can set up dashboards, alerting rules, SLIs, SLOs, and SLAs using tools such as Prometheus, Grafana, CloudWatch, Datadog, and OpenTelemetry. The agent also applies alerting best practices to minimize alert fatigue while ensuring critical issues are surfaced promptly.
Workflow
- Identify Monitoring Objectives: The agent works with the user to define what needs to be monitored and why. This includes identifying critical services, establishing Service Level Indicators (SLIs) such as request latency, error rate, and throughput, and setting Service Level Objectives (SLOs), which are target values for those SLIs over a measurement window (for example, 99.9% of requests succeeding over 30 days). Service Level Agreements (SLAs) are documented separately as contractual commitments to customers.
- Select Monitoring Tools and Instrumentation: Based on the cloud provider and application architecture, the agent recommends an appropriate monitoring stack. This may include Prometheus for metrics collection, Grafana for visualization, Loki or CloudWatch Logs for log aggregation, and Jaeger or AWS X-Ray for distributed tracing. The agent configures OpenTelemetry SDKs in application code to emit standardized telemetry data.
- Configure Metrics Collection and Dashboards: The agent defines and deploys metric scrapers, exporters, and custom metrics. It builds dashboards that visualize the golden signals (latency, traffic, errors, saturation) and infrastructure metrics (CPU, memory, disk, network). Dashboards are organized by service tier so teams can quickly triage issues.
- Establish Alerting Rules: The agent configures alerts that trigger on meaningful conditions (such as an error budget burn rate exceeding its threshold, sustained latency spikes, or pod restarts) rather than on raw metric thresholds alone. Multi-window, multi-burn-rate alerting is used to balance detection speed with false-positive suppression. Alert routing is configured to send critical alerts to PagerDuty or Opsgenie and warnings to Slack.
- Set Up Log Aggregation and Trace Correlation: The agent configures centralized log collection with structured logging formats (JSON), log retention policies, and log-based alerts for error patterns. Distributed traces are correlated with logs and metrics using shared trace IDs so that a single alert can link directly to the relevant request trace and log entries.
- Review and Iterate: The agent periodically audits alert noise levels, dashboard usage, and SLO compliance. Unused alerts are pruned, thresholds are adjusted based on observed baselines, and new services are onboarded into the monitoring stack as the system evolves.
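
The multi-window, multi-burn-rate alerting mentioned in the workflow can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation. The `BurnRateRule` class, the `RULES` table, and the `firing_rules` function are hypothetical names. The window pairs and thresholds (14.4, 6, 1) are commonly cited values for a 30-day SLO and are assumed here, so they should be tuned per service. In practice this logic usually lives in the monitoring system's rule language rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    """One rule: fire only if BOTH windows burn budget faster than `threshold`."""
    long_window: str
    short_window: str
    threshold: float
    severity: str


# Hypothetical rule set (assumed values for a 30-day SLO; tune per service).
RULES = [
    BurnRateRule("1h", "5m", 14.4, "page"),
    BurnRateRule("6h", "30m", 6.0, "page"),
    BurnRateRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return the observed error ratio as a multiple of the allowed error ratio.

    A value of 1.0 means the error budget would be spent exactly at the end
    of the SLO period; larger values mean it runs out sooner.
    """
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def firing_rules(error_ratios: dict[str, float], slo_target: float) -> list[BurnRateRule]:
    """Return the rules whose long and short windows both exceed the threshold.

    `error_ratios` maps a window label (e.g. "1h") to the fraction of failed
    requests measured over that window.
    """
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window], slo_target)
        short_burn = burn_rate(error_ratios[rule.short_window], slo_target)
        # The long window shows the burn is significant; the short window
        # shows it is still happening, so the alert resets soon after recovery.
        if long_burn > rule.threshold and short_burn > rule.threshold:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # Made-up measurements for a service with a 99.9% availability SLO.
    ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
    for rule in firing_rules(ratios, slo_target=0.999):
        print(f"{rule.severity}: burn rate above {rule.threshold} "
              f"over {rule.long_window} and {rule.short_window}")
```

Requiring both windows to exceed the threshold is what balances detection speed against false positives: a short spike alone does not page anyone, and an alert stops firing shortly after the error rate returns to normal.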
Supported Technologies