monitoring-observability
Monitoring & Observability
Core Workflow: Observability Implementation
Use this decision tree to determine your starting point:
Are you setting up monitoring from scratch?
├─ YES → Start with "1. Design Metrics Strategy"
└─ NO → Do you have an existing issue?
├─ YES → Go to "9. Troubleshooting & Analysis"
└─ NO → Are you improving existing monitoring?
├─ Alerts → Go to "3. Alert Design"
├─ Dashboards → Go to "4. Dashboard & Visualization"
├─ SLOs → Go to "5. SLO & Error Budgets"
├─ Tool selection → Read references/tool_comparison.md
└─ Using Datadog? High costs? → Go to "7. Datadog Cost Optimization & Migration"
More from ahmedasmar/devops-claude-skills
ci-cd
CI/CD pipeline design, optimization, DevSecOps security scanning, and troubleshooting. Use this skill whenever the user mentions CI/CD, GitHub Actions, GitLab CI, pipelines, workflows, builds, or DevSecOps. Triggers include creating new CI/CD workflows, debugging pipeline failures or flaky tests, implementing SAST/DAST/SCA security scanning, optimizing slow builds with caching or parallelization, setting up deployment workflows, securing pipelines with OIDC or secrets management, implementing matrix builds or test sharding, and troubleshooting Docker, permissions, or timeout issues.
216aws-cost-finops
AWS cost optimization and FinOps workflows. Use for finding unused resources, analyzing Reserved Instance opportunities, detecting cost anomalies, rightsizing instances, evaluating Spot instances, migrating to newer generation instances, implementing FinOps best practices, optimizing storage/network/database costs, and managing cloud financial operations. Includes automated analysis scripts and comprehensive reference documentation.
105aws-cost-optimization
AWS cost optimization and FinOps workflows. Use this skill whenever the user mentions AWS costs, cloud spending, FinOps, Reserved Instances, Savings Plans, or cost reduction. Triggers include finding unused resources, analyzing the AWS bill, rightsizing EC2 or RDS instances, evaluating Spot instances, detecting cost anomalies, migrating to Graviton or newer instance generations, implementing tagging for cost allocation, setting up AWS Budgets, conducting monthly cost reviews, comparing RI vs Savings Plans, and optimizing storage, network, or database costs.
40k8s-troubleshooter
Systematic Kubernetes troubleshooting and incident response. Use this skill whenever the user mentions Kubernetes, K8s, kubectl, pods, containers, or clusters. Triggers include diagnosing CrashLoopBackOff, ImagePullBackOff, OOMKilled, or Pending pods, responding to production incidents, troubleshooting node NotReady or DiskPressure, debugging service connectivity or networking, investigating PVC or storage failures, analyzing performance degradation, checking cluster health, troubleshooting Helm releases, and conducting post-incident reviews.
13gitops-workflows
GitOps deployment workflows with ArgoCD and Flux. Use this skill whenever the user mentions GitOps, ArgoCD, Flux, Flagger, Argo Rollouts, or continuous deployment to Kubernetes. Triggers include setting up ArgoCD or Flux from scratch, designing Git repository structures (monorepo vs polyrepo, app-of-apps), deploying to multiple clusters with ApplicationSets, managing secrets in Git (SOPS, Sealed Secrets, External Secrets Operator), implementing canary or blue-green deployments, troubleshooting sync or reconciliation issues, working with OCI artifacts, and comparing ArgoCD vs Flux.
12iac-terraform
Infrastructure as Code with Terraform and Terragrunt. Use this skill whenever the user mentions Terraform, Terragrunt, HCL, or infrastructure as code. Triggers include writing or reviewing .tf files, creating reusable modules, debugging terraform plan/apply errors, managing remote state and locks, fixing state drift, setting up CI/CD for Terraform, scaffolding new modules, validating module structure, and implementing Terragrunt DRY patterns across environments.
12