running-chaos-tests

Chaos Engineering Toolkit

Overview

Run controlled chaos engineering experiments to test system resilience, fault tolerance, and recovery capabilities. Inject failures such as network latency, service crashes, resource exhaustion, and dependency outages to verify that systems degrade gracefully and recover automatically.

Prerequisites

  • Distributed system or microservice architecture deployed in a staging/test environment
  • Monitoring and alerting configured (Grafana, Datadog, CloudWatch, or Prometheus)
  • Rollback capability for the target environment (manual or automated)
  • Chaos engineering tool installed (Toxiproxy, Pumba, Litmus, or Chaos Mesh)
  • Explicit approval from the team to run chaos experiments
  • Steady-state hypothesis defined (what "healthy" looks like in metrics)

Instructions

  1. Define the steady-state hypothesis:
    • Identify measurable indicators of normal system behavior (e.g., p99 latency < 500ms, error rate < 0.1%, all health checks pass).
    • Record baseline metrics before injecting any failures.
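
The steady-state check in step 1 can be sketched in a few lines of Python. This is a minimal illustration, not part of any chaos tool: the names SteadyState, p99_latency, and check_steady_state are hypothetical, the thresholds are the example values from the step above, and the inputs are assumed to come from whatever monitoring system is already in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical container for the example thresholds in step 1."""
    max_p99_latency_ms: float = 500.0  # p99 latency must stay below 500 ms
    max_error_rate: float = 0.001      # error rate must stay below 0.1%


def p99_latency(latencies_ms):
    """Return the 99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), which avoids floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def check_steady_state(latencies_ms, errors, requests, health_checks,
                       hypothesis=SteadyState()):
    """Return a list of violations; an empty list means the system is healthy."""
    if requests <= 0:
        raise ValueError("need at least one request to compute an error rate")
    violations = []

    p99 = p99_latency(latencies_ms)
    if p99 >= hypothesis.max_p99_latency_ms:
        violations.append(f"p99 latency {p99} ms is not below "
                          f"{hypothesis.max_p99_latency_ms} ms")

    error_rate = errors / requests
    if error_rate >= hypothesis.max_error_rate:
        violations.append(f"error rate {error_rate:.4%} is not below "
                          f"{hypothesis.max_error_rate:.4%}")

    # Every named health check must report True.
    failing = sorted(name for name, ok in health_checks.items() if not ok)
    if failing:
        violations.append("failing health checks: " + ", ".join(failing))

    return violations


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    baseline = check_steady_state(
        latencies_ms=[120.0] * 99 + [480.0],
        errors=0,
        requests=10_000,
        health_checks={"api": True, "db": True},
    )
    print("baseline violations:", baseline)
```

Running the same check before the experiment (to record the baseline) and again during and after fault injection gives a consistent definition of "healthy" to compare against.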
First Seen
Feb 1, 2026