Chaos Engineering Toolkit

Overview

Run controlled chaos engineering experiments to test system resilience, fault tolerance, and recovery capabilities. Each experiment injects a failure, such as network latency, a service crash, resource exhaustion, or a dependency outage, to verify that the system degrades gracefully and recovers automatically.
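The experiment flow described above can be sketched as a small harness. This is a minimal illustration, not the toolkit's implementation: `run_experiment`, `injected_fault`, and the `is_healthy`, `inject`, and `rollback` callables are hypothetical names, and in practice the callables would wrap whichever chaos tool and monitoring system are in use.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(inject, rollback):
    """Apply a fault and always roll it back, even if the body raises."""
    inject()
    try:
        yield
    finally:
        rollback()


def run_experiment(is_healthy, inject, rollback):
    """Run one experiment and report what was observed in each phase.

    All three arguments are hypothetical zero-argument callables
    supplied by the caller.
    """
    # Never inject a fault into a system that is already unhealthy.
    if not is_healthy():
        return {"ran": False, "reason": "system unhealthy before injection"}

    # Observe the system while the fault is active.
    with injected_fault(inject, rollback):
        healthy_during_fault = is_healthy()

    # Observe again after rollback to confirm recovery.
    recovered = is_healthy()
    return {
        "ran": True,
        "healthy_during_fault": healthy_during_fault,
        "recovered": recovered,
    }
```

Wrapping the fault in a context manager means the rollback runs even when the health check raises, which keeps a failed experiment from leaving the fault in place.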

Prerequisites

Distributed system or microservice architecture deployed in a staging/test environment
Monitoring and alerting configured (Grafana, Datadog, CloudWatch, or Prometheus)
Rollback capability for the target environment (manual or automated)
Chaos engineering tool installed (toxiproxy, Pumba, Litmus, or Chaos Mesh)
Explicit approval from the team to run chaos experiments
Steady-state hypothesis defined (what "healthy" looks like in metrics)
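One way to make the steady-state hypothesis concrete is to write it down as explicit upper bounds on metrics. The sketch below assumes that approach; the `Bound` class, the `violations` function, and the metric names and thresholds are all hypothetical and would be replaced by values taken from the team's own dashboards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    """An upper limit on one metric (hypothetical structure)."""
    metric: str
    maximum: float


def violations(bounds, observed):
    """Return the names of metrics that are missing or above their bound."""
    return [
        b.metric
        for b in bounds
        # A metric absent from the observations counts as a violation.
        if observed.get(b.metric, float("inf")) > b.maximum
    ]


# Illustrative thresholds only; real values depend on the system under test.
STEADY_STATE = [
    Bound("p99_latency_ms", 300.0),
    Bound("error_rate_percent", 1.0),
]
```

Calling `violations(STEADY_STATE, observed)` with a dictionary of current metric values returns an empty list when the system is in its steady state, so the same check can gate the start of an experiment and confirm recovery afterwards.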

Instructions