Chaos engineering is the practice of deliberately injecting failures into systems to test their resilience. Rather than waiting for systems to fail naturally and hoping your redundancy works, you proactively break things in controlled ways to verify that systems handle failures gracefully.
The philosophy is counterintuitive: the best way to build confidence in your system's reliability is to actively try to break it.
Why Break Things on Purpose?
Traditional testing validates that systems work correctly under expected conditions. Chaos engineering validates that systems degrade gracefully under unexpected conditions.
Your monitoring shows green dashboards and your tests pass. But this doesn't prove your system will survive real failures. Databases will crash, networks will partition, servers will run out of memory, dependencies will become unavailable. The question isn't whether failures will occur—it's whether your system handles them well when they do.
Here's the problem: your assumptions are probably wrong.
You assume that if your primary database fails, the replica will automatically promote. You assume that if one availability zone goes down, the other zones will handle the load. You assume your circuit breakers will trip before cascading failures take down everything.
Chaos engineering tests whether these assumptions are actually true. You find out you were wrong in a controlled experiment at 2pm on a Tuesday—not during a real outage at 3am on a Saturday.
The Chaos Engineering Process
1. Define Steady State
First, establish what "normal" looks like. What metrics indicate your system is healthy?
- Request success rate above 99.9%
- P99 latency below 200ms
- Error rate below 0.1%
- All health checks passing
The steady state is your hypothesis: "Under normal conditions, the system maintains these metrics."
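A minimal sketch of writing that hypothesis down as checkable code; the metric names and the get_metric helper are hypothetical placeholders for whatever your monitoring system (Prometheus, CloudWatch, etc.) actually exposes:

```python
# Sketch: steady state expressed as explicit, checkable thresholds.

STEADY_STATE = {
    "success_rate": lambda v: v >= 0.999,   # request success rate above 99.9%
    "p99_latency_ms": lambda v: v < 200,    # P99 latency below 200ms
    "error_rate": lambda v: v < 0.001,      # error rate below 0.1%
}

def get_metric(name: str) -> float:
    """Placeholder: fetch the current value of a metric from monitoring."""
    raise NotImplementedError

def steady_state_holds() -> bool:
    """Return True only if every steady-state metric is within bounds."""
    return all(check(get_metric(name)) for name, check in STEADY_STATE.items())
```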
2. Form a Hypothesis
Predict what will happen when you inject a failure:
"If we terminate one database instance, the system will automatically failover to the replica within 30 seconds, and users will experience no errors."
Or: "If we introduce 100ms of latency to the payment service, checkout completion rate should remain above 95%."
The hypothesis must be specific and measurable.
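One way to make the failover hypothesis executable; this is a sketch, and database_reachable and the 30-second budget stand in for your own health check and failover target:

```python
import time

def database_reachable() -> bool:
    """Placeholder health check against the database endpoint."""
    raise NotImplementedError

def verify_failover(budget_seconds: float = 30.0) -> bool:
    """Hypothesis: after terminating the primary, the replica is promoted
    and the database answers again within budget_seconds."""
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        if database_reachable():
            return True
        time.sleep(1)
    return False
```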
3. Inject Failure
Introduce the failure:
- Terminate instances
- Kill processes
- Introduce network latency
- Drop network packets
- Exhaust resources (CPU, memory, disk)
- Make dependencies unavailable
- Inject errors from databases or APIs
Start small. Don't immediately take down entire data centers.
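In that spirit, a "start small" injection can be as minimal as killing a single container and nothing else. A sketch, assuming Docker is available and the container name refers to a non-critical instance you own:

```python
import subprocess

# Hypothetical target: one non-critical container chosen ahead of time.
CONTAINER = "orders-worker-1"

def kill_one_container(name: str) -> None:
    """Kill a single container; `docker kill` sends SIGKILL by default."""
    subprocess.run(["docker", "kill", name], check=True)

if __name__ == "__main__":
    kill_one_container(CONTAINER)
```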
4. Observe Results
Monitor your metrics during the experiment:
- Did steady state metrics remain within acceptable bounds?
- Did the system automatically recover?
- Were there cascading failures?
- Did monitoring and alerting work correctly?
- How long until recovery?
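A small sketch for capturing the answers to these questions in a structured way so the learning step has something concrete to work from; the file name and fields are illustrative, not a standard format:

```python
import json
import time

def record_observation(experiment: str, steady_state_held: bool,
                       recovered_automatically: bool,
                       seconds_to_recovery: float | None, notes: str = "") -> None:
    """Append one experiment's outcome to a local log for later review."""
    entry = {
        "experiment": experiment,
        "timestamp": time.time(),
        "steady_state_held": steady_state_held,
        "recovered_automatically": recovered_automatically,
        "seconds_to_recovery": seconds_to_recovery,
        "notes": notes,
    }
    with open("chaos_observations.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

# Hypothetical usage after an experiment:
# record_observation("db-primary-termination", steady_state_held=True,
#                    recovered_automatically=True, seconds_to_recovery=22.0,
#                    notes="alert fired 40 seconds late")
```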
5. Learn and Improve
If the hypothesis was correct, you've built confidence in your resilience.
If the hypothesis was wrong, you've discovered a weakness before it caused a real outage. Fix it, then repeat the experiment to verify the fix works.
Finding problems is success, not failure.
Common Chaos Experiments
Instance Termination
Randomly terminate virtual machines or containers. This tests automatic instance replacement, load balancer health checks, application statelessness, and graceful shutdown.
Netflix's Chaos Monkey famously does this continuously in production, randomly killing instances throughout the day. Engineers know that any instance might die at any moment—so they build systems that handle it.
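A minimal Chaos Monkey-style sketch, assuming AWS EC2 via boto3 and a hypothetical chaos-opt-in tag marking instances that are fair game:

```python
import random
import boto3

def terminate_random_instance(region: str = "us-east-1") -> str | None:
    """Pick one running, opted-in instance at random and terminate it."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},  # hypothetical opt-in tag
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]
    if not instance_ids:
        return None
    victim = random.choice(instance_ids)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```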
Network Failures
Latency injection adds delay to network calls. Does your application timeout correctly, or does it wait forever? Do slow dependencies cause cascading failures?
Packet loss drops a percentage of packets. Does your retry logic actually work?
Network partitions disconnect parts of your system from each other. Can your application handle dependency unavailability? Do distributed systems handle split-brain scenarios?
DNS failures make DNS resolution fail or return wrong results. Most applications assume DNS always works. They're wrong.
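On a Linux host, latency and packet loss can be simulated with tc netem. A sketch wrapping the commands in Python; eth0 and the specific values are assumptions, and the commands require root:

```python
import subprocess

IFACE = "eth0"  # assumption: replace with the interface you actually use

def add_latency(ms: int = 100) -> None:
    """Add a fixed delay to all egress traffic on IFACE."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "delay", f"{ms}ms"],
        check=True,
    )

def add_packet_loss(percent: float = 5.0) -> None:
    """Drop a percentage of egress packets on IFACE."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", IFACE, "root", "netem", "loss", f"{percent}%"],
        check=True,
    )

def clear() -> None:
    """Remove the netem qdisc, restoring normal networking."""
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root", "netem"], check=True)
```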
Resource Exhaustion
CPU saturation consumes all available CPU. Does monitoring alert? Does autoscaling trigger? Do resource limits prevent one service from affecting others?
Memory exhaustion tests memory leak detection, OOM killer behavior, and container restarts.
Disk filling tests whether applications handle write failures gracefully or corrupt data.
Connection pool exhaustion tests whether applications queue requests, reject them gracefully, or crash.
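A deliberately crude memory-exhaustion sketch; run it only inside a container or cgroup with a memory limit so the OOM killer hits the experiment, not your host:

```python
import time

def exhaust_memory(chunk_mb: int = 64, pause_s: float = 0.5) -> None:
    """Allocate memory in chunks until the process is OOM-killed or throttled.
    The gradual pace gives monitoring and resource limits a chance to react."""
    hoard = []
    while True:
        hoard.append(bytearray(chunk_mb * 1024 * 1024))  # keep references so nothing is freed
        time.sleep(pause_s)

if __name__ == "__main__":
    exhaust_memory()
```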
Dependency Failures
Service unavailability makes downstream services completely unavailable. Do circuit breakers trip? Does fallback logic work?
Slow dependencies make downstream calls respond very slowly. This is often worse than complete unavailability: slow responses tie up threads and connections, causing cascading failures.
Error injection makes dependencies return errors. Does your application handle partial failures?
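Error injection can be as small as a wrapper around an outbound call that fails a configurable fraction of the time. A sketch; call_payment_service in the usage comment is a hypothetical stand-in for your real client:

```python
import random

class InjectedFault(Exception):
    """Raised instead of calling the real dependency."""

def with_error_injection(func, error_rate: float = 0.1):
    """Wrap a dependency call so a fraction of calls fail artificially."""
    def wrapper(*args, **kwargs):
        if random.random() < error_rate:
            raise InjectedFault(f"chaos: injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper

# Hypothetical usage: does the caller handle partial failures?
# charge = with_error_injection(call_payment_service, error_rate=0.2)
```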
Practicing Chaos Safely
Start Small
- Test in development first
- Then staging with production-like load
- Then production during low-traffic periods
- Finally, production during normal or peak traffic
Start with single-instance failures before attempting multi-instance or service-level failures.
Gradual Scope Expansion
Netflix's progression is instructive:
Chaos Monkey (2010): Randomly kills individual instances.
Chaos Kong (2015): Takes down entire availability zones or regions.
ChAP (Chaos Automation Platform): Orchestrates complex multi-failure scenarios.
This progression took five years. They built confidence and tooling incrementally.
Establish Guardrails
Blast radius limits ensure experiments can't affect too much at once. Experiment with 10% of traffic, or one availability zone, not everything.
Automatic abort stops experiments if metrics degrade beyond thresholds. If error rates spike above 5%, kill the experiment.
Exclude critical paths initially. Don't start chaos experiments by breaking payment processing during Black Friday.
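A sketch of an automatic abort loop, reusing the hypothetical get_metric placeholder from earlier and assuming a rollback callable that undoes the injected failure:

```python
import time

ERROR_RATE_ABORT_THRESHOLD = 0.05  # abort if the error rate spikes above 5%

def get_metric(name: str) -> float:
    """Placeholder query against your monitoring system."""
    raise NotImplementedError

def run_with_guardrail(rollback, duration_s: float = 600.0, poll_s: float = 10.0) -> bool:
    """Watch the error rate for the life of the experiment.
    Returns True if it ran to completion, False if it was aborted."""
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        if get_metric("error_rate") > ERROR_RATE_ABORT_THRESHOLD:
            rollback()  # undo the injected failure immediately
            return False
        time.sleep(poll_s)
    return True
```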
Organizational Readiness
Chaos engineering requires maturity:
Observability must be excellent. You can't observe results without comprehensive metrics, logging, and tracing.
On-call readiness means engineers must be available to intervene if experiments go wrong.
Cultural safety allows acknowledging weaknesses without blame. Finding problems is the goal.
Incident response processes should be well-practiced. Chaos experiments that go wrong become real incidents.
Advanced Practices
Continuous Chaos
Rather than periodic experiments, inject failures continuously. Systems that handle constant low-level failures are more resilient than systems that only face failures during quarterly disaster recovery drills.
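A toy scheduler for continuous low-level chaos; the business-hours window, the hourly probability, and run_one_experiment are all assumptions:

```python
import random
import time
from datetime import datetime

def run_one_experiment() -> None:
    """Placeholder for a single small, guarded chaos experiment."""
    raise NotImplementedError

def continuous_chaos(probability_per_hour: float = 0.25) -> None:
    """Each hour during business hours, maybe run one small experiment,
    so failure handling is exercised constantly rather than quarterly."""
    while True:
        now = datetime.now()
        if now.weekday() < 5 and 9 <= now.hour < 17:  # weekdays, office hours
            if random.random() < probability_per_hour:
                run_one_experiment()
        time.sleep(3600)
```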
GameDays
Coordinated events where teams deliberately cause complex failures and practice incident response:
- Primary region fails during peak traffic
- Database replication fails causing data inconsistency
- Critical third-party API becomes unavailable
- Key personnel unavailable (testing runbooks and cross-training)
GameDays build muscle memory for real incidents.
Common Objections
"We can't risk breaking production." Real failures will break production anyway. Controlled chaos reveals weaknesses before uncontrolled failures do.
"Our monitoring will catch failures." Monitoring tells you when failures occur. It doesn't tell you whether your systems handle them gracefully.
"We'll test in staging." Staging rarely has production's scale, traffic patterns, or data characteristics. Some failures only manifest at scale.
"Our systems are too critical." Critical systems especially need chaos engineering. You can't afford to discover failure modes during real incidents.
Measuring Success
Success isn't measured by how many experiments pass—it's measured by how many weaknesses you discover and fix.
- Weaknesses found: Failure modes discovered through experiments
- Weaknesses fixed: Discovered issues that were resolved
- Mean time to recovery: Should improve over time
- Blast radius: Average impact of failures should decrease as resilience improves
Tools
Chaos Monkey (Netflix): The original—randomly terminates instances.
Gremlin: Commercial platform for various chaos experiments with safety controls.
Litmus: Chaos engineering for Kubernetes environments.
ToxiProxy: Simulates network conditions between services.
These tools lower the barrier to entry, providing safe frameworks for experiments.
The Cultural Shift
Chaos engineering represents a fundamental shift: from hoping systems are resilient to actively verifying it.
The most successful programs build culture around resilience:
- Blameless post-mortems when experiments uncover problems
- Celebration when weaknesses are found—that's the goal
- Resilience requirements in system design from the beginning
- Regular practice as part of normal operations, not special events
By deliberately breaking things in controlled ways, you build both confidence in your systems and muscle memory for handling real incidents. Your assumptions get tested. Your weaknesses get found. And when the real failures come—because they will—you're ready.