Reliability and Uptime

Active-Active vs. Active-Passive

Active-passive bets that failure is rare. Active-active bets that failure is routine. Both bets are correct—depending on your scale. Here's how to choose.

Chaos Engineering

Your assumptions about system resilience are probably wrong. Chaos engineering is how you find out before your customers do.

Error Budgets

Error budgets flip reliability engineering on its head: instead of preventing all failure, they give you a failure allowance and ask whether you're spending it wisely.
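
As a rough illustration of the "failure allowance" idea, the sketch below treats the budget as the number of failed requests an SLO permits over some window and reports what fraction of it is left. The function name, its parameters, and the request-based accounting are assumptions made for this example; they are not taken from the article.

```python
def error_budget_remaining(slo_pct: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent).

    Assumes a request-based SLO: slo_pct percent of requests must succeed.
    """
    if not 0 < slo_pct < 100:
        raise ValueError("slo_pct must be strictly between 0 and 100")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = total_requests * (1 - slo_pct / 100)
    return 1 - failed_requests / allowed_failures


# Hypothetical numbers: a 99.9% SLO over 1,000,000 requests tolerates about
# 1,000 failures, so 400 failures leaves roughly 60% of the budget.
remaining = error_budget_remaining(99.9, 1_000_000, 400)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the budget is overspent, which is the usual signal to favor reliability work over new releases.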

Redundancy and Failover

Every system fails eventually. Redundancy buys you time between failure and impact. Failover determines how much time you actually get.

SLAs and SLOs

SLOs define what 'good enough' means for your service. SLAs are the promises you make when money is on the line. Error budgets are the insight that makes both useful.

SLIs (Service Level Indicators)

SLIs measure what users actually experience—not what your servers report. Learn to choose metrics that reveal reality, not comfort.

The Nines of Availability

Each additional nine of availability cuts allowed downtime by a factor of ten, and demands far more organizational change than the one before it. A guide to what each level actually demands.

What Does 99.9% Uptime Mean?

99.9% sounds nearly perfect. It actually allows roughly 8 hours and 46 minutes of downtime per year, and the number hides even more than it reveals.
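
The arithmetic behind that figure is simple: the allowed downtime is the unavailable fraction multiplied by the length of the period. Here is a minimal sketch, assuming a 365.25-day year; the function name and the default period are choices made for this example.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days, which is what yields ~8 h 46 min.
YEAR = timedelta(days=365.25)


def allowed_downtime(availability_pct: float, period: timedelta = YEAR) -> timedelta:
    """Return the downtime an availability target permits over the given period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # Unavailable fraction times period length.
    return period * (1 - availability_pct / 100)


# Each extra nine divides the allowance by ten.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime(target)} per year")
```

Passing a different period, such as `timedelta(days=30)`, shows one thing the yearly figure hides: the same 99.9% target leaves only about 43 minutes in a 30-day month.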

Works on My Machine

Every 'works on my machine' bug is an invisible assumption—something you didn't know you were depending on. Here's how to see them before production does.
