
Updated 10 hours ago

When designing redundant systems, you face a fundamental question: Should your backup systems wait for failure, or should they already be working?

Active-passive keeps backup systems idle until needed. Active-active puts all systems to work simultaneously. The choice shapes everything—availability, cost, complexity, and how your system behaves when things go wrong.

Active-Passive: The Backup Generator Model

In active-passive, one component handles all traffic while standbys wait. Think of a building's power: the primary source runs everything, and the backup generator sits idle, starting only when primary power fails.

The active component handles all requests. Users interact exclusively with it.

The passive component stays synchronized—receiving data updates, maintaining current state—but serves no traffic. It's ready to become active but isn't working.

Failover happens when the active component fails. Detection systems recognize the failure, promote the passive component, and redirect users to it.
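The detect, promote, redirect sequence can be sketched in a few lines. This is a minimal illustration, not a production controller: the `Node` and `FailoverController` names and the `max_misses` threshold are assumptions made for the example, and a real system would also need fencing so the old primary cannot keep accepting writes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class FailoverController:
    """Promotes the standby after several consecutive failed health checks."""

    def __init__(self, active: Node, passive: Node, max_misses: int = 3):
        self.active = active
        self.passive = passive
        self.max_misses = max_misses  # hypothetical threshold, tune per system
        self.misses = 0

    def check(self) -> Node:
        """Run one health check and return the node that should receive traffic."""
        if self.active.healthy:
            self.misses = 0
            return self.active
        self.misses += 1
        if self.misses >= self.max_misses and self.passive.healthy:
            # Promote: the standby becomes active, the failed node becomes the standby.
            self.active, self.passive = self.passive, self.active
            self.misses = 0
        return self.active


primary = Node("primary")
standby = Node("standby")
controller = FailoverController(primary, standby)

primary.healthy = False
routed = [controller.check().name for _ in range(3)]
print(routed)  # ['primary', 'primary', 'standby']
```

Requiring several consecutive misses avoids failing over on a single dropped health check, at the price of a longer detection window.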

Where You See Active-Passive

Primary-standby databases: One instance handles all queries. A replica receives updates through replication but serves nothing. If the primary dies, the standby gets promoted.

Load balancer pairs: Two identical load balancers, but only one routes traffic. The second monitors the first. A virtual IP moves to the standby if the primary fails.

Network connections: The primary internet connection carries all traffic. The secondary sits dormant until the primary goes down.

Why Active-Passive Works

Simplicity: One component processes traffic. No coordination needed.

State consistency: Only one writer means no synchronization conflicts.

Easier testing: You can verify the passive component without touching production. Bring it up, check it, return it to standby.

Lower licensing costs: Some software charges per active instance. With active-passive, that can mean one license covers two systems, though terms vary by vendor.

Why Active-Passive Hurts

Wasted capacity: You're paying for infrastructure that does nothing. Need 100 servers? Provision 200, use 100.

Failover delay: Even automated failover takes time—30 seconds to several minutes while detection happens and the standby activates. That's downtime.
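A rough way to reason about that delay is to add up its parts. The sketch below assumes a simple model in which downtime is detection time (check interval times the number of missed checks required) plus promotion time plus redirect time. The function name and the numbers are illustrative, not measurements.

```python
def failover_downtime_seconds(check_interval, max_misses, promotion, redirect):
    """Estimate downtime as detection + promotion + redirect, all in seconds."""
    detection = check_interval * max_misses
    return detection + promotion + redirect


# Hypothetical values: 5 s health checks, 3 misses to declare failure,
# 20 s to promote the standby, 10 s for clients to be redirected.
downtime = failover_downtime_seconds(check_interval=5, max_misses=3,
                                     promotion=20, redirect=10)
print(downtime)  # 45
```

Shortening the check interval or the miss threshold reduces detection time, but makes false failovers more likely.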

Failover risk: Failover is complex and infrequent. When it finally happens, things can go wrong. The passive component might not start correctly. Configuration might be stale. Some dependency might be missing.

You can't test failover without doing failover. And doing failover is the thing you're trying to avoid. Drills help, but they're not the real thing.

Capacity cliff: After failover, the standby carries everything and nothing is left in reserve. If you sized it for normal load rather than peak, it can't absorb a surge. Your provisioned capacity went from 200% to 100% at the worst possible moment.

Active-Active: Everyone Works

In active-active, all components serve traffic simultaneously. There's no distinction between primary and backup. Every component is primary.

Picture multiple workers all doing the same job at once. If one stops, the others continue without interruption, because they were already working.

All components process requests. Load balancers distribute traffic across everyone.

Failure handling is invisible. When a component fails, load balancers stop sending it traffic. Remaining components handle slightly more load. There's no failover process because everyone's already active.
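As a minimal sketch of that behavior, here is a round-robin balancer that skips backends marked unhealthy. The class and method names are invented for illustration. Real load balancers add health probes, connection draining, and weighting.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Distributes requests across backends, skipping any marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["a", "b", "c"])
lb.mark_down("b")
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['a', 'c', 'a', 'c']
```

Nothing is promoted and nothing is redirected. The failed backend simply stops being chosen.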

Where You See Active-Active

Web servers: Ten servers all handle requests. If two fail, eight continue, each taking about 25% more load.

Multi-region deployments: Your application runs in three geographic regions simultaneously. Each serves nearby users. If one region fails, the other two absorb the traffic.

CDN nodes: Hundreds of edge locations all serving content. Nodes fail constantly. Users don't notice because other nodes keep serving.

Microservices: Multiple instances of each service run simultaneously. Instances come and go. The service continues.

Why Active-Active Works

No failover time: Components are already handling traffic. When one fails, others handle slightly more. Users might not notice anything happened.

Full utilization: You're using everything you're paying for. All 100 servers serve traffic, not just 50.

Graceful degradation: Losing one of ten servers means 10% less capacity, not 100%. Service quality degrades smoothly instead of falling off a cliff.
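The arithmetic behind this is worth making explicit. The sketch below assumes load is spread evenly across survivors; the function name is illustrative.

```python
def load_increase_per_survivor(total, failed):
    """Fractional extra load each survivor takes when `failed` of `total` instances die."""
    survivors = total - failed
    if survivors <= 0:
        raise ValueError("no surviving instances")
    return total / survivors - 1


print(round(load_increase_per_survivor(10, 1), 3))  # 0.111
print(load_increase_per_survivor(10, 2))            # 0.25
print(load_increase_per_survivor(2, 1))             # 1.0
```

With ten instances, one failure adds about 11% to each survivor. With only two, a single failure doubles the load on the one that remains, which is why small active-active clusters degrade far less gracefully than large ones.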

Continuous failover testing: The system constantly routes around failed components. You know failover works because it's working all the time.

Geographic distribution: Users connect to nearby instances. Lower latency, better experience.

Why Active-Active Hurts

Complexity: Coordinating multiple active instances is hard. State synchronization, data consistency, distributed coordination—all of it gets complicated.

Data consistency: Multiple components accepting writes must somehow agree on truth. Conflict resolution, distributed transactions, eventual consistency—pick your poison.
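To make one of those options concrete, here is a sketch of last-write-wins conflict resolution, one of the simplest strategies. The `Versioned` type and its fields are assumptions for the example. Note the cost: the losing write is silently discarded, and the result depends on reasonably synchronized clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumed to come from the writer's clock
    node_id: str      # tie-breaker when timestamps are equal


def last_write_wins(a, b):
    """Keep the newer write, breaking timestamp ties by node id."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


def merge(local, remote):
    """Merge two replicas' key-value state, resolving conflicts per key."""
    merged = dict(local)
    for key, incoming in remote.items():
        current = merged.get(key)
        merged[key] = incoming if current is None else last_write_wins(current, incoming)
    return merged


east = {"cart": Versioned("2 items", 100.0, "east")}
west = {"cart": Versioned("3 items", 101.0, "west"),
        "theme": Versioned("dark", 90.0, "west")}

merged = merge(east, west)
print(merged["cart"].value)  # 3 items
```

Because the rule is deterministic, both replicas converge on the same state regardless of which one runs the merge.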

Session management: Either users stick to one server (session affinity) or session state must be shared. Both add complexity.
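Session affinity can be as simple as hashing the session ID to a server, as in this sketch. The function and server names are hypothetical. Plain modulo hashing like this remaps many sessions whenever the server list changes, which is one reason production systems often use consistent hashing or a shared session store instead.

```python
import hashlib
from typing import Sequence


def pick_server(session_id: str, servers: Sequence[str]) -> str:
    """Map a session to a server deterministically, so repeat requests stick."""
    if not servers:
        raise ValueError("no servers available")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["web-1", "web-2", "web-3"]
first = pick_server("session-abc", servers)
second = pick_server("session-abc", servers)
print(first == second)  # True
```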

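Higher infrastructure costs: Every instance must handle its share of peak load, plus the extra share it inherits when a peer fails. There's no idle standby to fall back on, so that spare capacity has to be spread across instances that are already working.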

Higher licensing costs: Per-active-instance pricing means paying for every component.

The Real Decision Framework

Active-passive is a bet that failure is rare. Active-active is a bet that failure is routine.

Both bets are correct—depending on your scale.

Bet on Rare Failure (Active-Passive) When:

  • Your team is small or lacks distributed systems expertise
  • Your application has complex state that's hard to share
  • Brief failover delays (30 seconds to a few minutes) are acceptable
  • Software licensing makes active-active cost-prohibitive
  • Writes must be strictly serialized
  • You're running a handful of components, not hundreds

Bet on Routine Failure (Active-Active) When:

  • Your SLOs demand near-zero failover time
  • Paying for idle capacity is economically untenable
  • Users in different regions need low latency
  • You run at scale where component failures happen daily
  • Your team can handle distributed systems complexity
  • Your application is stateless or has easily shareable state

Hybrid Patterns

Real systems often combine both:

Active-active regionally, active-passive globally: Multiple instances serve traffic within each region. If an entire region fails, traffic shifts to another region. Active-active where failures are routine (individual servers), active-passive where failures are rare (entire regions).

Active-passive for writes, active-active for reads: Writes go to a single primary. Reads distribute across replicas. Common for databases where write consistency matters but read scaling is needed.
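A read/write split might be sketched as a small router like the one below. The class name and the rule of classifying statements by their first keyword are simplifications for illustration. Real routers must also handle transactions, locking reads, and reads that need fresh data and so belong on the primary.

```python
from itertools import cycle


class ReadWriteRouter:
    """Sends writes to the single primary and spreads reads across replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas) if replicas else None

    def route(self, statement: str) -> str:
        words = statement.split()
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        # Anything that is not clearly a read goes to the primary.
        return self.primary


router = ReadWriteRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT * FROM users"))         # replica-1
print(router.route("UPDATE users SET name = 'x'")) # primary
print(router.route("select 1"))                    # replica-2
```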

Active-active with headroom: Run all components actively, but maintain enough spare capacity that survivors can absorb failures. Combines active-active's instant failover with active-passive's capacity assurance.

Implementation Reality

For Active-Passive:

Automate failover: Manual failover is too slow. Humans notice too late, react too slowly, make mistakes under pressure.

Test failover regularly: Periodic drills ensure the process works. Untested failover isn't failover—it's hope.

Monitor passive components: They're not serving traffic, but they'd better be healthy. An unhealthy standby means you have no redundancy.

Keep standbys synchronized: Any writes the standby has not yet received when the primary fails are lost at failover, so replication lag translates directly into data loss.
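One way to quantify the exposure is to compare log positions. The sketch below assumes each node reports a monotonically increasing write position (a hypothetical sequence number, not any particular database's API) and picks the most caught-up standby for promotion.

```python
from typing import Dict, Tuple


def unreplicated_writes(primary_position: int, standby_position: int) -> int:
    """Writes acknowledged by the primary that the standby has not applied yet."""
    return max(0, primary_position - standby_position)


def pick_standby(primary_position: int, standbys: Dict[str, int]) -> Tuple[str, int]:
    """Choose the most caught-up standby and report how many writes would be lost."""
    name = max(standbys, key=standbys.get)
    return name, unreplicated_writes(primary_position, standbys[name])


# Hypothetical positions at the moment the primary fails.
choice = pick_standby(1050, {"standby-a": 1000, "standby-b": 1042})
print(choice)  # ('standby-b', 8)
```

Tracking this number continuously, rather than discovering it during an outage, tells you how much data a failover would cost at any given moment.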

Size standbys for peak load: After failover is exactly when you might face peak load. Size accordingly.

For Active-Active:

Accept eventual consistency: Strong consistency across distributed instances is expensive, and during a network partition it comes at the cost of availability. Design around it.

Load balance intelligently: Consider session affinity, geographic proximity, component health, and current load.

Handle partial failures: Some components being slow or unhealthy is normal. Handle it gracefully.

Build capacity headroom: If one instance failing overloads others, causing them to fail, you don't have redundancy—you have cascading failure waiting to happen.
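Headroom can be sized with simple arithmetic. This sketch assumes identical instances and evenly balanced load. The function names and figures are illustrative.

```python
import math


def instances_needed(peak_load, capacity_per_instance, tolerated_failures):
    """Instances required so peak load is still served after the given failures."""
    base = math.ceil(peak_load / capacity_per_instance)
    return base + tolerated_failures


def survives(instances, failed, peak_load, capacity_per_instance):
    """True if the remaining instances can carry peak load."""
    return (instances - failed) * capacity_per_instance >= peak_load


# Hypothetical numbers: 9000 requests/s at peak, 1000 requests/s per instance,
# and a goal of tolerating two simultaneous failures.
n = instances_needed(9000, 1000, 2)
print(n)                              # 11
print(survives(n, 2, 9000, 1000))     # True
print(survives(10, 2, 9000, 1000))    # False
```

With eleven instances, each runs at roughly 82% of capacity at peak, and losing two still leaves enough to serve the load. With ten, the same two failures push the survivors past their limit.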

The choice between active-active and active-passive isn't about which pattern is better. It's about which bet matches your reality. Small scale with rare failures? Active-passive's simplicity wins. Large scale with routine failures? Active-active's resilience wins. Most systems eventually need both, layered appropriately.

