Geographic redundancy means putting copies of your system in different physical locations—different buildings, cities, or continents. When one location fails, others keep running.
This sounds simple. It isn't.
The Problem It Solves
Having redundant servers in the same data center protects against server failures. It doesn't protect against the data center losing power, flooding, or catching fire. Everything in that building shares the same fate.
Regional failures happen more often than you'd expect:
The 2003 Northeast blackout cut power to some 50 million people in the northeastern US, taking many data centers down with it. Hurricane Sandy flooded New York data centers in 2012, causing extended outages for services without geographic redundancy. Fiber cuts can isolate entire regions even when the servers themselves remain operational.
Your servers might be perfect. The building they're in might not be.
The Levels
Geographic redundancy exists at different scales, each trading cost and complexity for protection:
Multi-availability-zone is the minimum for production cloud systems. Cloud providers divide regions into availability zones—physically separate data centers with independent power and networking, typically within 100 kilometers. Latency between zones is 1-5ms. This protects against facility failures but not regional disasters.
Multi-region distributes across distant locations—different states, countries, or continents. Latency jumps to 50-300ms depending on distance. You're protected against regional disasters, but you've entered a different world of complexity.
Multi-cloud spans different providers entirely. This protects against provider-wide outages but adds enormous operational complexity. Few organizations need this except for the most critical systems.
The Real Challenge: Data
Distributing compute is straightforward. Distributing data is where geographic redundancy gets hard.
The problem: a user in New York writes data. A user in London reads it. What happens?
Synchronous replication waits for all regions to acknowledge writes before completing. The New York write doesn't finish until London confirms it received the data. This guarantees consistency, but every write now pays a transatlantic round trip, roughly 70-100ms at real network speeds, and no engineering can push that below the time it takes light to cross the Atlantic and back.
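As a rough sketch of the idea (not any particular database's protocol), with hypothetical region names and a placeholder `replicate_to` standing in for the actual network call:

```python
REGIONS = ["us-east", "eu-west"]  # hypothetical region names

def replicate_to(region: str, key: str, value: str) -> bool:
    """Placeholder for the cross-region replication call (real network I/O in practice)."""
    return True

def synchronous_write(store: dict, key: str, value: str) -> None:
    store[key] = value
    # The client is not acknowledged until every remote region confirms the write,
    # so every write pays at least one cross-region round trip.
    if not all(replicate_to(region, key, value) for region in REGIONS):
        raise RuntimeError("write failed: not every region acknowledged it")
```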
Asynchronous replication completes the write in New York immediately and sends it to London in the background. Fast, but there's a window where New York and London have different data. If New York fails before replication completes, those writes are gone.
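An asynchronous variant, sketched under the same assumptions, acknowledges locally and ships the write later; whatever is still sitting in the queue when the region dies is the data you lose:

```python
import queue
import threading

def replicate_to(region: str, key: str, value: str) -> None:
    """Placeholder for the cross-region replication call (real network I/O in practice)."""

replication_queue: queue.Queue = queue.Queue()

def asynchronous_write(store: dict, key: str, value: str) -> None:
    store[key] = value                   # acknowledged to the client immediately
    replication_queue.put((key, value))  # shipped to the remote region in the background

def replication_worker() -> None:
    # Writes still in this queue are the replication window: lost if the local region fails first.
    while True:
        key, value = replication_queue.get()
        replicate_to("eu-west", key, value)

threading.Thread(target=replication_worker, daemon=True).start()
```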
The CAP theorem says that when a network partition occurs, a system must give up either consistency or availability; it can't guarantee all three properties at once. Geographic distribution creates partitions by definition: the Atlantic Ocean is a permanent one. You have to choose: do you want writes to be slow, or do you want regions to sometimes disagree?
This isn't academic. It's the universe telling you that information can't travel faster than light, and you have to pick which lie you're comfortable telling your users.
Active-Active vs. Active-Passive
Two fundamental architectures:
Active-passive: One region handles all traffic. Other regions stay synchronized but idle. When the primary fails, traffic shifts to a passive region.
Simpler because you don't need to solve distributed writes—there's only one place that accepts them. But you're paying for capacity that sits unused, and failover takes time because the passive region needs to warm up.
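A minimal sketch of that failover loop, with hypothetical `is_healthy` and `point_traffic_at` helpers standing in for real health checks and DNS or load-balancer updates:

```python
import time

REGIONS = ["us-east-active", "us-west-passive"]  # hypothetical names
active = REGIONS[0]

def is_healthy(region: str) -> bool:
    """Placeholder for a real health check (HTTP probe, replication-lag check, etc.)."""
    return True

def point_traffic_at(region: str) -> None:
    """Placeholder for the traffic shift (DNS update, load-balancer change, etc.)."""

def failover_loop() -> None:
    global active
    while True:
        if not is_healthy(active):
            # Promote the first healthy standby and shift traffic to it.
            for candidate in REGIONS:
                if candidate != active and is_healthy(candidate):
                    active = candidate
                    point_traffic_at(active)
                    break
        time.sleep(30)  # poll interval; real systems also debounce flapping checks
```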
Active-active: All regions serve traffic. Users connect to nearby regions. When one fails, others absorb its traffic.
Better performance, faster failover, more efficient capacity use. But now you need to solve the hard problem: what happens when two regions modify the same data simultaneously?
Conflict resolution strategies range from crude to sophisticated:
- Last-write-wins: Simple but lossy. Whoever wrote last erases the other write.
- Application-defined merge: Your code decides how to combine conflicting updates. Complex but precise.
- CRDTs (Conflict-Free Replicated Data Types): Data structures designed so concurrent updates can always merge without conflicts. Elegant, but constrains which data structures you can use (see the sketch after this list).
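To make the contrast concrete, here is a minimal sketch (hypothetical region names, illustrative only): a last-write-wins register silently drops one of two concurrent updates, while a grow-only counter CRDT merges them without losing either:

```python
from dataclasses import dataclass, field

@dataclass
class LWWRegister:
    """Last-write-wins: the update with the newest timestamp survives; the other is discarded."""
    value: str = ""
    timestamp: float = 0.0

    def write(self, value: str, timestamp: float) -> None:
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

    def merge(self, other: "LWWRegister") -> None:
        self.write(other.value, other.timestamp)

@dataclass
class GCounter:
    """Grow-only counter CRDT: each region increments its own slot; merge takes per-slot maxima."""
    counts: dict = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

# Two regions count events independently, then merge without losing either update.
ny, london = GCounter(), GCounter()
ny.increment("new-york", 3)
london.increment("london", 5)
ny.merge(london)
assert ny.value() == 8
```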
Getting Users to the Right Region
DNS-based routing returns different IP addresses based on user location. Simple but slow to fail over—DNS caching means changes take minutes to propagate.
Anycast advertises the same IP address from multiple regions. Network routing naturally sends users to the closest one. Failover is automatic. But it requires control over BGP routing, which is complex.
Global load balancers (Route 53, Google Cloud Load Balancing, Azure Traffic Manager) provide sophisticated routing with health checks and fast failover. These make multi-region accessible without deep networking expertise.
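The routing decision itself is conceptually simple. A hedged sketch of what such a balancer does, with made-up latency figures and a placeholder health check:

```python
# Hypothetical per-user latency estimates (ms) to each region; real balancers
# derive these from geolocation databases or live measurements.
LATENCY_MS = {"us-east": 20, "eu-west": 90, "ap-south": 220}

def is_healthy(region: str) -> bool:
    """Placeholder for the balancer's health checks against each region."""
    return True

def pick_region() -> str:
    # Route to the lowest-latency healthy region; if it fails its health check,
    # the next-closest healthy region absorbs the traffic automatically.
    healthy = [region for region in LATENCY_MS if is_healthy(region)]
    if not healthy:
        raise RuntimeError("no healthy regions")
    return min(healthy, key=LATENCY_MS.get)
```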
The Cost
Geographic redundancy is expensive in every dimension:
Infrastructure: You're paying for multiple regions' worth of everything.
Data transfer: Replicating data between regions incurs significant charges. Cloud providers bill for every byte that crosses a region boundary, and at replication volumes those per-gigabyte rates add up fast.
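For a rough sense of scale, assuming a hypothetical rate of $0.02 per gigabyte (actual prices vary by provider and route):

```python
# Back-of-envelope: replicating 500 GB of changed data per day to one extra region.
gb_per_day = 500
price_per_gb = 0.02                        # hypothetical cross-region rate, USD/GB
monthly_cost = gb_per_day * 30 * price_per_gb
print(monthly_cost)                        # 300.0 USD per month, per replica region
```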
Complexity: More regions mean more monitoring, more deployment coordination, more network configuration. Debugging becomes harder—is this issue regional? Is it replication lag? Is it a routing problem?
Latency: Cross-region communication is slow. Database transactions, distributed locks, and synchronous calls all suffer.
Testing
Untested multi-region architecture is theoretical redundancy.
You've built the infrastructure. You've configured the replication. You've set up the routing. Does it actually work?
Regular failover drills: Deliberately shift traffic between regions. Does everything still work? How long does it take?
Chaos engineering: Fail entire regions in production. Does automatic failover actually happen? How much data is lost?
Load testing: Can remaining regions handle the traffic from a failed region without degrading?
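A drill can be as small as a timer wrapped around the traffic shift. A sketch, with hypothetical `shift_traffic` and `probe` helpers standing in for the real mechanics:

```python
import time

def shift_traffic(to_region: str) -> None:
    """Placeholder for the real traffic shift (DNS, load balancer, feature flag, etc.)."""

def probe(region: str) -> bool:
    """Placeholder for an end-to-end check that the target region serves correctly."""
    return True

def failover_drill(to_region: str) -> float:
    """Shift traffic to the target region and measure how long recovery takes."""
    start = time.monotonic()
    shift_traffic(to_region)
    while not probe(to_region):
        time.sleep(1)
    return time.monotonic() - start  # measured failover time, in seconds
```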
Most organizations discover their failover doesn't work during an actual failure. That's an expensive way to learn.
When It's Worth It
Not every system needs geographic redundancy.
It's worth the cost when:
- Regional outages are genuinely unacceptable to your business
- You have users worldwide and need low latency everywhere
- Regulations require data residency in specific regions
- You're already at scale where multi-region makes operational sense
For many systems, multi-availability-zone within a single region provides sufficient protection at a fraction of the cost and complexity. Your data center might flood, but a region's physically separate availability zones are unlikely to flood simultaneously.
Geographic redundancy is powerful protection. It's also trading one hard problem—regional failure—for another hard problem—distributed state. The art is knowing which problem you'd rather have.