DNS Failover and High Availability

DNS has a fundamental problem: to be fast, it must cache. But to fail over, caches must expire. You can't have both.

This tension sits at the heart of every DNS-based high availability system. When a server dies, you can update DNS records in milliseconds—but those updates mean nothing to the millions of resolvers still holding cached copies of the old address. They'll keep sending traffic to a dead server until their cache expires.

Understanding this limitation—and the clever workarounds engineers have built—is essential for designing systems that actually stay up.

How DNS Failover Works

DNS failover monitors your servers and automatically updates DNS records when something breaks. Instead of a human noticing an outage, logging into a control panel, and updating records, automated health checks detect the failure and swap in a backup server's address within seconds.

The mechanism is straightforward: your DNS provider continuously probes your servers—via HTTP, TCP, or ICMP ping—to verify they're responding. When checks fail, the provider removes the dead server's IP from DNS responses or replaces it with a backup. When the original recovers, it's added back.
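
The control loop behind this is simple to sketch. Below is a minimal Python example; the probe uses plain HTTP, and set_record() stands in for whatever record-update API your provider exposes (the addresses and hostname are illustrative):

```python
import time
import urllib.request

PRIMARY = "203.0.113.10"   # illustrative addresses
BACKUP = "203.0.113.20"

def is_healthy(ip: str, timeout: float = 3.0) -> bool:
    """HTTP probe: the server must answer 200 on its health endpoint."""
    try:
        with urllib.request.urlopen(f"http://{ip}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def set_record(name: str, ip: str) -> None:
    """Stand-in for the provider's record-update API."""
    print(f"pointing {name} at {ip}")

def failover_loop(name: str = "www.example.com") -> None:
    active = PRIMARY
    while True:
        desired = PRIMARY if is_healthy(PRIMARY) else BACKUP
        if desired != active:      # update DNS only when the state actually changes
            set_record(name, desired)
            active = desired
        time.sleep(10)             # probe interval
```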

This automation matters because humans are slow. By the time someone notices a failure, investigates, and makes changes, customers have already experienced downtime. Automated failover reduces mean time to recovery from minutes or hours to seconds.

Health Checks That Actually Work

A ping proves a server is reachable. It doesn't prove your application is running.

A server can respond to ping while the web server process has crashed, the database connection has failed, or the application is returning 500 errors to every request. Simple health checks miss these failures entirely.

Effective health checks verify what matters: your application, end to end. An HTTP health check requests a dedicated endpoint—/health or /status—and verifies it returns 200 OK with expected content. Better checks query the database, verify cache connectivity, confirm third-party APIs are reachable. They answer the question users actually care about: "Will this server handle my request correctly?"
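
A sketch of such an endpoint, here using Flask; ping_database() and ping_cache() are placeholders for whatever your stack actually depends on:

```python
from flask import Flask, jsonify

app = Flask(__name__)

def ping_database() -> bool:
    return True   # placeholder: run a trivial query (e.g. SELECT 1) and report success

def ping_cache() -> bool:
    return True   # placeholder: round-trip the cache (e.g. a Redis PING)

@app.route("/health")
def health():
    checks = {"database": ping_database(), "cache": ping_cache()}
    ok = all(checks.values())
    # 200 only when every dependency answers; 503 tells the health checker to fail over
    return jsonify(status="ok" if ok else "degraded", checks=checks), (200 if ok else 503)
```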

Timing matters too. Checking every 30 seconds with a threshold of three consecutive failures means up to 90 seconds to detect an outage. Checking every 10 seconds with a single failure threshold detects problems faster but risks false positives from transient network blips. The right balance depends on what costs more: a minute of undetected downtime or an unnecessary failover.
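
In other words, worst-case detection time is roughly the probe interval multiplied by the failure threshold:

```python
def detection_time(interval_s: int, consecutive_failures: int) -> int:
    """Roughly how long an outage can go unnoticed before failover triggers."""
    return interval_s * consecutive_failures

print(detection_time(30, 3))   # 90 seconds: conservative, resistant to blips
print(detection_time(10, 1))   # 10 seconds: fast, but a single blip triggers failover
```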

Geography complicates things further. A server might appear down from Virginia but respond perfectly from Frankfurt—network routing issues affect some paths but not others. Sophisticated providers probe from multiple global locations and only trigger failover when a server fails from several vantage points simultaneously.
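
A sketch of that multi-vantage decision, assuming each probe location reports a simple up/down result:

```python
def should_fail_over(probe_results: dict[str, bool], quorum: int = 2) -> bool:
    """Trigger failover only when at least `quorum` vantage points see the server down."""
    down = [loc for loc, up in probe_results.items() if not up]
    return len(down) >= quorum

# Virginia hits a routing problem, but Frankfurt and Singapore still reach the server:
probes = {"us-east": False, "eu-central": True, "ap-southeast": True}
print(should_fail_over(probes))   # False: one bad path is not an outage
```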

The TTL Tradeoff

Time to live (TTL) determines how long resolvers cache your DNS records. It's also the throttle on your failover speed.

A 3600-second TTL means resolvers can cache your records for an hour. Update DNS to point to a backup server, and some clients will still hit the dead server for another 59 minutes. For critical services, this is catastrophic.

Lower the TTL to 60 seconds, and failover propagates within a minute or two. But now every resolver queries your nameservers every minute instead of every hour, so query volume against your nameservers can grow as much as sixty-fold. Latency increases slightly too, since more DNS lookups mean more round trips.

Most production systems settle on 60-300 seconds for records that might need failover. Some implement dynamic TTLs: long values during normal operation, automatically shortened when health checks detect instability.
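
Any provider API that lets you upsert records can implement the dynamic-TTL idea. A sketch against Route 53 via boto3; the zone ID, record name, and the signal that marks instability are all illustrative:

```python
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0123456789ABC"           # illustrative hosted zone ID
NORMAL_TTL, UNSTABLE_TTL = 300, 60   # long while healthy, short once checks look shaky

def set_record_ttl(name: str, ip: str, ttl: int) -> None:
    """Upsert the A record with the chosen TTL."""
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )

def on_health_change(name: str, ip: str, unstable: bool) -> None:
    # shorten the TTL as soon as health checks wobble so a later failover propagates fast
    set_record_ttl(name, ip, UNSTABLE_TTL if unstable else NORMAL_TTL)
```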

But here's the uncomfortable truth: TTL is a suggestion, not a command. Some resolvers ignore it and cache longer. Operating systems add their own caching layer. Browsers maintain yet another cache. Even with a 60-second TTL, some users will experience delays until every cache layer expires.

Nameserver Redundancy

Your application can fail over between servers, but what happens when DNS itself goes down?

Every domain registers multiple nameservers—typically two to four. These must be distributed across different IP addresses, different networks, and ideally different geographic regions. When one nameserver becomes unreachable, resolvers automatically query the others. This redundancy is baked into the DNS protocol.

But distribution matters. Four nameservers in the same data center provide zero protection against that data center losing power. Four nameservers on the same network provider provide zero protection against that provider having an outage. Real redundancy requires geographic separation across multiple autonomous systems.

Many organizations run hybrid configurations: two nameservers in their own infrastructure, two with a commercial provider like Cloudflare or Route 53. This protects against both self-inflicted infrastructure failures and provider-level outages.
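
You can spot-check your own distribution. A small sketch, assuming the dnspython package is available; deciding whether the resulting addresses really live on different networks and providers is still a manual (or routing-data) exercise:

```python
import dns.resolver   # pip install dnspython

def nameserver_addresses(domain: str) -> dict[str, list[str]]:
    """Map each registered nameserver to the IPv4 addresses it resolves to."""
    result: dict[str, list[str]] = {}
    for ns in dns.resolver.resolve(domain, "NS"):
        host = str(ns.target).rstrip(".")
        result[host] = [a.address for a in dns.resolver.resolve(host, "A")]
    return result

for host, addrs in nameserver_addresses("example.com").items():
    print(host, addrs)   # nameservers sharing a prefix or provider share a failure domain
```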

Anycast: Failover at Network Speed

Anycast does something that sounds impossible: it announces the same IP address from multiple locations simultaneously.

Query an anycast DNS address, and the Internet's routing protocols (BGP, specifically) automatically direct your query to the topologically nearest node. The same IP address exists in Tokyo, London, and New York at once. The routing system doesn't know or care that the address is announced from many places; it simply picks the closest one.

This provides failover faster than DNS could ever achieve. When a data center goes offline, BGP stops routing traffic there within seconds. No health checks, no TTL expiration, no propagation delay. Traffic automatically flows to the next-nearest location.

Anycast also distributes load globally. A DDoS attack overwhelming servers in one location doesn't affect other locations—queries route around the problem. Performance improves too: queries answered by nearby servers don't traverse continents.

Building your own anycast network requires IP address allocations, BGP peering agreements, and globally distributed infrastructure. This is why most organizations use specialized DNS providers—Cloudflare, Google, Route 53—who've already built it.

Where DNS Failover Falls Short

DNS failover is valuable. It's also limited. Understanding the limitations helps you build systems that actually survive failures.

Caching delays everything. You can update DNS instantly, but clients honor their cached values until TTL expires. Application-level failover—a load balancer detecting a dead backend and routing around it—takes effect immediately.

DNS can't see load. It redirects traffic between servers but can't make intelligent decisions based on CPU utilization, response times, or queue depth. Application-level load balancers can distribute traffic based on real-time server health.

Sessions don't survive. A user with an active session on Server A gets redirected to Server B after failover—and their session is gone unless you've implemented session sharing. Application-level failover maintains session affinity more reliably.

Partial failures are invisible. A server responding slowly or failing 10% of requests might pass health checks. It's degraded, not dead, and DNS health checks typically can't tell the difference. Application-level circuit breakers can detect and respond to degradation.

The solution isn't choosing between DNS and application-level failover. It's using both. DNS failover handles coarse-grained redundancy between data centers. Application-level failover handles fine-grained routing between servers. Together, they create defense in depth—failures at any layer get caught by another.

