
Every load balancer faces the same problem: requests arrive, and they need to go somewhere. But "somewhere" isn't good enough. The request needs to go to the right server—the one that will handle it fastest, the one that isn't already overwhelmed, the one that makes sense for this particular request.

Load balancing algorithms are different answers to a deceptively simple question: what does "fair" mean when requests aren't equal?

A request that takes 10ms to process isn't the same as one that takes 10 seconds. A server with 64GB of RAM isn't the same as one with 8GB. A user who needs session continuity isn't the same as one making a stateless API call. "Fair" depends on what you're optimizing for.

Round-Robin: Taking Turns

The simplest possible answer: take turns. Request 1 goes to Server A, request 2 to Server B, request 3 to Server C, then back to A.

This works beautifully when everything is equal—same servers, same requests, same processing time. It requires no state, no measurement, no decision-making. Just rotate through the list.
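
A minimal sketch of that rotation in Python (the server names are placeholders):

    # Round-robin: keep an index, rotate through the list.
    class RoundRobin:
        def __init__(self, servers):
            self.servers = servers
            self.index = 0

        def pick(self):
            server = self.servers[self.index]
            self.index = (self.index + 1) % len(self.servers)
            return server

    lb = RoundRobin(["server-a", "server-b", "server-c"])
    print([lb.pick() for _ in range(6)])
    # ['server-a', 'server-b', 'server-c', 'server-a', 'server-b', 'server-c']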

But the moment anything becomes unequal, round-robin breaks down. If Server A is twice as powerful as Server B, giving them equal traffic wastes A's capacity while overloading B. If some requests take seconds while others take milliseconds, the server that gets unlucky with slow requests falls behind while others sit idle.

Round-robin defines "fair" as "equal turns." That's only actually fair when every turn costs the same.

Weighted Round-Robin: Acknowledging Inequality

A server with weight 3 gets three requests for every one request a weight-1 server receives. Now you can mix a powerful server with smaller ones in the same pool.
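
One naive way to express this is to repeat each server in the rotation according to its weight; a sketch with made-up names and weights:

    # Naive weighted round-robin: a weight-3 server appears three times in the rotation.
    from itertools import cycle

    weights = {"big-server": 3, "small-server": 1}
    rotation = cycle([server for server, w in weights.items() for _ in range(w)])

    print([next(rotation) for _ in range(8)])
    # three 'big-server' turns for every 'small-server' turn

Production balancers usually interleave the weighted turns rather than sending them in bursts, but the proportions work out the same.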

This solves the capacity problem but not the workload problem. You still don't know which requests will be expensive until they're running. A weight-3 server might get three requests that each take 10 seconds while a weight-1 server gets one that finishes in 10ms.

Weighted round-robin defines "fair" as "proportional to capacity." Better, but still blind to what's actually happening.

Least Connections: Watching the Work

Stop guessing. Count.

The load balancer tracks how many active connections each server has. New requests go to the server with the fewest. If Server A has 50 connections and Server B has 30, the next request goes to B.
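
A sketch of the bookkeeping, assuming the balancer is told whenever a connection opens or closes:

    # Least connections: count active connections per server, send new work to the minimum.
    class LeastConnections:
        def __init__(self, servers):
            self.active = {server: 0 for server in servers}

        def pick(self):
            server = min(self.active, key=self.active.get)
            self.active[server] += 1      # connection opened
            return server

        def release(self, server):
            self.active[server] -= 1      # connection closed

    lb = LeastConnections(["server-a", "server-b"])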

This adapts to reality. Servers handling slow requests accumulate connections; servers handling fast requests clear them quickly. Over time, work distributes based on actual capacity to do work, not assumed capacity.

The cost is state. The load balancer must track every connection—when it opens, when it closes. At scale, this tracking has overhead.

Least connections defines "fair" as "whoever has the most room." It watches the game instead of just taking turns.

Weighted Least Connections: Both Dimensions

Combine the ideas. A server with weight 3 and 30 connections is "less loaded" than a server with weight 1 and 15 connections, even though 30 > 15. Divide connections by weight: the weight-3 server carries 10 connections per unit of capacity, the weight-1 server carries 15.
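
With the numbers above, the comparison is just a ratio; a sketch, not any particular balancer's exact formula:

    # Weighted least connections: lowest connections-per-weight ratio wins.
    servers = {
        "big-server":   {"weight": 3, "connections": 30},
        "small-server": {"weight": 1, "connections": 15},
    }

    def pick(servers):
        return min(servers, key=lambda s: servers[s]["connections"] / servers[s]["weight"])

    print(pick(servers))   # big-server: 30/3 = 10 beats 15/1 = 15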

This handles heterogeneous server pools with varying request durations—the most complex common scenario.

Least Response Time: Measuring What Matters

Connections are a proxy for load. Response time is the actual thing you care about.

The load balancer monitors how quickly each server responds. Requests flow toward faster servers. If Server A responds in 50ms and Server B in 200ms, A gets more traffic.
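
One way to track this is an exponentially weighted moving average of each server's latency; the smoothing factor below is an arbitrary illustrative choice:

    # Least response time: keep a moving average of latency per server, pick the fastest.
    class LeastResponseTime:
        def __init__(self, servers, alpha=0.2):
            self.avg_ms = {server: 0.0 for server in servers}   # new servers look fast and get probed first
            self.alpha = alpha

        def record(self, server, elapsed_ms):
            prev = self.avg_ms[server]
            self.avg_ms[server] = (1 - self.alpha) * prev + self.alpha * elapsed_ms

        def pick(self):
            return min(self.avg_ms, key=self.avg_ms.get)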

This optimizes directly for user experience. But it creates a feedback loop: slow servers get less traffic, which might make them faster (less load), which gets them more traffic, which makes them slower again. Some implementations combine response time with connection count to dampen oscillation.

There's also a starvation risk. A server that was measured as slow keeps getting bypassed; starved of traffic, it produces few fresh measurements, and the balancer may never notice it has recovered.

Least response time defines "fair" as "give work to whoever does it fastest." Ruthlessly efficient, potentially unstable.

IP Hash: Memory Without State

Some applications need the same user to reach the same server every time. Shopping carts, login sessions, anything stateful.

IP hash converts the client's IP address into a number, then uses modulo arithmetic to select a server. The same IP always produces the same number, always selects the same server.

No cookies. No session tracking. No communication between load balancer and application. Just math.
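
A naive sketch of the hash-and-modulo step (a production setup would more likely use consistent hashing, to soften the pool-change problem described next):

    # IP hash: hash the client address, take it modulo the pool size.
    import hashlib

    def pick(client_ip, servers):
        digest = hashlib.sha1(client_ip.encode()).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    servers = ["server-a", "server-b", "server-c"]
    print(pick("203.0.113.7", servers))   # the same IP always maps to the same server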

The catch: IP addresses aren't evenly distributed. A corporate NAT might send thousands of users from one IP, overloading one server while others are idle. And if the server pool changes—a server added or removed—the math changes, and sessions scatter.

IP hash defines "fair" as "consistent." The same client always gets the same answer, even if that creates imbalance.

Random: Embracing Chaos

Pick a server at random.

This sounds worse than round-robin, but at scale, random selection produces remarkably even distribution. The law of large numbers smooths out the chaos. And random requires no state at all—not even tracking whose turn it is.
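
In code, this is little more than a single call:

    # Random selection: no state, no bookkeeping.
    import random

    servers = ["server-a", "server-b", "server-c"]
    print(random.choice(servers))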

Random defines "fair" as "let probability handle it." Surprisingly effective when you have enough requests.

Power of Two Choices: A Small Change That Matters

Here's something genuinely surprising: randomly pick two servers instead of one, then choose the one with fewer connections.

That's it. That tiny change—two random choices instead of one—dramatically improves load distribution. The mathematics are elegant: by avoiding the worst option between two random choices, you avoid the worst outcomes that pure random produces.

No global state required. No tracking of all connections across all servers. Just two dice rolls and a comparison.
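
A sketch, assuming the two sampled servers' connection counts can be queried cheaply at pick time (the dictionary below stands in for that lookup). In the classic balls-into-bins analysis, the busiest server's load drops from roughly log n / log log n with one random choice to roughly log log n with two.

    # Power of two choices: sample two servers at random, keep the less loaded one.
    import random

    def pick(connections):
        a, b = random.sample(list(connections), 2)
        return a if connections[a] <= connections[b] else b

    connections = {"server-a": 12, "server-b": 3, "server-c": 7}
    print(pick(connections))   # never the more loaded of the two sampled servers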

This algorithm powers some of the largest distributed systems in the world. It's a beautiful example of how a small insight can have outsized effects.

Adaptive Algorithms: Everything at Once

Modern load balancers combine multiple signals:

  • Active connections
  • Response time
  • Health check status
  • CPU and memory utilization
  • Error rates

The algorithm weights these factors and routes requests to servers with the best composite score. Servers that are slow, overloaded, or throwing errors get less traffic. Healthy, fast servers get more.
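
A hypothetical composite score; the signals, weights, and formula below are illustrative assumptions rather than any particular product's algorithm:

    # Adaptive scoring sketch: lower score wins; unhealthy servers are excluded outright.
    WEIGHTS = {"connections": 0.3, "response_ms": 0.4, "error_rate": 0.3}

    def score(m):
        if not m["healthy"]:
            return float("inf")   # failed health check: never selected
        return (WEIGHTS["connections"] * m["connections"]
                + WEIGHTS["response_ms"] * m["response_ms"]
                + WEIGHTS["error_rate"] * m["error_rate"] * 100)

    servers = {
        "server-a": {"healthy": True,  "connections": 40, "response_ms": 55, "error_rate": 0.01},
        "server-b": {"healthy": True,  "connections": 25, "response_ms": 90, "error_rate": 0.00},
        "server-c": {"healthy": False, "connections": 5,  "response_ms": 20, "error_rate": 0.30},
    }
    print(min(servers, key=lambda name: score(servers[name])))   # server-a on these numbers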

This is the most sophisticated approach, but sophistication has costs. More signals mean more monitoring infrastructure. More factors mean more ways for the algorithm to behave unexpectedly. Debugging "why did this request go to that server?" becomes harder when the answer involves five weighted factors.

Adaptive algorithms define "fair" as "whatever the math says is optimal." Powerful, but opaque.

Choosing an Algorithm

The choice follows from your constraints:

Homogeneous servers, similar requests: Round-robin. Simple works.

Different server capacities: Weighted round-robin or weighted least connections.

Varying request durations: Least connections. Let reality guide distribution.

Session persistence required: IP hash, or sticky sessions at the application layer.

Latency is critical: Least response time, carefully tuned.

Distributed system, no central state: Power of two choices.

Complex production environment: Adaptive algorithms with proper monitoring.

Most load balancers let you change algorithms without downtime. If one doesn't work, try another. Watch connection distribution, response times, error rates. The data will tell you whether your definition of "fair" matches reality.
