Load Balancing in System Design: Algorithms, Health Checks & Failover (Visualized)

Load balancing is the practice of distributing incoming network requests across a pool of backend servers so that no single server becomes a bottleneck. It is the foundation of horizontal scaling: instead of buying one bigger machine, you add more machines and spread the work across them.

A load balancer sits between your clients and your servers. Clients connect to a single address; the balancer forwards each request to one healthy backend. The result is higher throughput, better fault tolerance, and the ability to add or remove servers without downtime.

How a Load Balancer Works

Every load balancer does three jobs: (1) accept client connections at a single virtual address, (2) choose a backend server using a scheduling algorithm, and (3) continuously check backend health so dead servers are skipped. The algorithm in step 2 is where most of the interesting decisions live.

Algorithm 1: Round-Robin

Round-robin hands each new request to the next server in rotation: server 1, then 2, then 3, then back to 1. It is the simplest strategy and works well when all servers are identical and every request costs roughly the same amount of work.

Figure: Round-robin load balancing. Each request goes to the next server in rotation; counters show requests handled per server.
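The rotation can be sketched in a few lines of Python. This is a minimal illustration, not a production balancer: the class name `RoundRobinBalancer` and the server names are hypothetical, and the sketch ignores health checks and concurrency.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed, repeating rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() yields the servers in order and wraps around forever.
        self._rotation = cycle(list(servers))

    def pick(self):
        """Return the next server in the rotation."""
        return next(self._rotation)


# Hypothetical server names, for illustration only.
balancer = RoundRobinBalancer(["server-1", "server-2", "server-3"])
picks = [balancer.pick() for _ in range(4)]
print(picks)  # ['server-1', 'server-2', 'server-3', 'server-1']
```

The fourth request wraps back to the first server, which is all the algorithm does: it keeps no information about how busy each server is.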

Weighted round-robin is a common variation: each server is assigned a weight proportional to its capacity, so a server with weight 3 receives three times as many requests as a server with weight 1. Use it when your fleet has mixed hardware.
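One way to implement weighting is the "smooth" variant sketched below, which interleaves picks instead of sending three requests in a row to the heavy server. The class name, server names, and weights are assumptions made for illustration; real balancers may use a different scheme.

```python
from collections import Counter


class WeightedRoundRobin:
    """Smooth weighted round-robin over a name -> weight mapping."""

    def __init__(self, weights):
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive numbers")
        self._weights = dict(weights)
        self._current = {name: 0 for name in weights}
        self._total = sum(weights.values())

    def pick(self):
        # Every server earns credit equal to its weight on each round.
        for name, weight in self._weights.items():
            self._current[name] += weight
        # The server with the most credit wins and pays back the total weight.
        chosen = max(self._current, key=self._current.get)
        self._current[chosen] -= self._total
        return chosen


# Hypothetical fleet: "big" has three times the capacity of "small".
wrr = WeightedRoundRobin({"big": 3, "small": 1})
counts = Counter(wrr.pick() for _ in range(8))
print(counts["big"], counts["small"])  # 6 2
```

Over any full cycle of four picks, "big" receives three requests and "small" receives one, matching the 3:1 weights.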

Algorithm 2: Least Connections

Round-robin breaks down when requests take wildly different amounts of time: a server stuck on a few slow requests still receives its full share of new ones, so work piles up there while other servers sit idle. Least-connections addresses this by routing each new request to the server with the fewest active connections, so load tends to even out.

Figure: Least-connections load balancing. Each request is sent to the server with the fewest active connections; bars show live load. Requests hold a server for a random duration, so least-connections keeps the fleet balanced.

Use least-connections for long-lived or uneven workloads — WebSocket connections, streaming, slow database-backed endpoints — where request duration varies a lot.
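A minimal sketch of the selection rule follows, assuming the balancer is told when a connection opens and closes. The `acquire` and `release` method names and the server names are hypothetical, and the sketch is not thread-safe.

```python
class LeastConnectionsBalancer:
    """Routes each new connection to the server with the fewest active ones."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Pick the least-loaded server and count the new connection."""
        # On a tie, min() returns the first server in insertion order.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Record that a connection to `server` has finished."""
        if self.active[server] > 0:
            self.active[server] -= 1


lb = LeastConnectionsBalancer(["s1", "s2", "s3"])
first, second, third = lb.acquire(), lb.acquire(), lb.acquire()
lb.release(second)     # the request on s2 finishes early
fourth = lb.acquire()  # s2 now has the fewest active connections
print(fourth)  # s2
```

Unlike round-robin, which would send the fourth request to s1, the balancer here reacts to the fact that s2 has become free.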

Algorithm 3: Hash-Based (Sticky) Routing

Sometimes you want the same client to always reach the same server, for in-memory session state or cache locality. Hash-based routing picks the server from a hash of the client IP (or another key), so the same input always maps to the same server. With a plain hash-modulo-N scheme, changing the number of servers remaps most keys; consistent hashing limits the movement to roughly a 1/N share of the keys when one of N servers is added or removed.
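The consistent-hashing idea can be sketched as a hash ring with virtual nodes. Everything named here is an assumption for illustration: the `HashRing` class, the choice of SHA-256, 100 virtual nodes per server, and the sample client IPs.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each server owns several points (virtual nodes)."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def lookup(self, key):
        """Return the server owning the first point at or after the key's hash."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["s1", "s2", "s3"])
keys = [f"10.0.0.{i}" for i in range(200)]  # hypothetical client IPs
before = {k: ring.lookup(k) for k in keys}
ring.add("s4")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to s4")
```

Adding a server only inserts new points on the ring, so the only keys that change owner are the ones the new server takes over; every other client keeps its existing server.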

Health Checks & Automatic Failover

A load balancer continuously probes each backend (an HTTP GET /health, a TCP connect, etc.). If a server stops responding, it is pulled out of rotation automatically, so a single dead machine never takes down the service. When the server recovers, it is added back. This is what turns a pile of servers into a self-healing service.

Figure: Health checks and failover. When Server 2 fails its health check it is removed from rotation and traffic reroutes to healthy servers; it rejoins automatically on recovery.
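The remove-and-rejoin behaviour can be sketched as a small state tracker fed by probe results. The thresholds are assumptions, not values from this article: a server leaves rotation after 3 consecutive failed probes (`fall`) and rejoins after 2 consecutive successes (`rise`), which avoids flapping on a single dropped probe. `http_probe` and `HealthTracker` are hypothetical names.

```python
import http.client
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if a GET to `url` (e.g. .../health) answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


class HealthTracker:
    """Tracks which servers are in rotation based on consecutive probe results."""

    def __init__(self, servers, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = {s: True for s in servers}
        # Consecutive probe results that contradict the current state.
        self._streak = {s: 0 for s in servers}

    def record(self, server, ok):
        if ok == self.healthy[server]:
            self._streak[server] = 0  # result agrees with state; reset the streak
            return
        self._streak[server] += 1
        needed = self.rise if ok else self.fall
        if self._streak[server] >= needed:
            self.healthy[server] = ok
            self._streak[server] = 0

    def in_rotation(self):
        return [s for s, up in self.healthy.items() if up]


tracker = HealthTracker(["s1", "s2", "s3"])
for _ in range(3):
    tracker.record("s2", False)  # three failed probes in a row
after_failure = tracker.in_rotation()
for _ in range(2):
    tracker.record("s2", True)   # two successful probes in a row
after_recovery = tracker.in_rotation()
print(after_failure, after_recovery)
```

In a real deployment the results passed to `record` would come from a periodic probe such as `http_probe`; the example feeds them in by hand so the state transitions are easy to follow.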

Layer 4 vs Layer 7

Layer 4 (transport) load balancers route by IP and port. They are extremely fast and protocol-agnostic, but blind to request content. Layer 7 (application) load balancers understand HTTP, so they can route by URL path (/api vs /static), host header, cookies, or headers — at a small CPU cost.

Aspect	Layer 4	Layer 7
Routes by	IP + port	HTTP path, host, headers, cookies
Speed	Very fast	Slightly slower (parses HTTP)
Features	Basic forwarding	Path routing, TLS termination, rewrites
Examples	AWS NLB, IPVS	Nginx, HAProxy, AWS ALB, Envoy
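To make the Layer 7 column concrete, here is a minimal sketch of path-based routing: the balancer inspects the URL path and picks a backend pool before any algorithm chooses a server within that pool. The route table, pool names, and the `choose_pool` function are hypothetical.

```python
# Hypothetical route table: path prefix -> backend pool.
ROUTES = [
    ("/api", ["api-1", "api-2"]),
    ("/static", ["static-1"]),
]
DEFAULT_POOL = ["web-1"]


def choose_pool(path, routes=ROUTES, default=DEFAULT_POOL):
    """Return the pool whose prefix matches `path`; the longest prefix wins."""
    best = None
    for prefix, pool in routes:
        # Match whole path segments only, so "/apiary" does not match "/api".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, pool)
    return best[1] if best else default


print(choose_pool("/api/users"))       # ['api-1', 'api-2']
print(choose_pool("/static/app.css"))  # ['static-1']
print(choose_pool("/checkout"))        # ['web-1']
```

A Layer 4 balancer cannot make this decision, because it never parses the request and only sees the destination IP and port.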

Choosing an Algorithm

Algorithm	How it picks	Best for
Round-robin	Next server in rotation	Uniform servers, similar request cost
Weighted round-robin	Proportional to capacity	Mixed hardware sizes
Least connections	Fewest active connections	Long-lived / uneven requests
IP / consistent hash	Hash of client or key	Sticky sessions, cache locality

Common Pitfalls

Single point of failure: the load balancer itself must be redundant (active-passive or active-active), or it becomes the thing that takes you down.

Sticky sessions everywhere: they undermine even load distribution and make failover lose state. Prefer stateless servers with shared session storage.

No health checks: without them, the balancer happily forwards traffic into a black hole.

Frequently Asked Questions

What is the difference between a load balancer and a reverse proxy?

A reverse proxy forwards client requests to backend servers and can add caching, TLS termination, and compression. A load balancer is a reverse proxy whose primary job is distributing traffic across many backends. In practice, tools like Nginx and HAProxy fill both roles.

Which load balancing algorithm is best?

There is no single best algorithm; it depends on your traffic. Use round-robin for uniform servers and short requests, least-connections for long-lived or variable requests, and hash-based routing when you need stickiness. Weighted variants handle mixed hardware.

Does a load balancer improve availability?

Yes. By spreading traffic across redundant servers and removing unhealthy ones via health checks, a load balancer lets the service survive individual server failures — provided the balancer itself is made redundant.

A load balancer turns a pile of servers into a single, resilient service. Add health checks and the whole thing heals itself.
— alokknight Engineering