Replication in System Design: Leader-Follower, Multi-Leader, and Leaderless (Visualized)

Replication is the process of keeping identical copies of data on multiple machines (called replicas or nodes) so that the system remains available if one node fails and can serve read traffic from many locations simultaneously.

Every serious database — MySQL, PostgreSQL, MongoDB, Cassandra, Redis — ships with a replication mechanism. Despite the variety, the core problem is always the same: every write that happens on one node must eventually reach every other node holding a copy. How that propagation happens — who is authorised to accept writes, how long followers are allowed to lag, and what happens when two replicas disagree — is what separates the different replication models from one another.

Why Replication?

Replication serves three distinct goals that are worth keeping separate. High availability: if one machine dies, another replica already has the data and can keep serving requests. Read scalability: read traffic can be spread across many replicas so no single node is overwhelmed. Reduced latency: you can place a replica in each geographic region so users read from a node that is physically close to them. A single-node database can provide none of these.

Leader-Follower (Primary-Replica) Replication

The most common model is leader-follower replication (also called primary-replica or master-slave). One node — the leader — is the single source of truth for writes. Every write goes to the leader first; the leader then pushes a stream of change events (a replication log or change stream) to every follower. Followers apply those changes in order and serve read queries from their local copy of the data. Clients that need the freshest possible data read from the leader; clients that can tolerate slightly stale data read from any follower.

This model is simple to reason about: there is never a write conflict because only one node accepts writes. PostgreSQL streaming replication, MySQL binlog replication, and MongoDB replica sets all follow this pattern. The trade-off is that the leader is a write bottleneck — you cannot scale writes by adding more replicas, only reads.
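
As a rough illustration of the flow described above, the following minimal in-memory sketch models a leader that appends every write to an ordered log and followers that replay that log. The class names (Leader, Follower) and the applied_offset field are hypothetical and exist only for this example; real databases ship the log over the network and persist it to disk.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    data: dict = field(default_factory=dict)
    applied_offset: int = 0  # number of log entries applied so far

    def apply(self, log: list) -> None:
        # Replay only the entries this follower has not seen, strictly in log order.
        for key, value in log[self.applied_offset:]:
            self.data[key] = value
        self.applied_offset = len(log)


@dataclass
class Leader:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered replication log

    def write(self, key, value) -> None:
        # The leader is the only node that accepts writes.
        self.data[key] = value
        self.log.append((key, value))


leader = Leader()
followers = [Follower(), Follower()]

leader.write("user:1", "alice")
leader.write("user:1", "alice-updated")

# Only the first follower has caught up; the second still holds stale (empty) data.
followers[0].apply(leader.log)
print(followers[0].data)  # {'user:1': 'alice-updated'}
print(followers[1].data)  # {}
```

Because followers replay the same log in the same order, they converge on the leader's state once they catch up.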

Synchronous vs Asynchronous Replication and Replication Lag

When the leader receives a write, it must decide whether to wait for at least one follower to confirm receipt before acknowledging success to the client. Synchronous replication waits: the leader holds the client's write transaction open until a designated follower (or a quorum of followers) has durably stored the change. This guarantees that no acknowledged write is lost if the leader crashes immediately afterwards, provided the confirming follower survives, but it adds latency because the client waits for a round-trip to the follower. Asynchronous replication does not wait: the leader acknowledges success the moment it writes to its own durable log, and ships the change to followers in the background. The client gets a fast response, but if the leader crashes before a follower catches up, those recent writes are lost.

The gap between what the leader has committed and what a follower has applied is called replication lag. Under normal conditions lag is often only a few milliseconds. Under heavy write load, a slow follower, or a network hiccup, lag can stretch to seconds or minutes. During that window, a client that reads from the lagging follower sees stale data. Two common anomalies follow: a read-your-own-writes violation, where a user does not see a change they just made, and a monotonic-read violation, where successive reads served by different followers appear to move backwards in time. Applications that need stronger guarantees must either read from the leader, use a read-after-write mechanism, or use synchronous replication for at least one follower.
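
One possible read-after-write mechanism is to remember when each user last wrote and send that user's reads to the leader for a short window afterwards. The sketch below assumes a hypothetical ReadRouter class and a 5-second window chosen as an upper bound on expected lag; both are illustrative, not taken from any particular database.

```python
import time

RECENT_WRITE_WINDOW = 5.0  # seconds; assumed upper bound on replication lag


class ReadRouter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self._clock()

    def choose_node(self, user_id: str) -> str:
        # Users who wrote recently read from the leader; everyone else may use a follower.
        last = self._last_write.get(user_id)
        if last is not None and self._clock() - last < RECENT_WRITE_WINDOW:
            return "leader"
        return "follower"


# A fake clock makes the example deterministic.
now = [100.0]
router = ReadRouter(clock=lambda: now[0])

router.record_write("alice")
print(router.choose_node("alice"))  # leader
print(router.choose_node("bob"))    # follower

now[0] += 10  # ten seconds later the window has expired
print(router.choose_node("alice"))  # follower
```

This only protects the user who made the write; other users may still see the old value until the followers catch up.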

Asynchronous replication lag visualized

A write commits on the leader instantly; followers receive and apply it after a visible delay — the replication lag window.

Read Replicas: Scaling Reads Horizontally

Read replicas are followers that exist specifically to absorb read traffic. Because most production workloads are read-heavy (often 80-95% reads), adding read replicas can dramatically increase read throughput without adding query load to the leader. A typical architecture routes all INSERT, UPDATE, and DELETE statements to the leader's endpoint, and all SELECT statements to a load-balanced pool of read replicas. Cloud platforms make this easy: AWS RDS and Aurora, Google Cloud SQL, and Azure Database for PostgreSQL all let you add read replicas with a single API call, with per-instance limits that vary by service and engine (commonly between 5 and 15).

The key caveat is replication lag: a read replica may not yet reflect the latest write. Applications that can tolerate eventually-consistent reads (dashboards, analytics, recommendation feeds) are perfect candidates for read replicas. Flows that need read-your-own-writes consistency (a user viewing their just-submitted form, a financial balance check) must read from the leader or wait for confirmed replication.
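
The routing rule above can be sketched as a small statement router. This is a simplified illustration: the QueryRouter class, the endpoint strings, and the needs_fresh_read flag are hypothetical, replicas are picked round-robin rather than by load, and real proxies parse SQL far more carefully.

```python
import itertools

WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}


class QueryRouter:
    def __init__(self, leader: str, replicas: list):
        self.leader = leader
        self._replicas = itertools.cycle(replicas)  # simple round-robin

    def route(self, sql: str, needs_fresh_read: bool = False) -> str:
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb in WRITE_VERBS or needs_fresh_read:
            return self.leader
        if verb == "SELECT":
            return next(self._replicas)
        # Anything else (DDL, transaction control, ...) goes to the leader to be safe.
        return self.leader


router = QueryRouter("leader:5432", ["replica-1:5432", "replica-2:5432"])

print(router.route("SELECT * FROM orders"))                       # replica-1:5432
print(router.route("UPDATE orders SET paid = true WHERE id = 7")) # leader:5432
print(router.route("SELECT balance FROM accounts WHERE id = 7",
                   needs_fresh_read=True))                        # leader:5432
print(router.route("SELECT * FROM products"))                     # replica-2:5432
```

The needs_fresh_read flag is how a caller would mark flows such as a balance check that cannot tolerate a stale replica.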

Read scaling with read replicas

Writes go exclusively to the leader; read traffic fans out across all three replicas, with each read request served by whichever replica has the lightest load.

Automatic Failover and Leader Promotion

In a leader-follower setup the leader is still a single point of failure: if it crashes, writes stop until a new leader is elected. Automatic failover is the mechanism by which the cluster detects a leader failure and promotes the most up-to-date follower to become the new leader, so the outage is measured in seconds rather than minutes of manual intervention.

The typical failover sequence is: (1) a health monitor detects that the leader has stopped responding to heartbeats; (2) the remaining followers hold an election and the follower with the highest replication offset wins; (3) the winner is promoted to leader and begins accepting writes; (4) other followers reconnect to the new leader and resume streaming the replication log; (5) clients are redirected to the new leader's endpoint through DNS failover, a virtual IP, or a connection proxy. PostgreSQL exposes promotion through pg_promote() but leaves failure detection and orchestration to an external tool such as Patroni or repmgr. MySQL relies on Group Replication or an external tool such as Orchestrator. MongoDB replica sets handle failover automatically with an election protocol based on Raft.
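
Steps (1) and (2) of the sequence above can be sketched as follows. This is a deliberately simplified model: the Node class, the 10-second HEARTBEAT_TIMEOUT, and the node names are assumptions for illustration, and a real election (Raft, for instance) also requires a majority vote rather than a single observer picking a winner.

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is presumed dead (assumed value)


@dataclass
class Node:
    name: str
    replication_offset: int  # how far this node has applied the replication log
    last_heartbeat: float    # timestamp of the most recent heartbeat


def leader_is_down(leader: Node, now: float) -> bool:
    return now - leader.last_heartbeat > HEARTBEAT_TIMEOUT


def elect_new_leader(followers: list, now: float) -> Node:
    # Only followers that are still sending heartbeats are eligible.
    alive = [f for f in followers if now - f.last_heartbeat <= HEARTBEAT_TIMEOUT]
    if not alive:
        raise RuntimeError("no healthy follower available for promotion")
    # The most up-to-date follower loses the fewest writes on promotion.
    return max(alive, key=lambda f: f.replication_offset)


now = 100.0
leader = Node("db-1", replication_offset=1050, last_heartbeat=80.0)
followers = [
    Node("db-2", replication_offset=1042, last_heartbeat=99.0),
    Node("db-3", replication_offset=1049, last_heartbeat=98.5),
    Node("db-4", replication_offset=1050, last_heartbeat=70.0),  # also unresponsive
]

if leader_is_down(leader, now):
    new_leader = elect_new_leader(followers, now)
    print(new_leader.name)  # db-3
```

Note that db-3 is one entry behind the failed leader (1049 versus 1050), so with asynchronous replication that last write would not survive the failover.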

Failover is not risk-free. If the old leader had accepted writes that the promoted follower had not yet received when it crashed, those writes are permanently lost unless synchronous replication was in use. A misconfigured system can also experience split-brain: two nodes both believe they are the leader and both accept writes, leading to divergent data that must be reconciled manually. This is why production systems add fencing tokens or STONITH (Shoot The Other Node In The Head) to guarantee only one leader can write at a time.
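
A fencing token can be pictured as a number that increases with every promotion; the storage layer remembers the highest token it has seen and rejects writes carrying an older one. The FencedStorage class below is a hypothetical sketch of that check, not the API of any specific system.

```python
class FencedStorage:
    def __init__(self):
        self.data = {}
        self.highest_token = 0  # highest fencing token seen so far

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means the writer is a deposed leader.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = FencedStorage()

# The old leader holds token 1; after failover the new leader is issued token 2.
print(storage.write(1, "balance", "100"))  # True
print(storage.write(2, "balance", "120"))  # True
# The old leader wakes up, still believing it is in charge.
print(storage.write(1, "balance", "90"))   # False
print(storage.data["balance"])             # 120
```

The stale write is refused without any coordination between the two would-be leaders, which is what makes the scheme robust against split-brain.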

Leader failure and automatic follower promotion

The leader stops responding; after a timeout a follower is elected as the new leader and traffic resumes.

Multi-Leader Replication

In multi-leader replication (also called multi-master), more than one node accepts writes. Each leader replicates its writes to every other leader and all followers. The primary motivation is multi-datacenter deployments: a single-leader setup crossing datacenters means every write must traverse a WAN link to the leader, adding 50-200 ms of latency. With a leader in each datacenter, clients write locally and replication happens asynchronously between datacenters in the background.

The trade-off is write conflicts: if two clients in different datacenters update the same row at about the same time, both local leaders accept the write, and when the replication streams meet, the system must decide which value wins. Common conflict-resolution strategies include: last-write-wins (LWW), which uses wall-clock or logical timestamps to pick the later write and discards the other; merge, which combines both values (this works for data types designed to merge, such as CRDT counters or sets); and application-level resolution, which exposes the conflict to the application (CouchDB, for example, keeps the conflicting revisions for the application to resolve). Last-write-wins is popular but loses data silently; application-level resolution is safest but adds code complexity.
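
To make the difference between the first two strategies concrete, the sketch below resolves the same conflict twice: once with last-write-wins and once with a grow-only counter (G-Counter), one of the simplest CRDTs. The function names and the "us"/"eu" datacenter labels are illustrative assumptions.

```python
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each version is (timestamp, value); the later timestamp wins."""
    return a if a[0] >= b[0] else b


def gcounter_merge(a: dict, b: dict) -> dict:
    """G-Counter: each leader tracks its own count; merge takes the per-leader maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in sorted(a.keys() | b.keys())}


def gcounter_value(counter: dict) -> int:
    return sum(counter.values())


# Two datacenters count "likes" on the same post while disconnected from each other.
# LWW on the raw totals: the US leader saw 3 likes, the EU leader saw 2 slightly later.
us_version = (1700000001.0, 3)
eu_version = (1700000002.0, 2)
print(lww_merge(us_version, eu_version)[1])  # 2  (the three US likes are silently lost)

# G-Counter: each side records only its own increments, so nothing is discarded.
us_counter = {"us": 3}
eu_counter = {"eu": 2}
merged = gcounter_merge(us_counter, eu_counter)
print(merged)                  # {'eu': 2, 'us': 3}
print(gcounter_value(merged))  # 5
```

The merge is commutative and idempotent, so the leaders can exchange state in any order, any number of times, and still agree.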

Leaderless Replication

In leaderless replication, pioneered by Amazon's Dynamo and adopted by Cassandra and Riak, there is no single designated leader. Any replica can accept any write. The client (or a coordinator node) sends each write to all N replicas of the key and considers the write successful once at least W of them confirm it. Reads are likewise sent to the replicas in parallel; the client waits for R responses and picks the one with the highest version number. As long as W + R > N (where N is the number of replicas), every read set overlaps every write set in at least one node, so at least one response reflects the latest acknowledged write. This is a quorum-level guarantee rather than full linearizability: concurrent writes, partially failed writes, and sloppy quorums can still expose stale or conflicting values.
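
The quorum condition can be checked with a toy model. In this sketch the Replica class, the reachable lists (which stand in for network reachability), and the explicit version numbers are all simplifying assumptions; real systems derive versions from timestamps or vector clocks.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None


def quorum_write(replicas, reachable, version, value, w) -> bool:
    """Apply the write on every reachable replica; succeed if at least w acknowledge."""
    acks = 0
    for i in reachable:
        replica = replicas[i]
        if version > replica.version:
            replica.version, replica.value = version, value
        acks += 1
    return acks >= w


def quorum_read(replicas, reachable, r) -> tuple:
    """Collect r responses and return the (version, value) pair with the highest version."""
    responses = [(replicas[i].version, replicas[i].value) for i in reachable]
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    return max(responses[:r], key=lambda resp: resp[0])


N, W, R = 3, 2, 2  # W + R > N, so read and write sets must overlap
replicas = [Replica() for _ in range(N)]

# Replica 2 is unreachable during the write, but two acknowledgements are enough.
print(quorum_write(replicas, reachable=[0, 1], version=1, value="v1", w=W))  # True

# A read that includes the stale replica 2 still overlaps the write set at replica 1.
print(quorum_read(replicas, reachable=[1, 2], r=R))  # (1, 'v1')

# With R = 1 (so W + R = N) a read can land on the stale replica alone.
print(quorum_read(replicas, reachable=[2], r=1))  # (0, None)
```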

Leaderless systems handle node failures gracefully — any node can go down without stopping writes or reads as long as a quorum is available. The downside is that stale data can accumulate on lagging nodes. To fix this, leaderless databases use two repair mechanisms: read repair — when a read detects a stale value on one replica, it writes the fresher value back; and anti-entropy — a background process continuously compares replicas and copies missing data. Conflict resolution is still required (Cassandra uses LWW by default), but the absence of a single leader makes the system more resilient and naturally partition-tolerant.
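
Read repair, the first of the two mechanisms above, can be sketched in a few lines. Here each replica is modelled as a plain dict mapping a key to a (version, value) pair; that representation and the read_with_repair name are assumptions made for the example.

```python
def read_with_repair(replicas: list, key: str):
    """Read a key from every replica, return the newest value, and fix stale copies."""
    responses = [replica.get(key, (0, None)) for replica in replicas]
    newest = max(responses, key=lambda resp: resp[0])
    for replica, resp in zip(replicas, responses):
        if resp[0] < newest[0]:
            replica[key] = newest  # write the fresher value back to the stale replica
    return newest[1]


replicas = [
    {"cart": (2, "book,pen")},  # up to date
    {"cart": (1, "book")},      # missed the latest write
    {},                         # was down for both writes
]

print(read_with_repair(replicas, "cart"))  # book,pen
print(replicas[1]["cart"])                 # (2, 'book,pen')
print(replicas[2]["cart"])                 # (2, 'book,pen')
```

Read repair only heals keys that are actually read, which is why a background anti-entropy process is still needed for rarely accessed data.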

Replication Models Compared

Property	Leader-Follower	Multi-Leader	Leaderless
Write scalability	Single node (write bottleneck)	Per-datacenter parallel writes	Any node accepts writes
Read scalability	Many read replicas	Local reads per datacenter	Quorum reads across N nodes
Conflict handling	No conflicts (one writer)	Required — LWW, merge, or app	Required — LWW, vector clocks
Failover	Automatic promotion needed	Other leaders keep serving	No leader to elect; quorum continues
Consistency	Strong (sync) or eventual (async)	Eventual (cross-DC lag)	Tunable (W + R vs N)
Best for	OLTP, most web apps	Multi-region active-active	High availability, write-heavy, global scale
Examples	PostgreSQL, MySQL, MongoDB	CouchDB, MySQL Group Replication (multi-primary mode)	Cassandra, Riak, Dynamo (Amazon's original design)

Synchronous vs Asynchronous: The Durability-Latency Trade-off

In practice, most systems use a hybrid approach, often called semi-synchronous: one follower is synchronous and the rest are asynchronous. PostgreSQL's synchronous_standby_names lets you configure exactly which standbys must confirm before a commit returns. MySQL's semi-synchronous replication plugin holds the commit until at least one replica has received (but not necessarily applied) the event. This ensures that at least one replica holds every acknowledged write, giving you a safe failover target without the latency penalty of waiting for every follower.

Mode	Latency	Durability on leader crash	Throughput
Fully synchronous	High (waits for all replicas)	No data loss	Lower (blocked on slowest replica)
Semi-synchronous	Medium (waits for one replica)	No acknowledged write lost while the confirming replica survives	Good
Fully asynchronous	Low (leader only)	Writes not yet replicated are lost (bounded by replication lag)	Highest
Quorum (leaderless)	Medium (waits for W of N)	Tunable by W value	Flexible
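
The latency column of the table can be reproduced with a small calculation. The sketch below is a simplified model with assumed numbers: it takes the time each replica needs to acknowledge a write and returns how long the commit waits on replication, ignoring the local write itself. Semi-synchronous is modelled as "first acknowledgement from any replica".

```python
def replication_wait_ms(mode: str, ack_times_ms: list, w: int = 0) -> float:
    """Extra time a commit waits for replica acknowledgements under each mode."""
    acks = sorted(ack_times_ms)
    if mode == "sync":
        return acks[-1]        # blocked on the slowest replica
    if mode == "semi-sync":
        return acks[0]         # first replica to confirm is enough
    if mode == "async":
        return 0.0             # the leader does not wait at all
    if mode == "quorum":
        if not 1 <= w <= len(acks):
            raise ValueError("w must be between 1 and the number of replicas")
        return acks[w - 1]     # the w-th fastest acknowledgement
    raise ValueError(f"unknown mode: {mode}")


# Assumed acknowledgement times: one local replica, one nearby, one cross-region.
ack_times = [2.0, 15.0, 80.0]

print(replication_wait_ms("sync", ack_times))         # 80.0
print(replication_wait_ms("semi-sync", ack_times))    # 2.0
print(replication_wait_ms("async", ack_times))        # 0.0
print(replication_wait_ms("quorum", ack_times, w=2))  # 15.0
```

A single slow replica dominates fully synchronous commits, which is the main reason the hybrid modes are more common in practice.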

Frequently Asked Questions

What is the difference between replication and sharding?

Replication keeps copies of the same data on multiple nodes — every replica holds the full dataset. Sharding (horizontal partitioning) splits the dataset so each node holds only a subset of the data. Replication improves read throughput and fault tolerance; sharding improves write throughput and allows datasets larger than a single machine's disk. Production systems often combine both: data is sharded across N shard groups, and each shard group is replicated across several nodes for redundancy.
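
The combination of both techniques can be sketched as a two-step lookup: hash the key to find its shard, then look up that shard's replica group. The node names and the SHARD_GROUPS layout below are hypothetical, and simple modulo hashing is used for brevity even though production systems often prefer consistent hashing to limit data movement when shards are added.

```python
import hashlib

# Hypothetical layout: three shards, each replicated on three nodes (first node = leader).
SHARD_GROUPS = {
    0: ["s0-leader", "s0-replica-a", "s0-replica-b"],
    1: ["s1-leader", "s1-replica-a", "s1-replica-b"],
    2: ["s2-leader", "s2-replica-a", "s2-replica-b"],
}


def shard_for(key: str, num_shards: int) -> int:
    # A stable hash keeps the mapping identical across processes and restarts.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def nodes_for(key: str) -> list:
    """Sharding picks the group; replication means the whole group holds the key."""
    return SHARD_GROUPS[shard_for(key, len(SHARD_GROUPS))]


def write_target(key: str) -> str:
    return nodes_for(key)[0]  # writes go to the shard's leader


def read_targets(key: str) -> list:
    return nodes_for(key)[1:]  # reads may be served by that shard's replicas


for user in ["user:1", "user:2", "user:3"]:
    print(user, "->", write_target(user), read_targets(user))
```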

How much replication lag is acceptable?

It depends entirely on the application's consistency requirements. For analytics dashboards or recommendation feeds, several seconds of lag is invisible to users and perfectly fine. For a social media timeline, a few hundred milliseconds is usually acceptable. For financial transactions or any flow where a user immediately reads what they just wrote, lag must be near-zero — use synchronous replication for at least one replica, or route those reads to the leader. You can monitor replica lag in PostgreSQL with SELECT now() - pg_last_xact_replay_timestamp() on the standby, and alert when it exceeds your SLA threshold.
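
The monitoring query mentioned above could be wrapped in a small check like the following. This is a sketch under assumptions: cursor is any Python DB-API cursor connected to the standby, the driver is assumed to map PostgreSQL intervals to datetime.timedelta (psycopg2 does), and the function names and the 5-second threshold are hypothetical.

```python
from datetime import timedelta

LAG_QUERY = "SELECT now() - pg_last_xact_replay_timestamp()"


def replica_lag(cursor):
    """Return the standby's replay lag, or None if the query yields NULL."""
    cursor.execute(LAG_QUERY)
    row = cursor.fetchone()
    # NULL is returned when no transaction has been replayed, for example on a primary.
    return row[0] if row else None


def lag_exceeds(lag, threshold: timedelta) -> bool:
    return lag is not None and lag > threshold


# Example threshold; pick the value from your own SLA.
SLA_THRESHOLD = timedelta(seconds=5)
print(lag_exceeds(timedelta(seconds=12), SLA_THRESHOLD))      # True
print(lag_exceeds(timedelta(milliseconds=40), SLA_THRESHOLD)) # False
print(lag_exceeds(None, SLA_THRESHOLD))                       # False
```

One caveat: this expression measures time since the last replayed transaction, so on a leader that is idle it keeps growing even though the standby is fully caught up.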

Can replication replace backups?

No — replication and backups solve different problems. Replication protects you against hardware failure: if one node dies, another takes over within seconds. But replication copies every write — including accidental DELETE FROM users queries — to every replica almost immediately. A backup preserves a point-in-time snapshot of the data and lets you recover from human error or software bugs. Best practice is to run both: continuous replication for HA and fast failover, plus daily (or more frequent) snapshots and point-in-time recovery (PITR) for data-loss protection.

Replication multiplies your copies of data; it does not multiply your consistency guarantees. Choose your replication mode based on the worst-case lag your users can tolerate — not the average case.
— alokknight Engineering