#Database Replication Explained
Database replication is the process of keeping copies of the same data on multiple nodes. It is one of the most fundamental tools for building systems that are fast, resilient, and always on.
Replication solves three problems at once. It scales reads by spreading queries across nodes, improves availability by removing single points of failure, and improves durability so data survives the loss of any one machine.
This guide explains the core patterns, the real tradeoffs, and how to reason about them clearly in an interview.
The Core Concept
In the most common setup, one node is designated the leader (also called the primary or master). All writes go to the leader. The leader then propagates those changes to one or more followers (also called replicas or slaves).
Read traffic can be spread across followers. This offloads the leader and lets you serve many more read queries than a single node could handle.
The terms “primary-replica” and “master-slave” describe the same topology. “Leader-follower” is the modern preferred term and is what most interviewers will expect you to use.
Every write still passes through a single leader. Replication does not help write throughput - it only scales reads. This is a common misconception worth knowing.
Synchronous vs Asynchronous Replication
How the leader sends changes to followers is one of the most important decisions in a replicated system. There are two models.
Synchronous replication means the leader waits for at least one follower to confirm it has written the data before acknowledging the write to the client. The client only sees success after the data is durable in two places.
This is the safer choice. If the leader crashes immediately after a write, a synchronous follower already holds the data. No acknowledged write is ever lost.
The cost is higher write latency, because the leader must wait for a network round-trip to the follower before responding.
There is also an availability risk: if the synchronous follower goes down, the leader must either block writes or switch another follower to synchronous mode. This can stall write traffic under partial failures.
Asynchronous replication means the leader writes to its own log and immediately acknowledges the write to the client. The changes are sent to followers in the background.
This gives lower write latency and keeps the leader independent of follower availability. The risk is replication lag - the gap between what the leader has written and what a follower has applied.
If the leader crashes before a follower has caught up, recently-acknowledged writes can be lost permanently.
Replication lag is the term for this delay, measured in bytes of log or in seconds of time. Even a few seconds of lag can cause stale reads on followers.
A client that just wrote a record and then reads from a follower may not see their own write. This is known as a read-after-write consistency violation.
Tradeoffs
Failover and leader promotion are the operational consequences of replication. When the leader fails, a follower must be promoted to take over.
With synchronous replication, at least one follower is guaranteed to be fully up to date, making promotion safe.
With asynchronous replication, no follower is guaranteed to have all acknowledged writes - choosing which follower to promote requires accepting some potential data loss.
Read-after-write consistency is the most common consistency problem caused by lag. If your application routes reads to followers and writes to the leader, users will sometimes read stale data.
The fix is to route reads to the leader for data the current user just wrote, while sending all other reads to followers.
Multi-leader and leaderless replication are alternatives for workloads that need higher write throughput or cross-region writes. In multi-leader setups, more than one node accepts writes, and the nodes reconcile conflicts.
In leaderless setups (as used by Amazon Dynamo and Apache Cassandra), any node can accept a write and quorum reads/writes are used to maintain consistency. Both approaches trade conflict resolution complexity for higher write availability.
These consistency choices connect directly to the CAP theorem. A synchronous leader-follower setup leans toward CP: it favours consistency over availability when a follower is down.
An asynchronous setup leans toward AP: it stays available at the cost of potential staleness.
How It Comes Up in Interviews
Interviewers will rarely ask you to define replication in isolation. They will ask you to reason about it in the context of a system design problem.
Why would you add read replicas?
Read replicas distribute read traffic across multiple nodes, reducing load on the leader. This is the right move when your read-to-write ratio is high and the leader is the bottleneck for reads, not writes.
PostgreSQL, MySQL, and most hosted databases support this natively.
What is replication lag and why does it matter?
Replication lag is the delay between a write being committed on the leader and that write being applied on a follower. It matters because any read served by a lagging follower may return stale data.
For user-facing features where read-after-write consistency is required, you must route those reads to the leader or use a sticky session strategy.
What happens when the leader fails?
A follower must be promoted to become the new leader. In synchronous setups, at least one follower is fully caught up and can be promoted safely.
In asynchronous setups, the most up-to-date follower is chosen, but any writes that had not yet replicated are lost.
The application must also redirect writes to the new leader’s address, typically via a service discovery layer or a load balancer.
How do you choose between synchronous and asynchronous replication?
The choice depends on your durability and latency requirements. Use synchronous replication when losing any acknowledged write is unacceptable - financial transactions, order records, audit logs.
Use asynchronous replication when write latency matters more than absolute durability and your application can tolerate some potential data loss on failover - analytics pipelines, user activity feeds, or caches.
What’s Next
- Database Sharding and Partitioning - how to split data across nodes when replication alone cannot handle write load.
- Consistency and the CAP Theorem - the theoretical framework behind the consistency guarantees replication makes possible.
- Caching in System Design - how caching layers interact with replicated databases and where each belongs in an architecture.
Frequently Asked Questions
What is database replication?
Database replication is the practice of maintaining identical copies of data on more than one node. One node (the leader) accepts writes and propagates changes to one or more followers.
Followers serve reads and can be promoted to leader if the current leader fails.
What is the difference between replication and sharding?
Replication copies the full dataset to multiple nodes to improve read throughput, availability, and durability. Sharding splits the dataset into smaller partitions spread across nodes to improve write throughput.
They are complementary techniques and are often used together.
What is replication lag?
Replication lag is the delay between a write being committed on the leader and that same write being applied on a follower. It is caused by the time needed to transfer and apply the write log over the network.
Lag is measured in seconds or bytes. High lag means followers serve stale reads. Synchronous replication eliminates lag at the cost of write latency.
Does replication improve write performance?
No. In a leader-follower setup, all writes still pass through a single leader. Adding more followers does not increase write throughput.
Replication scales read capacity and improves durability and availability, but writes remain bottlenecked on the leader. To scale writes you need sharding or multi-leader replication.
Share this article
Continue Learning
Consistency and the CAP Theorem: A System Design Guide
A clear, interview-focused guide to the CAP theorem and consistency models - what it means, the tradeoffs, and how it comes up in system design interviews.
Database Sharding and Partitioning
How database sharding and partitioning scale your data layer - shard keys, hotspots, rebalancing, and the tradeoffs to discuss in a system design interview.
Caching in System Design
An interview-focused guide to caching: cache-aside, write-through, write-back, eviction, TTLs, and cache invalidation tradeoffs.
Message Queues in System Design
Learn how message queues decouple services and absorb load spikes, plus queues vs pub/sub, delivery guarantees, and backpressure tradeoffs.