In distributed systems, keeping data consistent and available across multiple servers is a major challenge. Should all users see the same data at the same time? What happens if a server crashes? How do you handle conflicts?
This blog covers:
✔ Consistency models (Strong, Eventual, Causal).
✔ Replication strategies (Leader-Follower, Multi-Leader, Peer-to-Peer).
✔ Real-world examples (DynamoDB, PostgreSQL, Cassandra).
✔ Trade-offs between consistency, latency, and fault tolerance.
Let’s dive in!
1. Why Replicate Data?
Replication stores copies of data on multiple servers to:
✅ Improve fault tolerance (if one server fails, others take over).
✅ Reduce latency (users read from the nearest replica).
✅ Increase read throughput (load balancing across replicas).
Example:
- Amazon replicates product catalogs globally so users get fast responses.
2. Data Consistency Models
A. Strong Consistency
- Guarantee: All reads return the latest write.
- Use Case: Banking, e-commerce (no stale data allowed).
- Downside: Higher latency (each write waits for replicas to acknowledge it), and writes can stall if those replicas are unreachable.
Example:
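-- PostgreSQL with synchronous replication enabled
-- (synchronous_standby_names set on the primary, synchronous_commit = on):
UPDATE accounts SET balance = 100 WHERE user_id = 1;
-- The commit returns only after the configured synchronous standbys confirm it;
-- asynchronous standbys are not waited for.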
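To make the blocking behavior concrete, here is a minimal in-memory sketch of a synchronous write. The `Replica` class and `synchronous_write` function are hypothetical names invented for illustration; a real database does this over the network with timeouts and retries.

```python
class Replica:
    """An in-memory stand-in for one database replica (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledge the write


def synchronous_write(leader, sync_replicas, key, value):
    """Apply a write on the leader, then require an ack from every synchronous replica."""
    leader.apply(key, value)
    acks = [replica.apply(key, value) for replica in sync_replicas]
    if not all(acks):
        raise RuntimeError("write not acknowledged by every synchronous replica")
    return True


leader = Replica("leader")
followers = [Replica("follower-1"), Replica("follower-2")]
synchronous_write(leader, followers, "balance:1", 100)
print([f.data["balance:1"] for f in followers])  # [100, 100]
```

The caller only gets control back once every synchronous replica holds the new value, which is exactly where the extra latency comes from.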
B. Eventual Consistency
- Guarantee: If writes stop, all replicas eventually converge to the same value; until then, reads may return older data.
- Use Case: Social media, comments (temporary mismatches are okay).
- Downside: Users may see stale data temporarily.
Example:
- DynamoDB reads are eventually consistent by default (strongly consistent reads are available by setting ConsistentRead on the request).
- If two users like a post at the same time, counts may briefly differ before syncing.
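The "like" scenario can be simulated in a few lines. This is a toy sketch, not how any particular database is implemented: `EventuallyConsistentCounter`, its `pending` queue, and the `sync` method are all assumptions made for the example.

```python
from collections import deque


class EventuallyConsistentCounter:
    """A like counter replicated asynchronously across several replicas."""

    def __init__(self, replica_count):
        self.replicas = [0] * replica_count
        self.pending = deque()  # queued (target_replica, delta) updates

    def like(self, replica_id):
        # Apply locally right away, then queue the update for the other replicas.
        self.replicas[replica_id] += 1
        for other in range(len(self.replicas)):
            if other != replica_id:
                self.pending.append((other, 1))

    def read(self, replica_id):
        return self.replicas[replica_id]

    def sync(self):
        # Deliver every queued update; afterwards all replicas agree.
        while self.pending:
            target, delta = self.pending.popleft()
            self.replicas[target] += delta


likes = EventuallyConsistentCounter(replica_count=3)
likes.like(0)  # user A hits replica 0
likes.like(1)  # user B hits replica 1 at the same time
before = [likes.read(i) for i in range(3)]
likes.sync()
after = [likes.read(i) for i in range(3)]
print(before, after)  # [1, 1, 0] [2, 2, 2]
```

Before `sync` runs, each replica reports a different count; once the queued updates are delivered, they all converge on 2.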
C. Causal Consistency
- Guarantee: Preserves cause-and-effect order (if event A leads to event B, all replicas see A before B).
- Use Case: Chat apps, collaborative editing.
Example:
- In a chat app, a reply never appears before the message it answers, even when replicas receive the two in different orders.
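One simple way to picture causal ordering is an inbox that holds a message back until the message it depends on has been delivered. The sketch below assumes each message carries at most one explicit dependency; `CausalInbox` and `depends_on` are hypothetical names, and real systems usually track dependencies with vector clocks instead.

```python
class CausalInbox:
    """Delivers messages only after the message they depend on has been delivered."""

    def __init__(self):
        self.delivered = []  # message ids, in delivery order
        self.waiting = []    # (msg_id, depends_on) pairs not yet deliverable

    def receive(self, msg_id, depends_on=None):
        self.waiting.append((msg_id, depends_on))
        self._deliver_ready()

    def _deliver_ready(self):
        # Keep scanning until a full pass delivers nothing new.
        progressed = True
        while progressed:
            progressed = False
            for item in list(self.waiting):
                msg_id, depends_on = item
                if depends_on is None or depends_on in self.delivered:
                    self.delivered.append(msg_id)
                    self.waiting.remove(item)
                    progressed = True


inbox = CausalInbox()
inbox.receive("reply", depends_on="question")  # arrives first, so it is buffered
inbox.receive("question")                      # unblocks the buffered reply
print(inbox.delivered)  # ['question', 'reply']
```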
3. Replication Strategies
A. Leader-Follower (Primary-Replica)
- How it works:
  - The leader handles all writes; followers replicate its changes.
  - Reads can go to followers (but may be stale).
- Consistency: Strong when reads go to the leader (or replication is synchronous); eventual when reading from asynchronous followers.
- Used by: PostgreSQL, MySQL, MongoDB.
Trade-offs:
✔ Simple to implement.
❌ Leader failure requires failover.
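Here is a rough sketch of asynchronous leader-follower replication using a shared write log. The `Leader` and `Follower` classes and the `catch_up` and `lag` methods are illustrative assumptions, not the API of any real database.

```python
class Leader:
    """Accepts writes and records them in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) writes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Follower:
    """Replays the leader's log; may lag behind between catch-ups."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, leader):
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

    def lag(self, leader):
        # Replication lag measured in unapplied log entries.
        return len(leader.log) - self.applied


leader = Leader()
follower = Follower()
leader.write("balance", 100)
follower.catch_up(leader)
leader.write("balance", 80)         # not yet replicated
stale_read = follower.data["balance"]
lag_before = follower.lag(leader)
follower.catch_up(leader)
fresh_read = follower.data["balance"]
print(stale_read, lag_before, fresh_read)  # 100 1 80
```

The stale read of 100 is the price of reading from an asynchronous follower; the lag value is the kind of number worth monitoring in production.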
B. Multi-Leader (Multi-Primary)
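- How it works:
  - Multiple servers accept writes and replicate to each other (conflicts must be resolved later).
- Consistency: Eventual (conflicts possible).
- Used by: MySQL Group Replication (multi-primary mode), CouchDB. Note that Google Spanner is not multi-leader in this sense: it provides strong consistency through consensus, with one leader per data split.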
Trade-offs:
✔ Higher availability (no single point of failure).
❌ Conflict resolution is hard.
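To see why conflict resolution is hard, consider two leaders that accept different writes to the same key. The sketch below resolves the conflict with last-write-wins, one common but lossy choice; the `Write` dataclass, the region names, and the timestamps are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float
    leader_id: str


def last_write_wins(a, b):
    """Keep the write with the larger (timestamp, leader_id) pair; the other is discarded."""
    return max(a, b, key=lambda w: (w.timestamp, w.leader_id))


us = Write("title: Draft v2", timestamp=100.0, leader_id="us-east")
eu = Write("title: Final", timestamp=100.5, leader_id="eu-west")

winner = last_write_wins(us, eu)
print(winner.value)  # title: Final
```

Both leaders reach the same answer regardless of the order in which they compare the writes, but the "Draft v2" edit is silently lost. The leader id is only a tie-breaker for identical timestamps.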
C. Peer-to-Peer (No Leader)
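- How it works:
  - Any node can accept reads and writes (e.g., Amazon's original Dynamo design, Riak).
- Consistency: Tunable through read and write quorum sizes.
- Used by: Cassandra, Riak, and the original Dynamo design (the managed DynamoDB service instead uses a leader per partition).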
Trade-offs:
✔ Extremely fault-tolerant.
❌ Complex to manage.
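Leaderless systems typically tune consistency with quorums: with N replicas, a write waits for W acknowledgements and a read queries R replicas. If W + R > N, every read set overlaps every write set. The sketch below illustrates that rule; `quorums_overlap` and `quorum_read` are hypothetical helper names, and versions are plain integers for simplicity.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n


def quorum_read(responses):
    """Given (version, value) pairs from the queried replicas, return the newest value."""
    version, value = max(responses, key=lambda pair: pair[0])
    return value


# N = 3 replicas; the latest write (version 2) reached W = 2 of them.
replicas = [(2, "new"), (2, "new"), (1, "old")]

print(quorums_overlap(n=3, w=2, r=2))  # True
# Any R = 2 subset contains at least one replica holding version 2.
print({quorum_read(subset) for subset in combinations(replicas, 2)})  # {'new'}
```

With R = 1 the overlap guarantee disappears (2 + 1 is not greater than 3), so a read that happens to hit the third replica would return the old value.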
4. Conflict Resolution in Replication
When two servers modify the same data simultaneously:
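Strategy | How It Works
Last Write Wins (LWW) | Keeps the write with the latest timestamp and discards the others (risks data loss).
Version Vectors | Track per-replica counters to detect concurrent writes so they can be merged.
CRDTs (Conflict-free Replicated Data Types) | Data structures whose merges always converge without coordination (e.g., counters, sets).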
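Version vectors are easier to grasp with a comparison function. In this minimal sketch a version vector is a plain dict mapping a replica name to a counter; the `compare` function and the replica names "A" and "B" are assumptions for illustration.

```python
def compare(a, b):
    """Compare two version vectors (dicts of replica -> counter).

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither saw the other's update: a real conflict
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


print(compare({"A": 2, "B": 1}, {"A": 1, "B": 1}))  # after
print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))  # concurrent
```

Only the "concurrent" case needs merging; in the other cases one version simply supersedes the other, so nothing is lost by keeping the newer one.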
Example:
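- Riak ships CRDT-based data types (counters, sets, maps). Google Docs, by contrast, relies on Operational Transformation (OT) rather than CRDTs for real-time collaborative editing.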
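Counters are among the simplest CRDTs. Below is a minimal sketch of a grow-only counter (G-Counter): each node increments only its own slot, and merging takes the per-node maximum. The class and node names are illustrative, not taken from any specific library.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged with element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount=1):
        # A node only ever increases its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Taking the max per slot makes merging commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


a = GCounter("node-a")
b = GCounter("node-b")
a.increment()        # one like recorded on node-a
b.increment(2)       # two likes recorded on node-b
a.merge(b)
b.merge(a)
b.merge(a)           # merging twice changes nothing
print(a.value(), b.value())  # 3 3
```

Because merges can arrive in any order and any number of times without changing the result, replicas converge without ever needing to pick a winner.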
5. Real-World Examples
A. Amazon DynamoDB
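- Replication: Each partition is replicated across multiple Availability Zones within a region; reads are eventually consistent by default, with strongly consistent reads as an option.
- Conflict Handling: LWW (Last Write Wins) for concurrent writes to the same item across regions in global tables.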
B. PostgreSQL
- Replication: Leader-follower with synchronous option.
- Consistency: Strong (if sync), eventual (if async).
C. Apache Cassandra
- Replication: Leaderless (peer-to-peer) + tunable consistency.
- Trade-off: Pick a consistency level per query (e.g., ONE, QUORUM, ALL), trading latency and availability for stronger guarantees.
6. Best Practices
✅ Use strong consistency for financial data.
✅ Favor eventual consistency for high scalability (e.g., social media).
✅ Monitor replication lag (stale reads = unhappy users).
✅ Test failover scenarios (what happens if a leader crashes?).
7. What’s Next?
- Deep Dive: How Google Spanner Achieves Strong Consistency Globally.
- Case Study: How Twitter Handles Timeline Replication.
- Comparison: MongoDB Replica Sets vs. Cassandra Multi-DC.
Which topic should I cover next? Let me know in the comments! 🚀
Final Thoughts
- Strong consistency = Safety but higher latency.
- Eventual consistency = Speed but temporary mismatches.
- Choose replication based on your app’s needs.
Which consistency model does your app use? Share below! 👇