Distributed Systems¶
A distributed system is a collection of independent computers that appears to its users as a single coherent system. These machines communicate by passing messages over a network and coordinate their actions to achieve a common goal. Unlike a single-node system, a distributed system must contend with partial failures, network partitions, clock skew, and the fundamental impossibility of achieving all desirable properties simultaneously. Understanding distributed systems theory is essential for any engineer building software that spans more than one machine, which today encompasses nearly every production system of meaningful scale.
Distributed systems arise out of necessity: no single machine can provide the storage capacity, computational power, fault tolerance, or geographic proximity required by modern applications. Whether it is a social media platform serving billions of users, a financial trading system requiring sub-millisecond latency, or a healthcare system that must never lose data, the underlying infrastructure is invariably distributed.
CAP Theorem and PACELC¶
CAP Theorem¶
The CAP theorem, formulated by Eric Brewer in 2000 and formally proved by Seth Gilbert and Nancy Lynch in 2002, states that a distributed data store can provide at most two out of three guarantees simultaneously:
- Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
- Availability (A): Every request receives a non-error response, without the guarantee that it contains the most recent write.
- Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes.
Since network partitions are inevitable in any distributed system (you cannot choose to avoid them), the practical choice is between CP (consistency over availability during a partition) and AP (availability over consistency during a partition).
CAP Theorem

                  ┌───────────────────┐
                  │    Consistency    │
                  │        (C)        │
                  └────────┬──────────┘
                          / \
                     CP  /   \  CA (not practical
                        /     \    in distributed systems)
                       /       \
      ┌───────────────┐         ┌───────────────┐
      │   Partition   │         │  Availability │
      │ Tolerance (P) │─────────│      (A)      │
      └───────────────┘   AP    └───────────────┘
CP Systems: HBase, MongoDB (default), etcd, ZooKeeper, Consul
AP Systems: Cassandra, DynamoDB, CouchDB, Riak
CA Systems: Traditional RDBMS (single-node only, not truly distributed)
Important nuances:
- CAP is about behavior during a partition, not during normal operation. When there is no partition, a system can be both consistent and available.
- Consistency in CAP refers specifically to linearizability (the strongest form), not consistency in the ACID sense.
- Availability in CAP means every non-failing node must return a response. A system that returns errors to some clients during a partition is not "available" in the CAP sense.
PACELC Theorem¶
Daniel Abadi proposed PACELC (2012) as a more nuanced extension of CAP. It states:
If there is a Partition, choose between Availability and Consistency; Else (during normal operation), choose between Latency and Consistency.
This captures the reality that even when the network is healthy, there is a fundamental tradeoff between how fast you can respond (latency) and how consistent the data is.
| System | During Partition (PAC) | Normal Operation (ELC) | Classification |
|---|---|---|---|
| DynamoDB | PA (availability) | EL (low latency) | PA/EL |
| Cassandra | PA (availability) | EL (low latency) | PA/EL |
| MongoDB | PC (consistency) | EC (consistency) | PC/EC |
| HBase | PC (consistency) | EC (consistency) | PC/EC |
| Cosmos DB | PA or PC (tunable) | EL or EC (tunable) | PA/EL or PC/EC |
| PNUTS | PC (consistency) | EL (low latency) | PC/EL |
PACELC is more useful in practice because most of the time a system is running without partitions, and the latency vs. consistency tradeoff is the one engineers face daily.
Consistency Models¶
Consistency models define the contract between a distributed data store and its clients regarding what values reads can return relative to writes. They range from strong (easier to reason about, higher cost) to weak (harder to reason about, lower cost).
Strong Consistency (Linearizability)¶
Linearizability is the strongest single-object consistency model. It guarantees that:
- Every operation appears to take effect instantaneously at some point between its invocation and completion.
- All operations are ordered in a way consistent with real-time ordering.
- Every read returns the value of the most recent completed write.
Linearizability makes a distributed system behave as if there is only a single copy of the data. It is expensive to implement because it typically requires coordination (consensus) on every write.
Use cases: Leader election, distributed locks, unique ID generation, financial transactions.
Sequential Consistency¶
Weaker than linearizability. All operations appear to execute in some sequential order, and the operations of each individual process appear in the order specified by its program. However, there is no guarantee that this order matches real-time ordering.
The key difference from linearizability: if process A completes a write before process B starts a read, sequential consistency does not guarantee that B sees A's write. Linearizability does.
Causal Consistency¶
Operations that are causally related must be seen by all nodes in the same order. Concurrent operations (those with no causal relationship) may be seen in different orders by different nodes.
Process A: write(x=1) ─────────────────────────────▶ write(x=3)
               │                                         │
               │ causally related                        │
               ▼                                         ▼
Process B: read(x) → 1 ──▶ write(y=2)                read(x) → 3
                               │
                               ▼
Process C: must see x=1 before x=3, must see y=2 after x=1;
           but y=2 and x=3 are concurrent, so the order between them is flexible
Causal consistency is weaker than sequential consistency but stronger than eventual consistency. It is achievable without sacrificing availability during partitions (unlike linearizability).
Eventual Consistency¶
The weakest useful guarantee: if no new updates are made to a given data item, eventually all reads will return the last updated value. There is no bound on how long "eventually" takes.
Most AP systems provide eventual consistency. It requires conflict resolution strategies (last-writer-wins, vector clocks, CRDTs) when concurrent writes occur.
Read-Your-Writes (Session) Consistency¶
A practical middle ground: a client will always see its own writes. Other clients may see stale data. Often implemented by routing a user's requests to the same replica or by tracking write timestamps per session.
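The write-timestamp approach can be sketched with a toy model, assuming a hypothetical `SessionStore` with a single primary (replica 0) and asynchronously updated followers; real systems track versions per key or per shard:

```python
class SessionStore:
    """Sketch of read-your-writes: a session remembers the version of its
    last write; a read is served by a follower only if it has caught up,
    otherwise it falls back to the primary (replica 0)."""

    def __init__(self, num_replicas: int = 3):
        self.version = 0                                # monotonic write counter
        self.replicas = [dict() for _ in range(num_replicas)]

    def write(self, key, value) -> int:
        self.version += 1
        self.replicas[0][key] = (value, self.version)   # primary applies first
        return self.version                             # session stores this

    def replicate(self, replica_id: int) -> None:
        """Asynchronously copy the primary's state to a follower."""
        self.replicas[replica_id] = dict(self.replicas[0])

    def read(self, key, session_version: int, replica_id: int):
        value, version = self.replicas[replica_id].get(key, (None, 0))
        if version >= session_version:
            return value
        return self.replicas[0][key][0]                 # follower stale: go to primary
```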
Comparison Table¶
| Model | Ordering Guarantee | Availability | Latency | Complexity | Example Systems |
|---|---|---|---|---|---|
| Linearizability | Real-time total order | Low | High | High | etcd, ZooKeeper, Spanner |
| Sequential | Total order (not real-time) | Medium | Medium | Medium | ZooKeeper (some operations) |
| Causal | Causal order preserved | High | Low | Medium | COPS, MongoDB (causal reads) |
| Read-Your-Writes | Client sees own writes | High | Low | Low | DynamoDB (session), Cassandra |
| Eventual | No ordering; convergence guaranteed | Highest | Lowest | Lowest | Cassandra, DynamoDB, CouchDB |
Consensus Algorithms¶
Consensus is the problem of getting multiple nodes to agree on a single value. It is fundamental to implementing linearizable storage, leader election, atomic broadcast, and distributed transactions. The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one process may crash. Practical algorithms circumvent this by using timeouts and randomization.
Paxos¶
Paxos, invented by Leslie Lamport, was the first proven-correct consensus algorithm. It operates in two phases:
- Prepare phase: A proposer sends a prepare request with a proposal number `n` to a majority of acceptors. Each acceptor promises not to accept proposals numbered less than `n` and returns any previously accepted value.
- Accept phase: If the proposer receives promises from a majority, it sends an accept request with `n` and a value (either the highest-numbered previously accepted value, or its own if none exists). Acceptors accept if they have not promised a higher number.
Paxos is notoriously difficult to understand and implement. Multi-Paxos extends it for a sequence of values (a replicated log) but is underspecified, leading to many incompatible implementations.
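The two phases can be sketched from the acceptor's side; this is a single-decree sketch with a hypothetical `Acceptor` class, omitting the network layer, proposer retries, and durable state:

```python
class Acceptor:
    """Sketch of a single-decree Paxos acceptor (state machine only)."""

    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered < n."""
        if n > self.promised_n:
            self.promised_n = n
            # return any previously accepted (n, value) so the proposer
            # must adopt the highest-numbered accepted value
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False
```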
Raft¶
Raft was designed by Diego Ongaro and John Ousterhout explicitly to be more understandable than Paxos while providing the same guarantees. It decomposes consensus into three sub-problems: leader election, log replication, and safety.
Raft Node States¶
Every node in a Raft cluster is in one of three states:
                 times out,                     receives votes
               starts election                  from majority
 ┌──────────┐ ────────────────▶ ┌───────────┐ ────────────────▶ ┌──────────┐
 │ Follower │                   │ Candidate │                   │  Leader  │
 └──────────┘ ◀──────────────── └───────────┘                   └──────────┘
       ▲        discovers current                                     │
       │        leader or new term                                    │
       └──────────────────────────────────────────────────────────────┘
                       discovers node with higher term
Leader Election¶
- All nodes start as followers. Each follower has a randomized election timeout (e.g., 150-300ms).
- If a follower receives no heartbeat from a leader within its timeout, it transitions to candidate, increments its term number, and votes for itself.
- The candidate sends `RequestVote` RPCs to all other nodes.
- A node grants its vote if: (a) it has not voted in this term, and (b) the candidate's log is at least as up-to-date as the voter's log.
- If the candidate receives votes from a majority, it becomes the leader and immediately sends heartbeats to establish authority.
- If another candidate wins (the node receives a heartbeat from a leader with a term >= its own), it reverts to follower.
- If no one wins (split vote), a new election starts after another randomized timeout.
Leader Election Timeline (5-node cluster)
──────────────────────────────────────────────────────────────────
Node A (Follower): ──────[timeout]──▶ Candidate (term=2)
│ RequestVote ──▶ B ✓
│ RequestVote ──▶ C ✓
│ RequestVote ──▶ D ✗ (voted for E)
│ RequestVote ──▶ E ✗
│
│ 3 votes (A,B,C) = majority
▼
Leader (term=2)
│
│ Heartbeat ──▶ B,C,D,E
▼
All others: Follower (term=2)
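The vote-granting rule above can be sketched as follows (hypothetical `grant_vote` helper; persistence and RPC plumbing omitted; the voter is modeled as a plain dict):

```python
def grant_vote(voter, candidate_term, candidate_id,
               candidate_last_log_term, candidate_last_log_index):
    """Sketch of Raft's RequestVote decision. voter holds current_term,
    voted_for, and log as a list of (term, command) entries."""
    if candidate_term < voter["current_term"]:
        return False                       # stale candidate
    if candidate_term > voter["current_term"]:
        voter["current_term"] = candidate_term
        voter["voted_for"] = None          # new term: vote is free again
    # (a) at most one vote per term
    if voter["voted_for"] not in (None, candidate_id):
        return False
    # (b) candidate's log must be at least as up-to-date:
    # compare last log term first, then log length
    last_term = voter["log"][-1][0] if voter["log"] else 0
    last_index = len(voter["log"])
    up_to_date = (candidate_last_log_term,
                  candidate_last_log_index) >= (last_term, last_index)
    if up_to_date:
        voter["voted_for"] = candidate_id
    return up_to_date
```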
Log Replication¶
Once a leader is elected, it handles all client requests:
- The client sends a command to the leader.
- The leader appends the command to its log as a new entry (with the current term number).
- The leader sends `AppendEntries` RPCs to all followers in parallel.
- Each follower appends the entry to its own log and responds with success.
- Once a majority of nodes have replicated the entry, the leader commits it (advances the commit index).
- The leader applies the committed entry to its state machine and responds to the client.
- Followers learn about committed entries via subsequent `AppendEntries` RPCs and apply them to their own state machines.
Log Replication (3-node cluster)
Client ──▶ Leader
│
│ AppendEntries(entry=[SET x=5, term=3, index=7])
├──────────────────────▶ Follower 1: appends, ACK ✓
├──────────────────────▶ Follower 2: appends, ACK ✓
│
│ Majority replicated (2/3 required; here 3/3 acked)
▼
Committed (index 7)
│
│ Apply to state machine
▼
Response to client: OK
Leader Log: [1:SET a=1] [1:SET b=2] [2:SET a=3] [3:SET x=5]
Follower 1: [1:SET a=1] [1:SET b=2] [2:SET a=3] [3:SET x=5]
Follower 2: [1:SET a=1] [1:SET b=2] [2:SET a=3] [3:SET x=5]
term:cmd term:cmd term:cmd term:cmd
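The leader's commit rule can be sketched as a pure function: advance the commit index to the highest log index replicated on a majority, restricted to entries from the current term (the extra term check is part of Raft's safety argument):

```python
def commit_index(match_index, current_term, log):
    """Sketch of the leader's commit rule. match_index lists the highest
    replicated index per server (leader included); log is a list of
    (term, command) entries, 1-indexed by position."""
    n = len(match_index)
    for idx in range(len(log), 0, -1):            # try highest index first
        replicated = sum(1 for m in match_index if m >= idx)
        # commit only entries from the leader's own term
        if replicated > n // 2 and log[idx - 1][0] == current_term:
            return idx
    return 0
```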
Safety Guarantees¶
Raft ensures that:
- Election safety: At most one leader per term.
- Leader append-only: A leader never overwrites or deletes entries in its log.
- Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical.
- Leader completeness: If an entry is committed in a given term, it will be present in the logs of all leaders for higher terms.
- State machine safety: If a node applies an entry at a given index, no other node will apply a different entry at the same index.
Distributed Transactions¶
A distributed transaction spans multiple nodes or services and must maintain atomicity: either all participants commit, or all abort. This is significantly harder than local transactions because any participant or the network can fail at any point.
Two-Phase Commit (2PC)¶
2PC is the classic protocol for atomic commitment across distributed nodes. It uses a designated coordinator and multiple participants.
Phase 1 — Prepare (Voting):
- The coordinator sends a `PREPARE` message to all participants.
- Each participant executes the transaction up to the point of committing, acquires locks, writes a prepare record to its write-ahead log, and responds with `VOTE_COMMIT` or `VOTE_ABORT`.
Phase 2 — Commit/Abort (Decision):
- If all participants voted `COMMIT`, the coordinator writes a commit record to its log and sends `COMMIT` to all participants.
- If any participant voted `ABORT`, the coordinator sends `ABORT` to all participants.
- Each participant executes the decision (commit or rollback), releases locks, and acknowledges.
Two-Phase Commit Protocol
Coordinator Participant A Participant B
│ │ │
│──── PREPARE ───────────────▶│ │
│──── PREPARE ────────────────┼───────────────────────▶│
│ │ │
│ (execute tx, │ (execute tx, │
│ acquire locks, │ acquire locks, │
│ write WAL) │ write WAL) │
│ │ │
│◀─── VOTE_COMMIT ───────────│ │
│◀─── VOTE_COMMIT ───────────┼────────────────────────│
│ │ │
│ [All voted COMMIT] │ │
│ Write COMMIT to log │ │
│ │ │
│──── COMMIT ────────────────▶│ │
│──── COMMIT ─────────────────┼───────────────────────▶│
│ │ │
│ (apply changes, │ (apply changes, │
│ release locks) │ release locks) │
│ │ │
│◀─── ACK ───────────────────│ │
│◀─── ACK ───────────────────┼────────────────────────│
▼ ▼ ▼
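The coordinator's side of the protocol can be sketched as follows. The `prepare`/`commit`/`abort` participant interface is hypothetical, and durable logging, timeouts, and crash recovery are omitted:

```python
def run_two_phase_commit(participants):
    """Sketch of the 2PC coordinator. Each participant exposes
    prepare() -> bool, commit(), and abort()."""
    votes = [p.prepare() for p in participants]       # Phase 1: collect votes
    decision = "COMMIT" if all(votes) else "ABORT"
    # (a durable log write of the decision would go here, before Phase 2)
    for p in participants:                            # Phase 2: broadcast decision
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

class FakeParticipant:
    """In-memory stand-in participant for the sketch above."""
    def __init__(self, vote: bool):
        self.vote, self.state = vote, "prepared"
    def prepare(self) -> bool:
        return self.vote
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"
```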
Problems with 2PC:
- Blocking: If the coordinator crashes after sending `PREPARE` but before sending the decision, participants are stuck holding locks indefinitely. They cannot safely commit or abort because they do not know the coordinator's decision.
- Single point of failure: The coordinator is a bottleneck and a single point of failure.
- Performance: Requires at least 2 round trips and forces synchronous disk writes at every step.
Three-Phase Commit (3PC)¶
3PC adds a PRE_COMMIT phase between prepare and commit to reduce the blocking window. After receiving all votes, the coordinator first sends PRE_COMMIT, and only after receiving acknowledgments does it send COMMIT. This allows participants to make a unilateral decision if the coordinator fails after PRE_COMMIT (they can safely commit). However, 3PC assumes a network with bounded delays and is not partition-tolerant. In practice, it is rarely used; systems prefer 2PC with coordinator recovery or Paxos-based commit protocols.
Saga Pattern¶
The Saga pattern is an alternative to distributed transactions for long-lived business processes across microservices. Instead of locking resources across services, a saga breaks the transaction into a sequence of local transactions, each with a compensating transaction that undoes its effects if a later step fails.
Choreography-Based Saga¶
Each service listens for events and decides locally whether to execute its step or a compensating action. There is no central coordinator.
Choreography Saga: Order Placement
Order Service Payment Service Inventory Service Shipping Service
│ │ │ │
│ create order │ │ │
│──── OrderCreated ────▶│ │ │
│ │ charge payment │ │
│ │──── PaymentCharged ──▶│ │
│ │ │ reserve stock │
│ │ │──── StockReserved ──▶│
│ │ │ │ schedule
│ │ │ │ shipping
│ │ │ │
│◀───────────────────── OrderConfirmed ─────────┼──────────────────────│
▼ ▼ ▼ ▼
Compensation (if Inventory fails):
Order Service Payment Service Inventory Service
│ │ │
│ │ │ stock unavailable
│ │◀── StockFailed ──────│
│ │ refund payment │
│◀── PaymentRefunded ──│ │
│ cancel order │ │
▼ ▼ ▼
Pros: No single point of failure, loosely coupled, each service is autonomous. Cons: Difficult to track overall saga state, hard to debug, risk of cyclic dependencies, eventually consistent.
Orchestration-Based Saga¶
A central saga orchestrator coordinates the sequence of steps, telling each service what to do and handling failures by invoking compensating transactions.
Orchestration Saga: Order Placement
Saga Orchestrator
│
│──── CreateOrder ──────────────▶ Order Service
│◀─── OrderCreated ─────────────│
│
│──── ChargePayment ────────────▶ Payment Service
│◀─── PaymentCharged ───────────│
│
│──── ReserveStock ─────────────▶ Inventory Service
│◀─── StockReserved ────────────│
│
│──── ScheduleShipping ─────────▶ Shipping Service
│◀─── ShippingScheduled ────────│
│
▼ Saga Complete
Compensation (if ReserveStock fails):
Saga Orchestrator
│
│◀─── StockFailed ─────────────── Inventory Service
│
│──── RefundPayment ────────────▶ Payment Service
│◀─── PaymentRefunded ──────────│
│
│──── CancelOrder ──────────────▶ Order Service
│◀─── OrderCancelled ───────────│
│
▼ Saga Rolled Back
Pros: Clear control flow, easier to manage and debug, centralized error handling. Cons: Orchestrator is a single point of failure (must be made resilient), tighter coupling to the orchestrator.
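A minimal orchestrator can be sketched as a loop over (action, compensation) pairs. This is illustrative only; a production saga framework would also persist saga state and retry failed compensations:

```python
def run_saga(steps):
    """Sketch of a saga orchestrator. steps is a list of
    (action, compensation) callables. On failure, already-completed
    steps are compensated in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):   # roll back completed steps
                comp()
            return "rolled_back"
    return "complete"
```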
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized (event-driven) | Centralized (orchestrator) |
| Coupling | Loose | Medium (services depend on orchestrator) |
| Complexity | Grows with number of services | Centralized; easier to reason about |
| Failure handling | Each service handles its own | Orchestrator manages compensations |
| Visibility | Hard to trace end-to-end | Clear; orchestrator has full view |
| Best for | Simple sagas, few steps | Complex sagas, many steps |
Replication Strategies¶
Replication is the process of maintaining copies of data on multiple nodes to improve fault tolerance, read throughput, and geographic proximity. The central challenge is keeping replicas consistent while maintaining performance.
Single-Leader (Primary-Secondary) Replication¶
One node is designated the leader (primary). All writes go to the leader, which replicates changes to followers (secondaries). Reads can be served by the leader (strong consistency) or followers (eventual consistency with potential staleness).
Client Writes ──▶ Leader ──▶ Follower 1
                    │──────▶ Follower 2
                    │──────▶ Follower 3
Client Reads  ──▶ Leader (strong) or any Follower (eventual)
Replication can be synchronous or asynchronous:
- Synchronous: The leader waits for at least one follower to confirm before acknowledging the write. Guarantees no data loss if the leader fails, but increases write latency.
- Asynchronous: The leader acknowledges immediately after its own write. Faster, but risks data loss if the leader fails before replication completes.
- Semi-synchronous: One follower is synchronous; the rest are asynchronous. Balances durability and performance.
Multi-Leader (Active-Active) Replication¶
Multiple nodes accept writes. Each leader replicates its changes to all other leaders. This is useful for multi-datacenter deployments where you want low-latency writes in each region.
Datacenter A Datacenter B Datacenter C
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Leader A │◀────────────▶│ Leader B │◀────────────▶│ Leader C │
└──────────┘ └──────────┘ └──────────┘
│ │ │
┌──────────┐ ┌──────────┐ ┌──────────┐
│Follower │ │Follower │ │Follower │
└──────────┘ └──────────┘ └──────────┘
The main challenge is write conflicts: two leaders may concurrently modify the same data. Conflict resolution strategies include:
- Last-Writer-Wins (LWW): Use timestamps; the latest write wins. Simple but can lose data.
- Custom resolution logic: Application-level merging (e.g., merging shopping carts).
- CRDTs (Conflict-free Replicated Data Types): Data structures that are mathematically guaranteed to converge without conflicts (counters, sets, registers).
CRDTs (Conflict-free Replicated Data Types)¶
CRDTs are data structures that can be replicated across multiple nodes and merged without coordination, guaranteeing eventual consistency and no conflicts. They achieve this through two key properties: commutativity (merge order does not matter) and idempotence (merging with self yields no change). Common variants include:
- G-Counter (Growing Counter): Each replica maintains a count per node; the merged value is the sum of all counts. Only increments are allowed, so concurrent updates never conflict.
- LWW-Register (Last-Writer-Wins Register): A register paired with a timestamp; the value with the highest timestamp wins. Used when overwrites are acceptable.
- OR-Set (Observed-Remove Set): A set that supports add and remove; removes are tracked so that concurrent add/remove converge correctly.
Example (Python): a simple G-Counter and merge:
```python
from typing import Dict

class GCounter:
    """A grow-only counter CRDT. Each replica has a local count; merge sums all."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}  # replica_id -> count

    def inc(self, replica_id: str, delta: int = 1) -> None:
        self._counts[replica_id] = self._counts.get(replica_id, 0) + delta

    def value(self) -> int:
        return sum(self._counts.values())

    def merge(self, other: "GCounter") -> "GCounter":
        """Merge another G-Counter into this one (commutative, idempotent)."""
        result = GCounter()
        all_replicas = set(self._counts) | set(other._counts)
        result._counts = {r: max(self._counts.get(r, 0), other._counts.get(r, 0))
                          for r in all_replicas}
        return result

# Replica A and B each increment; after merge, total is consistent
a, b = GCounter(), GCounter()
a.inc("A", 3)
b.inc("B", 2)
merged = a.merge(b)
assert merged.value() == 5
```
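In the same spirit, the LWW-Register described above can be sketched with a (timestamp, replica_id) pair as the write stamp, so timestamp ties are broken deterministically by replica id (illustrative; timestamps are supplied by the caller here):

```python
class LWWRegister:
    """Sketch of a Last-Writer-Wins register CRDT."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.value = None
        self.stamp = (0.0, "")          # (timestamp, writer_id)

    def set(self, value, timestamp: float) -> None:
        stamp = (timestamp, self.replica_id)
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

    def merge(self, other: "LWWRegister") -> None:
        """Commutative and idempotent: keep the higher-stamped value."""
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp
```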
For distributed databases and multi-region replication, see Databases — Sharding, Replication, and Partitioning.
Leaderless Replication¶
No single node is designated as leader. Any replica can accept reads and writes. The client sends writes to multiple replicas simultaneously and reads from multiple replicas, using quorum rules to determine the correct value.
Quorum reads and writes: With n replicas, a write must be acknowledged by w nodes and a read must query r nodes. As long as w + r > n, the read is guaranteed to overlap with at least one node that has the latest write.
Common configuration: n=3, w=2, r=2 — any two nodes must agree.
Leaderless Write (n=3, w=2)
Client ──▶ Node A: write(x=5) ✓
──▶ Node B: write(x=5) ✓ (w=2 met, acknowledge to client)
──▶ Node C: write(x=5) ✗ (node unreachable)
Leaderless Read (n=3, r=2)
Client ──▶ Node A: read(x) → 5
──▶ Node B: read(x) → 5 (r=2 met, return x=5)
──▶ Node C: (not queried or returns stale value)
Anti-entropy and read repair are used to bring stale replicas up to date. During a read, if a client detects a stale value from one replica, it can write the latest value back to that replica (read repair).
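A read-repair quorum read can be sketched as follows, modeling each replica as a dict of key → (value, version); the integer version stands in for whatever mechanism the store uses to order writes:

```python
def quorum_read(replicas, key, r):
    """Sketch of a quorum read with read repair: query r replicas, take
    the value with the highest version, and write it back to any stale
    replica that was queried."""
    queried = replicas[:r]                           # pick r replicas
    responses = [rep.get(key, (None, 0)) for rep in queried]
    value, version = max(responses, key=lambda vv: vv[1])
    for rep, (_, v) in zip(queried, responses):
        if v < version:
            rep[key] = (value, version)              # read repair
    return value
```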
Comparison Table¶
| Strategy | Write Latency | Read Latency | Consistency | Conflict Handling | Fault Tolerance | Use Cases |
|---|---|---|---|---|---|---|
| Single-leader | Medium | Low (follower) | Strong or eventual | No conflicts | Leader failure = downtime until failover | Most OLTP databases |
| Multi-leader | Low (local) | Low (local) | Eventual | Required (LWW, CRDT) | Tolerates datacenter failure | Multi-region deployments |
| Leaderless | Low | Medium | Tunable (quorum) | Required (LWW, CRDT) | Tolerates individual node failures | Dynamo-style stores (Cassandra, Riak) |
Partitioning/Sharding Strategies¶
When a dataset is too large for a single node, it must be split across multiple nodes. Each piece is called a partition (or shard). Good partitioning distributes data and load evenly while minimizing cross-partition queries. For database-level partitioning, replication, and WAL/MVCC internals, see Databases — Sharding, Replication, and Partitioning.
Range Partitioning¶
Data is divided into contiguous ranges based on the partition key. For example, users with last names A-F on partition 1, G-L on partition 2, and so on.
Pros: Efficient range queries (e.g., "all users from A to C"), data locality for sequential access. Cons: Prone to hotspots if data is not uniformly distributed (e.g., more users with names starting with S than X).
Hash Partitioning¶
A hash function is applied to the partition key, and the hash value determines the partition. This distributes data uniformly regardless of the key distribution.
partition = hash(key) mod num_partitions
Example (4 partitions):
hash("user_123") = 2847193 → 2847193 mod 4 = 1 → Partition 1
hash("user_456") = 9381042 → 9381042 mod 4 = 2 → Partition 2
hash("user_789") = 5720318 → 5720318 mod 4 = 2 → Partition 2
Pros: Even distribution, avoids hotspots. Cons: Range queries are impossible (adjacent keys hash to different partitions). Adding or removing partitions requires remapping most keys (unless consistent hashing is used).
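The rebalancing cost of mod-based hashing is easy to demonstrate: changing the partition count remaps most keys. md5 is used here only to make the demo deterministic across runs:

```python
import hashlib

def partition(key: str, num_partitions: int) -> int:
    """Mod-based hash partitioning with a stable (deterministic) hash."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_partitions

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(partition(k, 4) != partition(k, 5) for k in keys)
# going from 4 to 5 partitions remaps roughly 4/5 of all keys
print(f"{moved / len(keys):.0%} of keys moved")
```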
Consistent Hashing¶
Consistent hashing solves the problem of adding or removing nodes without remapping the entire dataset. Nodes and keys are both hashed onto a circular ring (0 to 2^32-1). A key is assigned to the first node encountered going clockwise from the key's position on the ring.
Consistent Hashing Ring (positions 0 to 2^32-1, clockwise)

Ring, in clockwise order:
  Key K1 (hash=50) → Node A (hash=90) → Key K2 (hash=150) → Node B (hash=200)
  → Key K3 (hash=300) → Node C (hash=350) → wraps back to 0

K1 (hash=50)  → next node clockwise → Node A (hash=90)
K2 (hash=150) → next node clockwise → Node B (hash=200)
K3 (hash=300) → next node clockwise → Node C (hash=350)
When a node is added: Only keys between the new node and its predecessor on the ring need to be remapped. All other assignments remain unchanged.
When a node is removed: Only its keys are redistributed to the next node clockwise.
Virtual nodes: To avoid uneven distribution (since hash values may cluster), each physical node is assigned multiple positions on the ring (virtual nodes). A node with more capacity can be assigned more virtual nodes.
With Virtual Nodes:
Physical Node A → vnodes: A1(hash=90), A2(hash=220), A3(hash=310)
Physical Node B → vnodes: B1(hash=40), B2(hash=160), B3(hash=280)
This gives a more uniform distribution around the ring.
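A ring with virtual nodes can be sketched in a few lines (hypothetical `ConsistentHashRing` class; md5 stands in for whatever hash function the system actually uses):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent hash ring with virtual nodes."""

    def __init__(self, vnodes_per_node: int = 3):
        self.vnodes = vnodes_per_node
        self.ring = []          # sorted list of (hash, node) tuples

    def _hash(self, s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key: str) -> str:
        """First node clockwise from the key's position (wraps at 0)."""
        h = self._hash(key)
        i = bisect.bisect_right(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]
```

Adding a node only reassigns the keys that now fall on the new node's arcs; every key that changes owner moves to the new node, and all other assignments are untouched.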
Rebalancing¶
When nodes are added or removed, data must be rebalanced. Strategies include:
- Fixed number of partitions: Create many more partitions than nodes (e.g., 1000 partitions for 10 nodes). When a node joins, it takes partitions from existing nodes. Partitions do not change; only their assignment changes. Used by Elasticsearch, Riak, Couchbase.
- Dynamic partitioning: Partitions split when they exceed a size threshold and merge when they shrink. Used by HBase, MongoDB.
- Proportional to nodes: Number of partitions is proportional to the number of nodes. When a new node joins, it randomly splits existing partitions. Used by Cassandra.
Clock Synchronization¶
In a distributed system, there is no shared global clock. Each node has its own physical clock, and these clocks drift. This creates fundamental challenges for ordering events, determining causality, and implementing timeouts.
The Problem with Physical Clocks¶
Physical clocks (wall-clock time) on different machines are never perfectly synchronized. Even with NTP (Network Time Protocol), clocks can differ by milliseconds to tens of milliseconds. In rare cases, NTP corrections can cause clocks to jump backward. This means:
- You cannot reliably determine which of two events on different machines happened first by comparing timestamps.
- Last-Writer-Wins (LWW) using wall-clock timestamps can silently drop writes.
- Google Spanner uses GPS receivers and atomic clocks (exposed via the TrueTime API) to bound clock uncertainty to a few milliseconds, but this requires specialized hardware.
Lamport Clocks (Logical Clocks)¶
Leslie Lamport introduced logical clocks (1978) to capture the happens-before relationship without relying on physical time. Each node maintains a counter:
- Before each local event, increment the counter.
- When sending a message, include the counter value.
- When receiving a message, set the counter to
max(local_counter, received_counter) + 1.
Node A:  L=1 (event)   L=2 (send m1 to B)   L=3 (event)   L=4 (send m2 to C)
Node B:  L=3 (receive m1: max(0,2)+1)   L=4 (event)   L=5 (receive m3: max(4,3)+1)
Node C:  L=1 (event)   L=2 (event)   L=3 (send m3 to B)   L=5 (receive m2: max(3,4)+1)
Limitation: Lamport clocks provide a total order of events, but they cannot determine if two events are concurrent or causally related. If L(a) < L(b), it does not necessarily mean a happened before b; they might be concurrent.
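The three update rules can be sketched as:

```python
class LamportClock:
    """Sketch of a Lamport logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Rule 1: increment before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Rule 2: a message carries the sender's counter."""
        return self.tick()

    def receive(self, msg_time: int) -> int:
        """Rule 3: jump past both the local and the received time."""
        self.time = max(self.time, msg_time) + 1
        return self.time
```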
Vector Clocks¶
Vector clocks extend Lamport clocks to capture concurrency. Each node maintains a vector of counters, one per node in the system:
- Before each local event, increment your own entry.
- When sending a message, include the full vector.
- When receiving a message, take the element-wise maximum of your vector and the received vector, then increment your own entry.
Determining relationships:
- V(a) < V(b) (every component of V(a) <= V(b) and at least one is strictly less): a happened before b.
- Neither V(a) < V(b) nor V(b) < V(a): a and b are concurrent.
Node A Node B Node C
[1,0,0] (event)
[2,0,0] (send) ───▶ [2,1,0] (receive)
[2,2,0] (event)
[3,0,0] (event) [2,3,0] (send) ───▶ [2,3,1] (receive)
Compare [3,0,0] and [2,3,1]:
A: 3>2, B: 0<3, C: 0<1 → Neither dominates → CONCURRENT
Limitation: Vector size grows with the number of nodes. Not practical for systems with thousands of nodes.
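The comparison rule can be sketched as a small function over fixed-size vectors (one entry per node):

```python
def compare(va, vb):
    """Sketch: classify two vector clocks as 'before', 'after',
    'concurrent', or 'equal' (vectors must have the same length)."""
    a_le_b = all(x <= y for x, y in zip(va, vb))
    b_le_a = all(y <= x for x, y in zip(va, vb))
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # va happened before vb
    if b_le_a:
        return "after"        # vb happened before va
    return "concurrent"       # neither dominates
```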
Hybrid Logical Clocks (HLC)¶
HLCs combine physical timestamps with logical counters. They provide the benefits of logical clocks (causal ordering) while staying close to physical time (useful for snapshot reads and debugging). An HLC timestamp has two components: (physical_time, logical_counter).
- The physical component is always >= the node's wall-clock time.
- The logical counter breaks ties when physical times are equal.
- HLCs are bounded: the logical counter cannot grow unboundedly because it resets when the physical time advances.
Used by CockroachDB, MongoDB, and other systems that need causally ordered timestamps without specialized clock hardware.
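A sketch of the HLC update rules, assuming integer wall-clock readings (e.g., milliseconds) supplied by the caller; real implementations also guard against runaway clock skew:

```python
class HybridLogicalClock:
    """Sketch of HLC updates: a timestamp is (physical, logical), and
    the physical component never runs behind the local wall clock."""

    def __init__(self):
        self.pt = 0     # physical component
        self.lc = 0     # logical tie-breaker

    def now(self, wall_clock: int):
        """Local event or send: advance physical time if the wall clock
        moved forward, otherwise bump the logical counter."""
        if wall_clock > self.pt:
            self.pt, self.lc = wall_clock, 0
        else:
            self.lc += 1
        return (self.pt, self.lc)

    def receive(self, wall_clock: int, msg_pt: int, msg_lc: int):
        """Merge a received timestamp, keeping pt >= wall clock."""
        new_pt = max(self.pt, msg_pt, wall_clock)
        if new_pt == self.pt == msg_pt:
            self.lc = max(self.lc, msg_lc) + 1
        elif new_pt == self.pt:
            self.lc += 1
        elif new_pt == msg_pt:
            self.lc = msg_lc + 1
        else:
            self.lc = 0          # wall clock alone advanced: counter resets
        self.pt = new_pt
        return (self.pt, self.lc)
```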
Failure Detection and Fault Tolerance¶
In a distributed system, a node cannot distinguish between a crashed remote node and a very slow network. Failure detection is therefore inherently probabilistic — you can suspect a node has failed, but you cannot be certain.
Heartbeat-Based Detection¶
The simplest approach: each node periodically sends a heartbeat message to a monitor (or to all other nodes). If a heartbeat is not received within a timeout period, the node is suspected to have failed.
Node A ──── heartbeat (every 1s) ────▶ Monitor
Node B ──── heartbeat (every 1s) ────▶ Monitor
Node C ──── ✗ (no heartbeat for 5s) ──▶ Monitor: "Node C suspected failed"
Challenges:
- Too short timeout: False positives (healthy but slow nodes marked as failed). Causes unnecessary failovers, data rebalancing, and wasted resources.
- Too long timeout: Slow detection. Failed nodes keep receiving traffic, leading to errors and degraded user experience.
- Network congestion: Can cause heartbeats to be delayed, triggering false suspicion.
Phi Accrual Failure Detector¶
Instead of a binary alive/dead decision, the phi accrual failure detector outputs a suspicion level (phi, a continuous value). It works by:
- Recording the inter-arrival times of heartbeats from each node.
- Computing a statistical distribution of expected arrival times.
- Calculating phi based on how unlikely the current silence duration is, given the historical distribution.
A higher phi means higher suspicion. The application sets a threshold (e.g., phi > 8 means "consider failed"). This approach adapts automatically to varying network conditions and is used by Akka and Cassandra.
Phi (suspicion level)
│
12│ ╱ (long silence)
10│ ╱
8│─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╱─ ─ threshold
6│ ╱
4│ ╱
2│ ╱╲ ╱╲ ╱╲ ╱╲ ╱
0│───╱──╲──╱──╲──╱──╲──╱──╲─────╱──────────
└──────────────────────────────────────────── time
normal heartbeats heartbeat missed
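The phi computation can be sketched compactly if we assume exponentially distributed inter-arrival times (the simplification Cassandra's implementation also makes); the sliding-window size is an arbitrary choice here:

```python
import math

class PhiAccrualDetector:
    """Sketch of a phi accrual detector assuming exponentially
    distributed heartbeat inter-arrival times."""

    def __init__(self):
        self.intervals = []      # recent inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
            self.intervals = self.intervals[-100:]  # sliding window
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of the probability that a healthy node
        would still be silent after this long, given past behavior."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # For an exponential model, P(silence >= elapsed) = exp(-elapsed/mean),
        # so phi = (elapsed / mean) * log10(e).
        return (elapsed / mean) * math.log10(math.e)
```

With heartbeats arriving every second, phi crosses a threshold of 8 after roughly 18 seconds of silence; if the network slows and intervals stretch, the same threshold is reached proportionally later, which is the adaptivity the text describes.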
Gossip-Based Failure Detection¶
Nodes periodically exchange membership information with random peers. Each node maintains a list of other nodes with heartbeat counters. If a node's counter has not incremented for a configured period, it is marked as suspected. This approach is decentralized and scales well to large clusters. Used by Cassandra and Consul.
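The mechanics of gossip-based membership can be illustrated with a toy model. Everything here (synchronous ticks, the suspicion window, in-memory "peers") is a simplification for illustration, not how Cassandra or Consul structure their protocols:

```python
import random

SUSPECT_AFTER = 5  # local ticks without a counter increment (an assumption)

class GossipNode:
    """Toy gossip membership: each node carries a heartbeat counter and
    learns about peers by merging gossiped membership tables."""

    def __init__(self, name, peers):
        self.name = name
        self.peers = peers  # other GossipNode instances
        # node name -> (heartbeat counter, local tick when it last advanced)
        self.table = {name: (0, 0)}
        self.tick = 0

    def step(self):
        """One round: bump own heartbeat, gossip to one random peer."""
        self.tick += 1
        counter, _ = self.table[self.name]
        self.table[self.name] = (counter + 1, self.tick)
        random.choice(self.peers).merge(self.table)

    def merge(self, remote_table):
        """Adopt any entries with fresher (higher) heartbeat counters."""
        for node, (counter, _) in remote_table.items():
            local = self.table.get(node)
            if local is None or counter > local[0]:
                self.table[node] = (counter, self.tick)

    def suspects(self):
        """Nodes whose counter has not advanced within the window."""
        return [n for n, (_, seen) in self.table.items()
                if n != self.name and self.tick - seen > SUSPECT_AFTER]
```

The key property is that no node talks to everyone: information spreads epidemically through random pairwise exchanges, which is why the approach scales to large clusters.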
Byzantine Fault Tolerance¶
Most practical distributed systems assume crash-fault (fail-stop) behavior: a failed node simply stops responding. Byzantine faults are a more adversarial model where a failed node can behave arbitrarily — it can send conflicting messages, lie, or collude with other faulty nodes.
The Byzantine Generals Problem (Lamport, 1982) shows that a system with n nodes can tolerate f Byzantine-faulty nodes only if n >= 3f + 1; equivalently, at most floor((n - 1) / 3) nodes may be faulty. Protocols like PBFT (Practical Byzantine Fault Tolerance) achieve this bound but are expensive (O(n^2) message complexity per operation).
Byzantine fault tolerance is critical in:
- Blockchain/cryptocurrency networks (trustless, open participation).
- Safety-critical systems (aviation, space systems).
- Multi-party computation where participants may be adversarial.
Most enterprise distributed systems do not implement BFT because they operate in trusted environments where crash faults are the dominant failure mode.
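The n >= 3f + 1 bound translates into two small pieces of arithmetic that recur throughout BFT protocol design; the quorum formula below is the standard 2f + 1 used by PBFT-style protocols:

```python
def max_byzantine_faults(n: int) -> int:
    """Maximum Byzantine-faulty nodes an n-node cluster can tolerate,
    from the n >= 3f + 1 bound: f = floor((n - 1) / 3)."""
    return (n - 1) // 3

def bft_quorum(n: int) -> int:
    """Quorum size (2f + 1) guaranteeing any two quorums intersect
    in at least one honest node."""
    return 2 * max_byzantine_faults(n) + 1
```

For example, the smallest cluster that tolerates a single Byzantine fault has four nodes, and any two quorums of three must share at least one honest node. By contrast, crash-fault consensus (Raft, Paxos) needs only n >= 2f + 1, which is why BFT deployments are noticeably larger and chattier.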
The Eight Fallacies of Distributed Computing¶
Peter Deutsch identified the first seven of these false assumptions in 1994; James Gosling later added the eighth. They are mistakes that developers new to distributed computing tend to make, and internalizing them helps engineers design more robust systems.
1. The Network Is Reliable¶
Networks fail. Packets are dropped, cables are cut, switches crash, and cloud regions go offline. Design for retries, timeouts, idempotency, and graceful degradation. Every network call can fail, and often does under load.
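The standard defensive pattern is a retry loop with exponential backoff and jitter. A minimal sketch (the helper name, attempt count, and base delay are illustrative choices, and the wrapped operation must be idempotent for retries to be safe):

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Invoke `operation`, retrying transient failures with
    exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            # Full jitter spreads retries out in time, avoiding the
            # synchronized retry storms that make outages worse.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Retrying only specific, known-transient exceptions matters as much as the backoff: blindly retrying a non-idempotent write can turn one failure into a duplicate charge.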
2. Latency Is Zero¶
Even within the same datacenter, network round trips take hundreds of microseconds. Across regions, latency is 50-300ms. Chatty protocols (many small requests) perform far worse than batched ones. Design APIs to minimize round trips.
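A back-of-the-envelope calculation makes the chatty-vs-batched gap concrete. Assuming a 100ms cross-region round trip (a figure within the range above) and 100 records to fetch:

```python
RTT_MS = 100   # assumed cross-region round-trip time
RECORDS = 100

chatty_ms = RECORDS * RTT_MS   # one round trip per record
batched_ms = 1 * RTT_MS        # one round trip for the whole batch

# 10 seconds of pure network waiting vs. a tenth of a second:
# a 100x difference before any server-side work is counted.
```

This is why N+1 query patterns that are invisible on localhost become outages across regions.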
3. Bandwidth Is Infinite¶
Bandwidth is finite and shared. Large payloads (video, bulk data transfers) can saturate links, causing congestion and packet loss for all traffic. Compress data, paginate responses, and use streaming for large transfers.
4. The Network Is Secure¶
Every network boundary is an attack surface. Data in transit can be intercepted, modified, or replayed. Use TLS everywhere, authenticate all endpoints, and follow zero-trust principles. See Chapter 6 — Authentication & Security.
5. Topology Doesn't Change¶
Network topology changes constantly: VMs are migrated, load balancers are reconfigured, DNS records are updated, and nodes are added or removed. Do not hardcode IP addresses or assume static routing. Use service discovery and DNS-based resolution.
6. There Is One Administrator¶
In a distributed system, especially one spanning cloud providers or organizations, there are multiple administrators with different policies, update schedules, and priorities. You cannot assume coordinated maintenance windows or uniform configuration.
7. Transport Cost Is Zero¶
Serialization, deserialization, encryption, and network I/O all consume CPU, memory, and time. Data transfer between cloud regions or providers incurs monetary cost. Choose efficient serialization formats (Protocol Buffers, FlatBuffers) and minimize unnecessary data transfer.
8. The Network Is Homogeneous¶
Real networks consist of heterogeneous hardware, protocols, and configurations. Different parts of the network have different MTUs, buffer sizes, congestion control algorithms, and failure characteristics. Do not assume uniform network behavior.
Summary¶
Distributed systems are governed by fundamental theoretical constraints that no amount of engineering can circumvent. The CAP and PACELC theorems define the boundaries of what is achievable. Consistency models provide a spectrum of tradeoffs between correctness and performance. Consensus algorithms like Raft enable agreement among nodes but at a cost. Replication and partitioning strategies determine how data is distributed and how the system scales. Clock synchronization remains an unsolved problem in the general case, addressed through logical constructs. Failure detection is inherently probabilistic, and the eight fallacies remind us that distributed computing is fundamentally different from local computing.
Understanding these theoretical foundations is essential before attempting to design or operate any distributed system at scale.
| Topic | Key Takeaway |
|---|---|
| CAP / PACELC | You must choose; understand what your system sacrifices |
| Consistency models | Stronger consistency = higher latency and lower availability |
| Consensus (Raft/Paxos) | Enables agreement but requires majority availability |
| Distributed transactions | 2PC blocks on coordinator failure; sagas trade atomicity for availability |
| Replication | Single-leader is simplest; multi-leader and leaderless trade consistency for availability/latency |
| Partitioning | Consistent hashing minimizes remapping; virtual nodes improve balance |
| Clocks | Never trust wall-clock time for ordering; use logical or hybrid clocks |
| Failure detection | Binary detection causes false positives; use probabilistic approaches |
| Fallacies | Every assumption you make about the network is wrong |