| name | description |
|---|---|
| distributed-systems | This skill should be used when designing or implementing distributed systems, understanding consensus protocols (Paxos, Raft, PBFT, Nakamoto, PnyxDB), analyzing CAP theorem trade-offs, implementing logical clocks (Lamport, Vector, ITC), or building fault-tolerant architectures. Provides comprehensive knowledge of consensus algorithms, Byzantine fault tolerance, adversarial oracle protocols, replication strategies, causality tracking, and distributed system design principles. |
Distributed Systems
This skill provides deep knowledge of distributed systems design, consensus protocols, fault tolerance, and the fundamental trade-offs in building reliable distributed architectures.
When to Use This Skill
- Designing distributed databases or storage systems
- Implementing consensus protocols (Raft, Paxos, PBFT, Nakamoto, PnyxDB)
- Analyzing system trade-offs using CAP theorem
- Building fault-tolerant or Byzantine fault-tolerant systems
- Understanding replication and consistency models
- Implementing causality tracking with logical clocks
- Building blockchain consensus mechanisms
- Designing decentralized oracle systems
- Understanding adversarial attack vectors in distributed systems
CAP Theorem
The Fundamental Trade-off
The CAP theorem, introduced by Eric Brewer in 2000, states that a distributed data store cannot simultaneously provide more than two of:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response (without guarantee of most recent data)
- Partition Tolerance (P): System continues operating despite network partitions
Why P is Non-Negotiable
In any distributed system over a network:
- Network partitions will occur (cable cuts, router failures, congestion)
- A system that isn't partition-tolerant isn't truly distributed
- The real choice is between CP and AP during partitions
System Classifications
CP Systems (Consistency + Partition Tolerance)
Behavior during partition: Refuses some requests to maintain consistency.
Examples:
- MongoDB (with majority write concern)
- HBase
- Zookeeper
- etcd
Use when:
- Correctness is paramount (financial systems)
- Stale reads are unacceptable
- Brief unavailability is tolerable
AP Systems (Availability + Partition Tolerance)
Behavior during partition: Continues serving requests, may return stale data.
Examples:
- Cassandra
- DynamoDB
- CouchDB
- Riak
Use when:
- High availability is critical
- Eventual consistency is acceptable
- Shopping carts, social media feeds
CA Systems
Theoretical only: Cannot exist in distributed systems because partitions are inevitable.
Single-node databases are technically CA but aren't distributed.
PACELC Extension
PACELC extends CAP to address normal operation:
If there is a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
| System | P: A or C | E: L or C |
|---|---|---|
| DynamoDB | A | L |
| Cassandra | A | L |
| MongoDB | C | C |
| PNUTS | C | L |
Consistency Models
Strong Consistency
Every read returns the most recent write. Achieved through:
- Single leader with synchronous replication
- Consensus protocols (Paxos, Raft)
Trade-off: Higher latency, lower availability during failures.
Eventual Consistency
If no new updates, all replicas eventually converge to the same state.
Variants:
- Causal consistency: Preserves causally related operations order
- Read-your-writes: Clients see their own writes
- Monotonic reads: Never see older data after seeing newer
- Session consistency: Consistency within a session
Linearizability
Operations appear instantaneous at some point between invocation and response.
Provides:
- Single-object operations appear atomic
- Real-time ordering guarantees
- Foundation for distributed locks, leader election
Serializability
Transactions appear to execute in some serial order.
Note: Linearizability ≠ Serializability
- Linearizability: Single-operation recency guarantee
- Serializability: Multi-operation isolation guarantee
Consensus Protocols
The Consensus Problem
Getting distributed nodes to agree on a single value despite failures.
Requirements:
- Agreement: All correct nodes decide on the same value
- Validity: Decided value was proposed by some node
- Termination: All correct nodes eventually decide
Paxos
Developed by Leslie Lamport (1989/1998), foundational consensus algorithm.
Roles
- Proposers: Propose values
- Acceptors: Vote on proposals
- Learners: Learn decided values
Basic Protocol (Single-Decree)
Phase 1a: Prepare
Proposer → Acceptors: PREPARE(n)
- n is unique proposal number
Phase 1b: Promise
Acceptor → Proposer: PROMISE(n, accepted_proposal)
- If n > highest_seen: promise to ignore lower proposals
- Return previously accepted proposal if any
Phase 2a: Accept
Proposer → Acceptors: ACCEPT(n, v)
- v = value from highest accepted proposal, or proposer's own value
Phase 2b: Accepted
Acceptor → Learners: ACCEPTED(n, v)
- If n >= highest_promised: accept the proposal
Decision: Value is decided when majority of acceptors accept it.
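The acceptor side of these two phases fits in a few lines. A minimal Python sketch (class and method names are illustrative, not from any particular library):

```python
# Minimal single-decree Paxos acceptor sketch (illustrative, not production code).
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None  # value of that accepted proposal

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals lower than n."""
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted proposal so the proposer
            # must adopt its value in Phase 2a.
            return ("PROMISE", n, self.accepted_n, self.accepted_value)
        return ("NACK", n)

    def on_accept(self, n, value):
        """Phase 2b: accept if no higher proposal number was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("ACCEPTED", n, value)
        return ("NACK", n)

# A value is decided once a majority of acceptors return ACCEPTED for it.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.on_prepare(1) for a in acceptors]
accepts = [a.on_accept(1, "x=42") for a in acceptors]
assert sum(r[0] == "ACCEPTED" for r in accepts) >= 2  # majority of 3
```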
Multi-Paxos
Optimization for sequences of values:
- Elect stable leader
- Skip Phase 1 for subsequent proposals
- Significantly reduces message complexity
Strengths and Weaknesses
Strengths:
- Proven correct
- Tolerates f failures with 2f+1 nodes
- Foundation for many systems
Weaknesses:
- Complex to implement correctly
- No specified leader election
- Performance requires Multi-Paxos optimizations
Raft
Designed by Diego Ongaro and John Ousterhout (2013) for understandability.
Key Design Principles
- Decomposition: Separates leader election, log replication, safety
- State reduction: Minimizes states to consider
- Strong leader: All writes through leader
Server States
- Leader: Handles all client requests, replicates log
- Follower: Passive, responds to leader and candidates
- Candidate: Trying to become leader
Leader Election
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, increments term, votes for self
3. Requests votes from other servers
4. Wins with majority votes → becomes Leader
5. Loses (another leader) → becomes Follower
6. Timeout → starts new election
Safety: Only candidates with up-to-date logs can win.
Log Replication
1. Client sends command to Leader
2. Leader appends to local log
3. Leader sends AppendEntries to Followers
4. On majority acknowledgment: entry is committed
5. Leader applies to state machine, responds to client
6. Followers apply committed entries
Log Matching Property
If two logs contain entry with same index and term:
- Entries are identical
- All preceding entries are identical
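A minimal sketch of the follower-side AppendEntries handling that enforces this property, assuming a simplified log of (term, command) pairs with 0-based indices:

```python
# Follower-side AppendEntries handling, reduced to the log-matching
# consistency check (sketch, not a full Raft implementation).
def append_entries(log, current_term, term, prev_log_index, prev_log_term, entries):
    """log is a list of (term, command); indices are 0-based."""
    # 1. Reject requests from stale leaders.
    if term < current_term:
        return False
    # 2. Reject if our log has no entry matching (prev_log_index, prev_log_term).
    if prev_log_index >= 0:
        if prev_log_index >= len(log) or log[prev_log_index][0] != prev_log_term:
            return False
    # 3. Delete conflicting entries, then append the new ones.
    insert_at = prev_log_index + 1
    for i, entry in enumerate(entries):
        idx = insert_at + i
        if idx < len(log) and log[idx][0] != entry[0]:
            del log[idx:]            # conflict: truncate our log
        if idx >= len(log):
            log.append(entry)
    return True

log = [(1, "set x=1")]
ok = append_entries(log, current_term=2, term=2,
                    prev_log_index=0, prev_log_term=1,
                    entries=[(2, "set y=2")])
assert ok and log == [(1, "set x=1"), (2, "set y=2")]
```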
Term
Logical clock that increases with each election:
- Detects stale leaders
- Resolves conflicts
- Included in all messages
Comparison with Paxos
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex | Designed for clarity |
| Leader | Optional (Multi-Paxos) | Required |
| Log gaps | Allowed | Not allowed |
| Membership changes | Complex | Joint consensus |
| Implementations | Many variants | Consistent |
PBFT (Practical Byzantine Fault Tolerance)
Developed by Castro and Liskov (1999) for Byzantine faults.
Byzantine Faults
Nodes can behave arbitrarily:
- Crash
- Send incorrect messages
- Collude maliciously
- Act inconsistently to different nodes
Fault Tolerance
Tolerates f Byzantine faults with 3f+1 nodes.
Why 3f+1?
- Up to f replicas may be faulty and never respond, so a node can wait for only n - f replies
- Among those replies, up to f may come from Byzantine nodes that lie
- The honest replies must still outnumber the possible liars: (n - f) - f > f, hence n ≥ 3f + 1
Protocol Phases
Normal Operation (leader is honest):
1. REQUEST: Client → Primary (leader)
2. PRE-PREPARE: Primary → All replicas
- Primary assigns sequence number
3. PREPARE: Each replica → All replicas
- Validates pre-prepare
4. COMMIT: Each replica → All replicas
- After receiving 2f+1 prepares
5. REPLY: Each replica → Client
- After receiving 2f+1 commits
Client waits for f+1 matching replies.
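A rough sketch of the quorum bookkeeping a replica performs for one request slot, under the n = 3f + 1 assumption (message transport, signatures, and view changes omitted; names are illustrative):

```python
# Quorum bookkeeping for one PBFT request slot (sketch only).
class ReplicaSlot:
    def __init__(self, n):
        self.n = n
        self.f = (n - 1) // 3            # tolerated Byzantine replicas
        self.prepares = set()            # replica ids that sent matching PREPARE
        self.commits = set()             # replica ids that sent matching COMMIT
        self.prepared = False
        self.committed = False

    def on_prepare(self, replica_id):
        self.prepares.add(replica_id)
        # "Prepared" once 2f + 1 matching messages about this request are seen.
        if len(self.prepares) >= 2 * self.f + 1:
            self.prepared = True

    def on_commit(self, replica_id):
        self.commits.add(replica_id)
        if self.prepared and len(self.commits) >= 2 * self.f + 1:
            self.committed = True        # safe to execute and reply to the client

slot = ReplicaSlot(n=4)                  # f = 1
for r in (0, 1, 2):
    slot.on_prepare(r)
for r in (0, 1, 2):
    slot.on_commit(r)
assert slot.committed                    # 2f + 1 = 3 matching messages
```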
View Change
When primary appears faulty:
- Replicas timeout waiting for primary
- Broadcast VIEW-CHANGE with prepared certificates
- New primary collects 2f+1 view-changes
- Broadcasts NEW-VIEW with proof
- System resumes with new primary
Message Complexity
- Normal case: O(n²) messages per request
- View change: O(n³) messages
Scalability challenge: Quadratic messaging limits cluster size.
Optimizations
- Speculative execution: Execute before commit
- Batching: Group multiple requests
- MACs: Use MACs instead of digital signatures for most messages
- Threshold signatures: Reduce signature overhead
Modern BFT Variants
HotStuff (2019)
- Linear message complexity O(n)
- Used in LibraBFT (Diem), other blockchains
- Three-phase protocol with threshold signatures
Tendermint
- Blockchain-focused BFT
- Integrated with Cosmos SDK
- Immediate finality
QBFT (Quorum BFT)
- Enterprise-focused (ConsenSys/JPMorgan)
- Enhanced IBFT for Ethereum-based systems
Nakamoto Consensus
The consensus mechanism powering Bitcoin, introduced by Satoshi Nakamoto (2008).
Core Innovation
Combines three elements:
- Proof-of-Work (PoW): Cryptographic puzzle for block creation
- Longest Chain Rule: Fork resolution by accumulated work
- Probabilistic Finality: Security increases with confirmations
How It Works
1. Transactions broadcast to network
2. Miners collect transactions into blocks
3. Miners race to solve PoW puzzle:
- Find nonce such that Hash(block_header) < target
- Difficulty adjusts to maintain ~10 min block time
4. First miner to solve broadcasts block
5. Other nodes verify and append to longest chain
6. Miner receives block reward + transaction fees
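A toy mining loop illustrating the puzzle in step 3 (real Bitcoin hashes an 80-byte header with double SHA-256 against a compact-encoded target; this sketch only captures the hash-below-target idea):

```python
# Toy proof-of-work loop: find a nonce so the header hash falls below the target.
import hashlib

def mine(header: bytes, target: int, max_tries: int = 10_000_000):
    for nonce in range(max_tries):
        payload = header + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None, None

# A very easy target so the demo finishes quickly; lowering the target
# (raising difficulty) exponentially increases the expected work.
easy_target = 1 << 244
nonce, digest = mine(b"prev_hash|merkle_root|timestamp", easy_target)
print("found nonce", nonce, "->", digest)
```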
Longest Chain Rule
When forks occur:
Chain A: [genesis] → [1] → [2] → [3]
Chain B: [genesis] → [1] → [2'] → [3'] → [4']
Nodes follow Chain B (more accumulated work)
Chain A blocks become "orphaned"
Note: Actually "most accumulated work" not "most blocks"—a chain with fewer but harder blocks wins.
Security Model
Honest Majority Assumption: Protocol secure if honest mining power > 50%.
Formal analysis (Ren 2019):
Safe if: g²α > β
Where:
α = honest mining rate
β = adversarial mining rate
g = growth rate accounting for network delay
Δ = maximum network delay
Implications:
- Larger block interval → more security margin
- Higher network delay → need more honest majority
- 10-minute block time provides safety margin for global network
Probabilistic Finality
No instant finality—deeper blocks are exponentially harder to reverse:
| Confirmations | Attack Success Probability (10% attacker) |
|---|---|
| 1 | ~20% |
| 2 | ~5% |
| 3 | ~1.3% |
| 6 | ~0.02% |
| 10 | ~0.0001% |
Convention: 6 confirmations (~1 hour) considered "final" for Bitcoin.
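The table values above follow the attacker catch-up calculation from the Bitcoin whitepaper, which can be reproduced directly:

```python
# Attacker catch-up probability after z confirmations, following the
# Poisson-based calculation in the Bitcoin whitepaper.
from math import exp, factorial

def attacker_success(q: float, z: int) -> float:
    """q = attacker's share of hashrate, z = confirmations observed."""
    p = 1.0 - q
    lam = z * (q / p)
    s = 1.0
    for k in range(z + 1):
        poisson = exp(-lam) * lam**k / factorial(k)
        s -= poisson * (1 - (q / p) ** (z - k))
    return s

for z in (1, 2, 3, 6, 10):
    print(z, round(attacker_success(0.10, z), 7))
# ~0.205, ~0.051, ~0.013, ~0.0002, ~0.0000012 -- matching the table above
```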
Attacks
51% Attack: Attacker with majority hashrate can:
- Double-spend transactions
- Prevent confirmations
- NOT: steal funds, change consensus rules, create invalid transactions
Selfish Mining: Strategic block withholding to waste honest miners' work.
- Profitable with < 50% hashrate under certain conditions
- Mitigated by network propagation improvements
Long-Range Attacks: Not applicable to PoW (unlike PoS).
Trade-offs vs Traditional BFT
| Aspect | Nakamoto | Classical BFT |
|---|---|---|
| Finality | Probabilistic | Immediate |
| Throughput | Low (~7 TPS) | Higher |
| Participants | Permissionless | Permissioned |
| Energy | High (PoW) | Low |
| Fault tolerance | 50% hashrate | 33% nodes |
| Scalability | Global | Limited nodes |
PnyxDB: Leaderless Democratic BFT
Developed by Bonniot, Neumann, and Taïani (2019) for consortia applications.
Key Innovation: Conditional Endorsements
Unlike leader-based BFT, PnyxDB uses leaderless quorums with conditional endorsements:
- Endorsements track conflicts between transactions
- If transactions commute (no conflicting operations), quorums built independently
- Non-commuting transactions handled via Byzantine Veto Procedure (BVP)
Transaction Lifecycle
1. Client broadcasts transaction to endorsers
2. Endorsers evaluate against application-defined policies
3. If no conflicts: endorser sends acknowledgment
4. If conflicts detected: endorser sends a conditional endorsement specifying which transactions must NOT be committed for this one to be valid
5. Transaction commits when quorum of valid endorsements collected
6. BVP resolves conflicting transactions
Byzantine Veto Procedure (BVP)
Ensures termination with conflicting transactions:
- Transactions have deadlines
- Conflicting endorsements trigger resolution loop
- Protocol guarantees exit when deadline passes
- At most f Byzantine nodes tolerated with n endorsers
Application-Level Voting
Unique feature: nodes can endorse or reject transactions based on application-defined policies without compromising consistency.
Use cases:
- Consortium governance decisions
- Policy-based access control
- Democratic decision making
Performance
Compared to BFT-SMaRt and Tendermint:
- 11x faster commit latencies
- < 5 seconds in worldwide geo-distributed deployment
- Tested with 180 nodes
Implementation
- Written in Go (requires Go 1.11+)
- Uses gossip broadcast for message propagation
- Web-of-trust node authentication
- Scales to hundreds/thousands of nodes
Replication Strategies
Single-Leader Replication
Clients → Leader → Followers
Pros: Simple, strong consistency possible
Cons: Leader bottleneck, failover complexity
Synchronous vs Asynchronous
| Type | Durability | Latency | Availability |
|---|---|---|---|
| Synchronous | Guaranteed | High | Lower |
| Asynchronous | At-risk | Low | Higher |
| Semi-synchronous | Balanced | Medium | Medium |
Multi-Leader Replication
Multiple nodes accept writes, replicate to each other.
Use cases:
- Multi-datacenter deployment
- Clients with offline operation
Challenges:
- Write conflicts
- Conflict resolution complexity
Conflict Resolution
- Last-write-wins (LWW): Timestamp-based, may lose data
- Application-specific: Custom merge logic
- CRDTs: Mathematically guaranteed convergence
Leaderless Replication
Any node can accept reads and writes.
Examples: Dynamo, Cassandra, Riak
Quorum Reads/Writes
n = total replicas
w = write quorum (nodes that must acknowledge write)
r = read quorum (nodes that must respond to read)
For strong consistency: w + r > n
Common configurations:
- n=3, w=2, r=2: Tolerates 1 failure
- n=5, w=3, r=3: Tolerates 2 failures
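A quick helper for sanity-checking quorum configurations against the w + r > n rule (illustrative only):

```python
# Check that read and write quorums overlap (w + r > n) and how many
# node failures each operation can survive.
def check_quorum(n: int, w: int, r: int):
    return {
        "overlapping (w + r > n)": w + r > n,
        "writes survive failures": n - w,
        "reads survive failures": n - r,
    }

print(check_quorum(3, 2, 2))   # overlap: True, tolerates 1 failure
print(check_quorum(5, 3, 3))   # overlap: True, tolerates 2 failures
print(check_quorum(3, 1, 1))   # overlap: False -> stale reads possible
```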
Sloppy Quorums and Hinted Handoff
During partitions:
- Write to available nodes (even if not home replicas)
- "Hints" stored for unavailable nodes
- Hints replayed when nodes recover
Failure Modes
Crash Failures
Node stops responding. Simplest failure model.
Detection: Heartbeats, timeouts
Tolerance: 2f+1 nodes for f failures (Paxos, Raft)
Byzantine Failures
Arbitrary behavior including malicious.
Detection: Difficult without redundancy
Tolerance: 3f+1 nodes for f failures (PBFT)
Network Partitions
Nodes can't communicate with some other nodes.
Impact: Forces CP vs AP choice
Recovery: Reconciliation after partition heals
Split Brain
Multiple nodes believe they are leader.
Prevention:
- Fencing (STONITH: Shoot The Other Node In The Head)
- Quorum-based leader election
- Lease-based leadership
Design Patterns
State Machine Replication
Replicate deterministic state machine across nodes:
- All replicas start in same state
- Apply same commands in same order
- All reach same final state
Requires: Total order broadcast (consensus)
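A miniature illustration: given the same totally ordered log, deterministic replicas converge to the same state (producing that order is the job of a consensus protocol):

```python
# State machine replication in miniature: every deterministic replica that
# applies the same ordered log reaches the same state.
def apply_log(log):
    state = {}
    for op, key, value in log:          # commands must be deterministic
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

ordered_log = [("set", "x", 1), ("set", "y", 2), ("del", "x", None)]
replica_a = apply_log(ordered_log)
replica_b = apply_log(ordered_log)
assert replica_a == replica_b == {"y": 2}
```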
Chain Replication
Head → Node2 → Node3 → ... → Tail
- Writes enter at head, propagate down chain
- Reads served by tail (strongly consistent)
- Simple, high throughput
Primary-Backup
Primary handles all operations, synchronously replicates to backups.
Failover: Backup promoted to primary on failure.
Quorum Systems
Intersecting sets ensure consistency:
- Any read quorum intersects any write quorum
- Guarantees reads see latest write
Balancing Trade-offs
Identifying Critical Requirements
1. Correctness requirements
   - Is data loss acceptable?
   - Can operations be reordered?
   - Are conflicts resolvable?
2. Availability requirements
   - What's acceptable downtime?
   - Geographic distribution needs?
   - Partition recovery strategy?
3. Performance requirements
   - Latency targets?
   - Throughput needs?
   - Consistency cost tolerance?
Vulnerability Mitigation by Protocol
Paxos/Raft (Crash Fault Tolerant)
Vulnerabilities:
- Leader failure causes brief unavailability
- Split-brain without proper fencing
- Slow follower impacts commit latency (sync replication)
Mitigations:
- Fast leader election (pre-voting)
- Quorum-based fencing
- Flexible quorum configurations
- Learner nodes for read scaling
PBFT (Byzantine Fault Tolerant)
Vulnerabilities:
- O(n²) messages limit scalability
- View change is expensive
- Requires 3f+1 nodes (more infrastructure)
Mitigations:
- Batching and pipelining
- Optimistic execution (HotStuff)
- Threshold signatures
- Hierarchical consensus for scaling
Choosing the Right Protocol
| Scenario | Recommended | Rationale |
|---|---|---|
| Internal infrastructure | Raft | Simple, well-understood |
| High consistency needs | Raft/Paxos | Proven correctness |
| Public/untrusted network | PBFT variant | Byzantine tolerance |
| Blockchain | HotStuff/Tendermint | Linear complexity BFT |
| Eventually consistent | Dynamo-style | High availability |
| Global distribution | Multi-leader + CRDTs | Partition tolerance |
Implementation Considerations
Timeouts
- Heartbeat interval: 100-300ms typical
- Election timeout: ~10x heartbeat, randomized per node to avoid split votes
- Request timeout: Application-dependent
Persistence
What must be persisted before acknowledgment:
- Raft: Current term, voted-for, log entries
- PBFT: View number, prepared/committed certificates
Membership Changes
Dynamic cluster membership:
- Raft: Joint consensus (old + new config)
- Paxos: α-reconfiguration
- PBFT: View change with new configuration
Testing
- Jepsen: Distributed systems testing framework
- Chaos engineering: Intentional failure injection
- Formal verification: TLA+, Coq proofs
Adversarial Oracle Protocols
Oracles bridge on-chain smart contracts with off-chain data, but introduce trust assumptions into trustless systems.
The Oracle Problem
Definition: The security, authenticity, and trust conflict between third-party oracles and the trustless execution of smart contracts.
Core Challenge: Blockchains cannot verify correctness of external data. Oracles become:
- Single points of failure
- Targets for manipulation
- Trust assumptions in "trustless" systems
Attack Vectors
Price Oracle Manipulation
Flash Loan Attacks:
1. Borrow large amount via flash loan (no collateral)
2. Manipulate price on DEX (large trade)
3. Oracle reads manipulated price
4. Smart contract executes with wrong price
5. Profit from arbitrage/liquidation
6. Repay flash loan in same transaction
Notable Example: Harvest Finance ($30M+ loss, 2020)
Data Source Attacks
- Compromised API: Single data source manipulation
- Front-running: Oracle updates exploited before on-chain
- Liveness attacks: Preventing oracle updates
- Bribery: Incentivizing oracle operators to lie
Economic Attacks
Cost of Corruption Analysis:
If oracle controls value V:
- Attack profit: V
- Attack cost: oracle stake + reputation
- Rational to attack if: profit > cost
Implication: Oracles must have stake > value they secure.
Decentralized Oracle Networks (DONs)
Chainlink Model
Multi-layer Security:
1. Multiple independent data sources
2. Multiple independent node operators
3. Aggregation (median, weighted average)
4. Reputation system
5. Cryptoeconomic incentives (staking)
Data Aggregation:
Nodes: [Oracle₁: $100, Oracle₂: $101, Oracle₃: $150, Oracle₄: $100]
Median: $100.50
Outlier (Oracle₃) has minimal impact
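The same aggregation in code (a trivial sketch using Python's statistics module):

```python
# Median aggregation of oracle reports: a single outlier barely moves the result.
from statistics import median

reports = {"oracle_1": 100.0, "oracle_2": 101.0, "oracle_3": 150.0, "oracle_4": 100.0}
print(median(reports.values()))   # 100.5 -- the $150 outlier has minimal impact
```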
Reputation and Staking
Node reputation based on:
- Historical accuracy
- Response time
- Uptime
- Stake amount
Job assignment weighted by reputation
Slashing for misbehavior
Oracle Design Patterns
Time-Weighted Average Price (TWAP)
Resist single-block manipulation:
TWAP = Σ(price_i × duration_i) / total_duration
Example over 1 hour:
- 30 min at $100: 30 × 100 = 3000
- 20 min at $101: 20 × 101 = 2020
- 10 min at $150 (manipulation): 10 × 150 = 1500
TWAP = 6520 / 60 = $108.67 (vs $150 spot)
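The same computation as a small helper, reproducing the worked example:

```python
# Time-weighted average price over a window of (price, duration) observations.
def twap(observations):
    """observations: list of (price, duration) pairs covering the window."""
    total_duration = sum(d for _, d in observations)
    return sum(p * d for p, d in observations) / total_duration

window = [(100, 30), (101, 20), (150, 10)]   # 10 manipulated minutes at $150
print(round(twap(window), 2))                # 108.67, far from the $150 spot print
```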
Commit-Reveal Schemes
Prevent front-running oracle updates:
Phase 1 (Commit):
- Oracle commits: hash(price || salt)
- Cannot be read by others
Phase 2 (Reveal):
- Oracle reveals: price, salt
- Contract verifies hash matches
- All oracles reveal simultaneously
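A minimal commit-reveal sketch using a hash commitment over price and salt (function names are illustrative):

```python
# Commit-reveal sketch: publish a hash commitment first, reveal value + salt later,
# and verify the reveal against the commitment.
import hashlib
import secrets

def commit(price: str, salt: bytes) -> str:
    return hashlib.sha256(price.encode() + salt).hexdigest()

def verify(commitment: str, price: str, salt: bytes) -> bool:
    return commit(price, salt) == commitment

salt = secrets.token_bytes(16)
c = commit("101.25", salt)            # phase 1: only the hash is published
assert verify(c, "101.25", salt)      # phase 2: reveal checks out
assert not verify(c, "150.00", salt)  # a changed price no longer matches
```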
Schelling Points
Game-theoretic oracle coordination:
1. Multiple oracles submit answers
2. Consensus answer determined
3. Oracles matching consensus rewarded
4. Outliers penalized
Assumption: Honest answer is "obvious" Schelling point
Trusted Execution Environments (TEEs)
Hardware-based oracle security:
TEE (Intel SGX, ARM TrustZone):
- Isolated execution environment
- Code attestation
- Protected memory
- External data fetching inside enclave
Benefits:
- Verifiable computation
- Protected from host machine
- Cryptographic proofs of execution
Limitations:
- Hardware trust assumption
- Side-channel attacks possible
- Intel SGX vulnerabilities discovered
Oracle Types by Data Source
| Type | Source | Trust Model | Use Case |
|---|---|---|---|
| Price feeds | Exchanges | Multiple sources | DeFi |
| Randomness | VRF/DRAND | Cryptographic | Gaming, NFTs |
| Event outcomes | Manual report | Reputation | Prediction markets |
| Cross-chain | Other blockchains | Bridge security | Interoperability |
| Computation | Off-chain compute | Verifiable | Complex logic |
Defense Mechanisms
- Diversification: Multiple independent oracles
- Economic security: Stake > protected value
- Time delays: Allow dispute periods
- Circuit breakers: Pause on anomalous data
- TWAP: Resist flash manipulation
- Commit-reveal: Prevent front-running
- Reputation: Long-term incentives
Hybrid Approaches
Optimistic Oracles:
1. Oracle posts answer + bond
2. Dispute window (e.g., 2 hours)
3. If disputed: escalate to arbitration
4. If not disputed: answer accepted
5. Incorrect oracle loses bond
Example: UMA's Optimistic Oracle
Causality and Logical Clocks
Physical clocks cannot reliably order events in distributed systems due to clock drift and synchronization issues. Logical clocks provide ordering based on causality.
The Happened-Before Relation
Defined by Leslie Lamport (1978):
Event a happened-before event b (a → b) if:
- a and b are in the same process, and a comes before b
- a is a send event and b is the corresponding receive
- There exists c such that a → c and c → b (transitivity)
If neither a → b nor b → a, events are concurrent (a || b).
Lamport Clocks
Simple scalar timestamps providing partial ordering.
Rules:
1. Each process maintains counter C
2. Before each event: C = C + 1
3. Send message m with timestamp C
4. On receive: C = max(C, message_timestamp) + 1
Properties:
- If a → b, then C(a) < C(b)
- Limitation: C(a) < C(b) does NOT imply a → b
- Cannot detect concurrent events
Use cases:
- Total ordering with tie-breaker (process ID)
- Distributed snapshots
- Simple event ordering
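A minimal Lamport clock sketch implementing the rules above:

```python
# Lamport clock: a scalar counter per process, merged on receive.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                      # local event or send
        self.time += 1
        return self.time

    def on_receive(self, msg_time):      # merge with the sender's timestamp
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()              # a: 1
t_recv = b.on_receive(t_send)  # b: 2 -- send happened-before receive
assert t_send < t_recv
```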
Vector Clocks
Array of counters, one per process. Captures full causality.
Structure (for n processes):
VC[1..n] where VC[i] is process i's logical time
Rules (at process i):
1. Before each event: VC[i] = VC[i] + 1
2. Send message with full vector VC
3. On receive from j:
   - Merge element-wise: VC[k] = max(VC[k], received_VC[k]) for k in 1..n
   - Then increment own entry: VC[i] = VC[i] + 1
Comparison (for vectors V1 and V2):
V1 = V2 iff ∀i: V1[i] = V2[i]
V1 ≤ V2 iff ∀i: V1[i] ≤ V2[i]
V1 < V2 iff V1 ≤ V2 and V1 ≠ V2
V1 || V2 iff NOT(V1 ≤ V2) and NOT(V2 ≤ V1) # concurrent
Properties:
- a → b iff VC(a) < VC(b)
- a || b iff VC(a) || VC(b)
- Full causality detection
Trade-off: O(n) space per event, where n = number of processes.
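A minimal vector clock sketch for a fixed set of processes, including the comparison that distinguishes causal order from concurrency:

```python
# Vector clock for a fixed set of n processes, with causality/concurrency checks.
class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n   # one counter per process
        self.i = i         # this process's index

    def tick(self):                       # local event or send
        self.v[self.i] += 1
        return list(self.v)

    def on_receive(self, other):          # element-wise max, then tick
        self.v = [max(a, b) for a, b in zip(self.v, other)]
        self.v[self.i] += 1

def happened_before(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return not happened_before(v1, v2) and not happened_before(v2, v1)

p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
a = p0.tick()               # event a at p0: [1, 0]
b = p1.tick()               # event b at p1: [0, 1] -- no message exchanged
assert concurrent(a, b)     # vector clocks detect the concurrency
p1.on_receive(a)            # after receiving, p1's clock dominates a
assert happened_before(a, list(p1.v))
```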
Interval Tree Clocks (ITC)
Developed by Almeida, Baquero, and Fonte (2008) for dynamic systems.
Problem with Vector Clocks:
- Static: size fixed to max number of processes
- ID retirement requires global coordination
- Unsuitable for high-churn systems (P2P)
ITC Solution:
- Binary tree structure for ID space
- Dynamic ID allocation and deallocation
- Localized fork/join operations
Core Operations:
fork(id): Split ID into two children
- Parent retains left half
- New process gets right half
join(id1, id2): Merge two IDs
- Combine ID trees
- Localized operation, no global coordination
event(id, stamp): Increment logical clock
peek(id, stamp): Read without increment
ID Space Representation:
          1          # Full ID space
         / \
        0   1        # After one fork
       / \
      0   1          # After another fork (left child)
Stamp (Clock) Representation:
- Tree structure mirrors ID space
- Each node has base value + optional children
- Efficient representation of sparse vectors
Example:
Initial: id=(1), stamp=0
Fork: id1=(1,0), stamp1=0
id2=(0,1), stamp2=0
Event at id1: stamp1=(0,(1,0))
Join id1+id2: id=(1), stamp=max of both
Advantages over Vector Clocks:
- Constant-size representation possible
- Dynamic membership without global state
- Efficient ID garbage collection
- Causality preserved across reconfigurations
Use cases:
- Peer-to-peer systems
- Mobile/ad-hoc networks
- Systems with frequent node join/leave
Version Vectors
Specialization of vector clocks for tracking data versions.
Difference from Vector Clocks:
- Vector clocks: track all events
- Version vectors: track data updates only
Usage in Dynamo-style systems:
Client reads with version vector V1
Client writes with version vector V2
Server compares:
- If V1 < current: stale read, conflict possible
- If V1 = current: safe update
- If V1 || current: concurrent writes, need resolution
Hybrid Logical Clocks (HLC)
Combines physical and logical time.
Structure:
HLC = (physical_time, logical_counter)
Rules:
1. On local/send event:

       pt = physical_clock()
       if pt > l:
           l = pt
           c = 0
       else:
           c = c + 1
       return (l, c)

2. On receive with timestamp (l', c'):

       pt = physical_clock()
       if pt > l and pt > l':
           l = pt
           c = 0
       elif l' > l:
           l = l'
           c = c' + 1
       elif l > l':
           c = c + 1
       else:  # l = l'
           c = max(c, c') + 1
       return (l, c)
Properties:
- Bounded drift from physical time
- Captures causality like Lamport clocks
- Timestamps comparable to wall-clock time
- Used in CockroachDB, Google Spanner
Comparison of Logical Clocks
| Clock Type | Space | Causality | Concurrency | Dynamic |
|---|---|---|---|---|
| Lamport | O(1) | Partial | No | Yes |
| Vector | O(n) | Full | Yes | No |
| ITC | O(log n)* | Full | Yes | Yes |
| HLC | O(1) | Partial | No | Yes |
*ITC space varies based on tree structure
Practical Applications
Conflict Detection (Vector Clocks):
    if V1 < V2:
        # V1 is an ancestor of V2: no conflict
    elif V1 > V2:
        # V2 is an ancestor of V1: no conflict
    else:  # V1 || V2
        # Concurrent updates: conflict resolution needed
Causal Broadcast:
Deliver message m with VC only when:
1. VC[sender] = local_VC[sender] + 1 (next expected from sender)
2. ∀j ≠ sender: VC[j] ≤ local_VC[j] (all causal deps satisfied)
Snapshot Algorithms:
Consistent cut: set of events S where
if e ∈ S and f → e, then f ∈ S
Vector clocks make this efficiently verifiable
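A small helper implementing the causal-broadcast delivery check above (assumes vector clocks represented as plain lists indexed by process id):

```python
# Causal broadcast delivery check: the message must be the next one expected
# from its sender, and all of its other causal dependencies must already be
# reflected in the local vector clock.
def can_deliver(msg_vc, sender, local_vc):
    if msg_vc[sender] != local_vc[sender] + 1:
        return False
    return all(msg_vc[j] <= local_vc[j]
               for j in range(len(local_vc)) if j != sender)

local = [2, 0, 1]
assert can_deliver([3, 0, 1], sender=0, local_vc=local)      # next from sender 0
assert not can_deliver([3, 1, 1], sender=0, local_vc=local)  # missing a dep from process 1
```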
References
For detailed protocol specifications and proofs, see:
- references/consensus-protocols.md - Detailed protocol descriptions
- references/consistency-models.md - Formal consistency definitions
- references/failure-scenarios.md - Failure mode analysis
- references/logical-clocks.md - Clock algorithms and implementations