Added detailed pseudocode for elliptic curve algorithms covering modular arithmetic, point operations, scalar multiplication, and coordinate conversions. Also introduced a comprehensive knowledge base for distributed systems, including CAP theorem, consistency models, consensus protocols (e.g., Paxos, Raft, PBFT, Nakamoto), and fault-tolerant design principles.
14 KiB
Consensus Protocols - Detailed Reference
Complete specifications and implementation details for major consensus protocols.
Paxos Complete Specification
Proposal Numbers
Proposal numbers must be:
- Unique: No two proposers use the same number
- Totally ordered: Any two can be compared
Implementation: (round_number, proposer_id) where proposer_id breaks ties.
Single-Decree Paxos State
Proposer state:
proposal_number: int
value: any
Acceptor state (persistent):
highest_promised: int # Highest proposal number promised
accepted_proposal: int # Number of accepted proposal (0 if none)
accepted_value: any # Value of accepted proposal (null if none)
Message Format
Prepare (Phase 1a):
{
type: "PREPARE",
proposal_number: n
}
Promise (Phase 1b):
{
type: "PROMISE",
proposal_number: n,
accepted_proposal: m, # null if nothing accepted
accepted_value: v # null if nothing accepted
}
Accept (Phase 2a):
{
type: "ACCEPT",
proposal_number: n,
value: v
}
Accepted (Phase 2b):
{
type: "ACCEPTED",
proposal_number: n,
value: v
}
Proposer Algorithm
function propose(value):
n = generate_proposal_number()
# Phase 1: Prepare
promises = []
for acceptor in acceptors:
send PREPARE(n) to acceptor
wait until |promises| > |acceptors|/2 or timeout
if timeout:
return FAILED
# Choose value
highest = max(promises, key=p.accepted_proposal)
if highest.accepted_value is not null:
value = highest.accepted_value
# Phase 2: Accept
accepts = []
for acceptor in acceptors:
send ACCEPT(n, value) to acceptor
wait until |accepts| > |acceptors|/2 or timeout
if timeout:
return FAILED
return SUCCESS(value)
Acceptor Algorithm
on receive PREPARE(n):
if n > highest_promised:
highest_promised = n
persist(highest_promised)
reply PROMISE(n, accepted_proposal, accepted_value)
else:
# Optionally reply NACK(highest_promised)
ignore or reject
on receive ACCEPT(n, v):
if n >= highest_promised:
highest_promised = n
accepted_proposal = n
accepted_value = v
persist(highest_promised, accepted_proposal, accepted_value)
reply ACCEPTED(n, v)
else:
ignore or reject
Multi-Paxos Optimization
Stable leader:
# Leader election (using Paxos or other method)
leader = elect_leader()
# Leader's Phase 1 for all future instances
leader sends PREPARE(n) for instance range [i, ∞)
# For each command:
function propose_as_leader(value, instance):
# Skip Phase 1 if already leader
for acceptor in acceptors:
send ACCEPT(n, value, instance) to acceptor
wait for majority ACCEPTED
return SUCCESS
Paxos Safety Proof Sketch
Invariant: If a value v is chosen for instance i, no other value can be chosen.
Proof:
- Value chosen → accepted by majority with proposal n
- Any higher proposal n' must contact majority
- Majorities intersect → at least one acceptor has accepted v
- New proposer adopts v (or higher already-accepted value)
- By induction, all future proposals use v
Raft Complete Specification
State
All servers (persistent):
currentTerm: int # Latest term seen
votedFor: ServerId # Candidate voted for in current term (null if none)
log[]: LogEntry # Log entries
All servers (volatile):
commitIndex: int # Highest log index known to be committed
lastApplied: int # Highest log index applied to state machine
Leader (volatile, reinitialized after election):
nextIndex[]: int # For each server, next log index to send
matchIndex[]: int # For each server, highest log index replicated
LogEntry:
{
term: int,
command: any
}
RequestVote RPC
Request:
{
term: int, # Candidate's term
candidateId: ServerId, # Candidate requesting vote
lastLogIndex: int, # Index of candidate's last log entry
lastLogTerm: int # Term of candidate's last log entry
}
Response:
{
term: int, # currentTerm, for candidate to update itself
voteGranted: bool # True if candidate received vote
}
Receiver implementation:
on receive RequestVote(term, candidateId, lastLogIndex, lastLogTerm):
if term < currentTerm:
return {term: currentTerm, voteGranted: false}
if term > currentTerm:
currentTerm = term
votedFor = null
convert to follower
# Check if candidate's log is at least as up-to-date as ours
ourLastTerm = log[len(log)-1].term if log else 0
ourLastIndex = len(log) - 1
logOK = (lastLogTerm > ourLastTerm) or
(lastLogTerm == ourLastTerm and lastLogIndex >= ourLastIndex)
if (votedFor is null or votedFor == candidateId) and logOK:
votedFor = candidateId
persist(currentTerm, votedFor)
reset election timer
return {term: currentTerm, voteGranted: true}
return {term: currentTerm, voteGranted: false}
AppendEntries RPC
Request:
{
term: int, # Leader's term
leaderId: ServerId, # For follower to redirect clients
prevLogIndex: int, # Index of log entry preceding new ones
prevLogTerm: int, # Term of prevLogIndex entry
entries[]: LogEntry, # Log entries to store (empty for heartbeat)
leaderCommit: int # Leader's commitIndex
}
Response:
{
term: int, # currentTerm, for leader to update itself
success: bool # True if follower had matching prevLog entry
}
Receiver implementation:
on receive AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit):
if term < currentTerm:
return {term: currentTerm, success: false}
reset election timer
if term > currentTerm:
currentTerm = term
votedFor = null
convert to follower
# Check log consistency
if prevLogIndex >= len(log) or
(prevLogIndex >= 0 and log[prevLogIndex].term != prevLogTerm):
return {term: currentTerm, success: false}
# Append new entries (handling conflicts)
for i, entry in enumerate(entries):
index = prevLogIndex + 1 + i
if index < len(log):
if log[index].term != entry.term:
# Delete conflicting entry and all following
log = log[:index]
log.append(entry)
else:
log.append(entry)
persist(currentTerm, votedFor, log)
# Update commit index
if leaderCommit > commitIndex:
commitIndex = min(leaderCommit, len(log) - 1)
return {term: currentTerm, success: true}
Leader Behavior
on becoming leader:
for each server:
nextIndex[server] = len(log)
matchIndex[server] = 0
start sending heartbeats
on receiving client command:
append entry to local log
persist log
send AppendEntries to all followers
on receiving AppendEntries response from server:
if response.success:
matchIndex[server] = prevLogIndex + len(entries)
nextIndex[server] = matchIndex[server] + 1
# Update commit index
for N from commitIndex+1 to len(log)-1:
if log[N].term == currentTerm and
|{s : matchIndex[s] >= N}| > |servers|/2:
commitIndex = N
else:
nextIndex[server] = max(1, nextIndex[server] - 1)
retry AppendEntries with lower prevLogIndex
on commitIndex update:
while lastApplied < commitIndex:
lastApplied++
apply log[lastApplied].command to state machine
Election Timeout
on election timeout (follower or candidate):
currentTerm++
convert to candidate
votedFor = self
persist(currentTerm, votedFor)
reset election timer
votes = 1 # Vote for self
for each server except self:
send RequestVote(currentTerm, self, lastLogIndex, lastLogTerm)
wait for responses or timeout:
if received votes > |servers|/2:
become leader
if received AppendEntries from valid leader:
become follower
if timeout:
start new election
PBFT Complete Specification
Message Types
REQUEST:
{
type: "REQUEST",
operation: o, # Operation to execute
timestamp: t, # Client timestamp (for reply matching)
client: c # Client identifier
}
PRE-PREPARE:
{
type: "PRE-PREPARE",
view: v, # Current view number
sequence: n, # Sequence number
digest: d, # Hash of request
request: m # The request message
}
signature(primary)
PREPARE:
{
type: "PREPARE",
view: v,
sequence: n,
digest: d,
replica: i # Sending replica
}
signature(replica_i)
COMMIT:
{
type: "COMMIT",
view: v,
sequence: n,
digest: d,
replica: i
}
signature(replica_i)
REPLY:
{
type: "REPLY",
view: v,
timestamp: t,
client: c,
replica: i,
result: r # Execution result
}
signature(replica_i)
Replica State
view: int # Current view
sequence: int # Last assigned sequence number (primary)
log[]: {request, prepares, commits, state} # Log of requests
prepared_certificates: {} # Prepared certificates (2f+1 prepares)
committed_certificates: {} # Committed certificates (2f+1 commits)
h: int # Low water mark
H: int # High water mark (h + L)
Normal Operation Protocol
Primary (replica p = v mod n):
on receive REQUEST(m) from client:
if not primary for current view:
forward to primary
return
n = assign_sequence_number()
d = hash(m)
broadcast PRE-PREPARE(v, n, d, m) to all replicas
add to log
All replicas:
on receive PRE-PREPARE(v, n, d, m) from primary:
if v != current_view:
ignore
if already accepted pre-prepare for (v, n) with different digest:
ignore
if not in_view_as_backup(v):
ignore
if not h < n <= H:
ignore # Outside sequence window
# Valid pre-prepare
add to log
broadcast PREPARE(v, n, d, i) to all replicas
on receive PREPARE(v, n, d, j) from replica j:
if v != current_view:
ignore
add to log[n].prepares
if |log[n].prepares| >= 2f and not already_prepared(v, n, d):
# Prepared certificate complete
mark as prepared
broadcast COMMIT(v, n, d, i) to all replicas
on receive COMMIT(v, n, d, j) from replica j:
if v != current_view:
ignore
add to log[n].commits
if |log[n].commits| >= 2f + 1 and prepared(v, n, d):
# Committed certificate complete
if all entries < n are committed:
execute(m)
send REPLY(v, t, c, i, result) to client
View Change Protocol
Timeout trigger:
on request timeout (no progress):
view_change_timeout++
broadcast VIEW-CHANGE(v+1, n, C, P, i)
where:
n = last stable checkpoint sequence number
C = checkpoint certificate (2f+1 checkpoint messages)
P = set of prepared certificates for messages after n
VIEW-CHANGE:
{
type: "VIEW-CHANGE",
view: v, # New view number
sequence: n, # Checkpoint sequence
checkpoints: C, # Checkpoint certificate
prepared: P, # Set of prepared certificates
replica: i
}
signature(replica_i)
New primary (p' = v mod n):
on receive 2f VIEW-CHANGE for view v:
V = set of valid view-change messages
# Compute O: set of requests to re-propose
O = {}
for seq in max_checkpoint_seq(V) to max_seq(V):
if exists prepared certificate for seq in V:
O[seq] = request from certificate
else:
O[seq] = null-request # No-op
broadcast NEW-VIEW(v, V, O)
# Re-run protocol for requests in O
for seq, request in O:
if request != null:
send PRE-PREPARE(v, seq, hash(request), request)
NEW-VIEW:
{
type: "NEW-VIEW",
view: v,
view_changes: V, # 2f+1 view-change messages
pre_prepares: O # Set of pre-prepare messages
}
signature(primary)
Checkpointing
Periodic stable checkpoints to garbage collect logs:
every K requests:
state_hash = hash(state_machine_state)
broadcast CHECKPOINT(n, state_hash, i)
on receive 2f+1 CHECKPOINT for (n, d):
if all digests match:
create stable checkpoint
h = n # Move low water mark
garbage_collect(entries < n)
HotStuff Protocol
Linear complexity BFT using threshold signatures.
Key Innovation
- Three-phase: prepare → pre-commit → commit → decide
- Pipelining: Next proposal starts before current finishes
- Threshold signatures: O(n) total messages instead of O(n²)
Message Flow
Phase 1 (Prepare):
Leader: broadcast PREPARE(v, node)
Replicas: sign and send partial signature to leader
Leader: aggregate into prepare certificate QC
Phase 2 (Pre-commit):
Leader: broadcast PRE-COMMIT(v, QC_prepare)
Replicas: sign and send partial signature
Leader: aggregate into pre-commit certificate
Phase 3 (Commit):
Leader: broadcast COMMIT(v, QC_precommit)
Replicas: sign and send partial signature
Leader: aggregate into commit certificate
Phase 4 (Decide):
Leader: broadcast DECIDE(v, QC_commit)
Replicas: execute and commit
Pipelining
Block k: [prepare] [pre-commit] [commit] [decide]
Block k+1: [prepare] [pre-commit] [commit] [decide]
Block k+2: [prepare] [pre-commit] [commit] [decide]
Each phase of block k+1 piggybacks on messages for block k.
Protocol Comparison Matrix
| Feature | Paxos | Raft | PBFT | HotStuff |
|---|---|---|---|---|
| Fault model | Crash | Crash | Byzantine | Byzantine |
| Fault tolerance | f with 2f+1 | f with 2f+1 | f with 3f+1 | f with 3f+1 |
| Message complexity | O(n) | O(n) | O(n²) | O(n) |
| Leader required | No (helps) | Yes | Yes | Yes |
| Phases | 2 | 2 | 3 | 3 |
| View change | Complex | Simple | Complex | Simple |