# Consensus Protocols - Detailed Reference

Complete specifications and implementation details for major consensus protocols.

## Paxos Complete Specification

### Proposal Numbers

Proposal numbers must be:

- **Unique**: No two proposers use the same number
- **Totally ordered**: Any two can be compared

**Implementation**: `(round_number, proposer_id)` where proposer_id breaks ties.

### Single-Decree Paxos State

**Proposer state**:

```
proposal_number: int
value: any
```

**Acceptor state (persistent)**:

```
highest_promised: int   # Highest proposal number promised
accepted_proposal: int  # Number of accepted proposal (0 if none)
accepted_value: any     # Value of accepted proposal (null if none)
```

### Message Format

**Prepare** (Phase 1a):

```
{
  type: "PREPARE",
  proposal_number: n
}
```

**Promise** (Phase 1b):

```
{
  type: "PROMISE",
  proposal_number: n,
  accepted_proposal: m,  # null if nothing accepted
  accepted_value: v      # null if nothing accepted
}
```

**Accept** (Phase 2a):

```
{
  type: "ACCEPT",
  proposal_number: n,
  value: v
}
```

**Accepted** (Phase 2b):

```
{
  type: "ACCEPTED",
  proposal_number: n,
  value: v
}
```

### Proposer Algorithm

```
function propose(value):
    n = generate_proposal_number()

    # Phase 1: Prepare
    promises = []
    for acceptor in acceptors:
        send PREPARE(n) to acceptor
    wait until |promises| > |acceptors|/2 or timeout
    if timeout:
        return FAILED

    # Choose value: adopt the value of the highest-numbered accepted proposal, if any
    highest = promise in promises with the largest accepted_proposal
    if highest.accepted_value is not null:
        value = highest.accepted_value

    # Phase 2: Accept
    accepts = []
    for acceptor in acceptors:
        send ACCEPT(n, value) to acceptor
    wait until |accepts| > |acceptors|/2 or timeout
    if timeout:
        return FAILED

    return SUCCESS(value)
```

### Acceptor Algorithm

```
on receive PREPARE(n):
    if n > highest_promised:
        highest_promised = n
        persist(highest_promised)
        reply PROMISE(n, accepted_proposal, accepted_value)
    else:
        # Optionally reply NACK(highest_promised)
        ignore or reject

on receive ACCEPT(n, v):
    if n >= highest_promised:
        highest_promised = n
        accepted_proposal = n
        accepted_value = v
        persist(highest_promised, accepted_proposal, accepted_value)
        reply ACCEPTED(n, v)
    else:
        ignore or reject
```

### Multi-Paxos Optimization

**Stable leader**:

```
# Leader election (using Paxos or other method)
leader = elect_leader()

# Leader runs Phase 1 once for all future instances
leader sends PREPARE(n) for instance range [i, ∞)

# For each command:
function propose_as_leader(value, instance):
    # Skip Phase 1 if already leader
    for acceptor in acceptors:
        send ACCEPT(n, value, instance) to acceptor
    wait for majority ACCEPTED
    return SUCCESS
```

### Paxos Safety Proof Sketch

**Invariant**: If a value v is chosen for instance i, no other value can be chosen.

**Proof**:

1. Value chosen → accepted by a majority with proposal number n
2. Any higher proposal n' must contact a majority in Phase 1
3. Majorities intersect → at least one acceptor reports having accepted v
4. The new proposer adopts the value of the highest-numbered accepted proposal it sees, which by induction is v
5. By induction, all future proposals use v
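The acceptor side is compact enough to show as runnable code. Below is a minimal Python sketch; the `Acceptor` class, dict-shaped replies, and the NACK response are illustrative choices rather than part of the specification above. It mirrors the two handlers, including the persist-before-reply requirement.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Acceptor:
    """Single-decree Paxos acceptor; a sketch mirroring the pseudocode above."""
    highest_promised: int = 0
    accepted_proposal: int = 0
    accepted_value: Optional[Any] = None

    def persist(self) -> None:
        # Stand-in for a durable write: a real acceptor must fsync its state
        # to stable storage before sending any reply.
        pass

    def on_prepare(self, n: int) -> dict:
        if n > self.highest_promised:
            self.highest_promised = n
            self.persist()
            return {"type": "PROMISE", "proposal_number": n,
                    "accepted_proposal": self.accepted_proposal or None,
                    "accepted_value": self.accepted_value}
        # Optional NACK lets the proposer retry with a higher proposal number
        return {"type": "NACK", "highest_promised": self.highest_promised}

    def on_accept(self, n: int, value: Any) -> dict:
        if n >= self.highest_promised:
            self.highest_promised = n
            self.accepted_proposal = n
            self.accepted_value = value
            self.persist()
            return {"type": "ACCEPTED", "proposal_number": n, "value": value}
        return {"type": "NACK", "highest_promised": self.highest_promised}
```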
## Raft Complete Specification

### State

**All servers (persistent)**:

```
currentTerm: int    # Latest term seen
votedFor: ServerId  # Candidate voted for in current term (null if none)
log[]: LogEntry     # Log entries
```

**All servers (volatile)**:

```
commitIndex: int  # Highest log index known to be committed
lastApplied: int  # Highest log index applied to state machine
```

**Leader (volatile, reinitialized after election)**:

```
nextIndex[]: int   # For each server, next log index to send
matchIndex[]: int  # For each server, highest log index replicated
```

**LogEntry**:

```
{
  term: int,
  command: any
}
```

### RequestVote RPC

**Request**:

```
{
  term: int,             # Candidate's term
  candidateId: ServerId, # Candidate requesting vote
  lastLogIndex: int,     # Index of candidate's last log entry
  lastLogTerm: int       # Term of candidate's last log entry
}
```

**Response**:

```
{
  term: int,        # currentTerm, for candidate to update itself
  voteGranted: bool # True if candidate received vote
}
```

**Receiver implementation**:

```
on receive RequestVote(term, candidateId, lastLogIndex, lastLogTerm):
    if term < currentTerm:
        return {term: currentTerm, voteGranted: false}
    if term > currentTerm:
        currentTerm = term
        votedFor = null
        convert to follower

    # Check if candidate's log is at least as up-to-date as ours
    ourLastTerm = log[len(log)-1].term if log else 0
    ourLastIndex = len(log) - 1
    logOK = (lastLogTerm > ourLastTerm) or
            (lastLogTerm == ourLastTerm and lastLogIndex >= ourLastIndex)

    if (votedFor is null or votedFor == candidateId) and logOK:
        votedFor = candidateId
        persist(currentTerm, votedFor)
        reset election timer
        return {term: currentTerm, voteGranted: true}

    return {term: currentTerm, voteGranted: false}
```

### AppendEntries RPC

**Request**:

```
{
  term: int,           # Leader's term
  leaderId: ServerId,  # For follower to redirect clients
  prevLogIndex: int,   # Index of log entry preceding new ones
  prevLogTerm: int,    # Term of prevLogIndex entry
  entries[]: LogEntry, # Log entries to store (empty for heartbeat)
  leaderCommit: int    # Leader's commitIndex
}
```

**Response**:

```
{
  term: int,     # currentTerm, for leader to update itself
  success: bool  # True if follower had matching prevLog entry
}
```

**Receiver implementation**:

```
on receive AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit):
    if term < currentTerm:
        return {term: currentTerm, success: false}

    reset election timer
    if term > currentTerm:
        currentTerm = term
        votedFor = null
        convert to follower

    # Check log consistency
    if prevLogIndex >= len(log) or
       (prevLogIndex >= 0 and log[prevLogIndex].term != prevLogTerm):
        return {term: currentTerm, success: false}

    # Append new entries (handling conflicts)
    for i, entry in enumerate(entries):
        index = prevLogIndex + 1 + i
        if index < len(log):
            if log[index].term != entry.term:
                # Delete conflicting entry and all following, then append
                log = log[:index]
                log.append(entry)
        else:
            log.append(entry)
    persist(currentTerm, votedFor, log)

    # Update commit index
    if leaderCommit > commitIndex:
        commitIndex = min(leaderCommit, len(log) - 1)

    return {term: currentTerm, success: true}
```

### Leader Behavior

```
on becoming leader:
    for each server:
        nextIndex[server] = len(log)
        matchIndex[server] = -1  # nothing known to be replicated yet (0-based log)
    start sending heartbeats

on receiving client command:
    append entry to local log
    persist log
    send AppendEntries to all followers

on receiving AppendEntries response from server:
    if response.success:
        matchIndex[server] = prevLogIndex + len(entries)
        nextIndex[server] = matchIndex[server] + 1
        # Update commit index
        for N from commitIndex+1 to len(log)-1:
            if log[N].term == currentTerm and
               |{s : matchIndex[s] >= N}| > |servers|/2:
                commitIndex = N
    else:
        nextIndex[server] = max(0, nextIndex[server] - 1)
        retry AppendEntries with lower prevLogIndex

on commitIndex update:
    while lastApplied < commitIndex:
        lastApplied++
        apply log[lastApplied].command to state machine
```
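The commit-index rule in the leader loop is the piece most often gotten wrong: an entry commits only once it is stored on a majority *and* was created in the leader's current term. A small Python sketch of just that step, assuming 0-based log indices and a `match_index` map that covers followers only (the leader counts itself implicitly):

```python
from typing import Dict, List


def advance_commit_index(commit_index: int,
                         log_terms: List[int],
                         match_index: Dict[str, int],
                         current_term: int,
                         cluster_size: int) -> int:
    """Return the highest index N > commit_index that is replicated on a
    majority of servers and whose entry was created in current_term."""
    new_commit = commit_index
    for n in range(commit_index + 1, len(log_terms)):
        if log_terms[n] != current_term:
            continue  # entries from older terms commit only indirectly
        votes = 1 + sum(1 for m in match_index.values() if m >= n)
        if votes > cluster_size // 2:
            new_commit = n
    return new_commit
```

With five servers a majority is three, so an entry commits once at least two followers report a matchIndex at or beyond it.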
### Election Timeout

```
on election timeout (follower or candidate):
    currentTerm++
    convert to candidate
    votedFor = self
    persist(currentTerm, votedFor)
    reset election timer
    votes = 1  # Vote for self
    for each server except self:
        send RequestVote(currentTerm, self, lastLogIndex, lastLogTerm)

    wait for responses or timeout:
        if received votes > |servers|/2:
            become leader
        if received AppendEntries from valid leader:
            become follower
        if timeout:
            start new election
```

## PBFT Complete Specification

### Message Types

**REQUEST**:

```
{
  type: "REQUEST",
  operation: o,  # Operation to execute
  timestamp: t,  # Client timestamp (for reply matching)
  client: c      # Client identifier
}
```

**PRE-PREPARE**:

```
{
  type: "PRE-PREPARE",
  view: v,      # Current view number
  sequence: n,  # Sequence number
  digest: d,    # Hash of request
  request: m    # The request message
} signature(primary)
```

**PREPARE**:

```
{
  type: "PREPARE",
  view: v,
  sequence: n,
  digest: d,
  replica: i  # Sending replica
} signature(replica_i)
```

**COMMIT**:

```
{
  type: "COMMIT",
  view: v,
  sequence: n,
  digest: d,
  replica: i
} signature(replica_i)
```

**REPLY**:

```
{
  type: "REPLY",
  view: v,
  timestamp: t,
  client: c,
  replica: i,
  result: r  # Execution result
} signature(replica_i)
```

### Replica State

```
view: int      # Current view
sequence: int  # Last assigned sequence number (primary)
log[]: {request, prepares, commits, state}  # Log of requests
prepared_certificates: {}   # Prepared certificates (pre-prepare + 2f matching prepares)
committed_certificates: {}  # Committed certificates (2f+1 commits)
h: int  # Low water mark
H: int  # High water mark (h + L)
```

### Normal Operation Protocol

**Primary (replica p = v mod |R|, where |R| is the number of replicas)**:

```
on receive REQUEST(m) from client:
    if not primary for current view:
        forward to primary
        return
    n = assign_sequence_number()
    d = hash(m)
    broadcast PRE-PREPARE(v, n, d, m) to all replicas
    add to log
```

**All replicas**:

```
on receive PRE-PREPARE(v, n, d, m) from primary:
    if v != current_view: ignore
    if already accepted pre-prepare for (v, n) with different digest: ignore
    if not in_view_as_backup(v): ignore
    if not h < n <= H: ignore  # Outside sequence window

    # Valid pre-prepare
    add to log
    broadcast PREPARE(v, n, d, i) to all replicas

on receive PREPARE(v, n, d, j) from replica j:
    if v != current_view: ignore
    add to log[n].prepares
    if |log[n].prepares| >= 2f and not already_prepared(v, n, d):
        # Prepared certificate complete
        mark as prepared
        broadcast COMMIT(v, n, d, i) to all replicas

on receive COMMIT(v, n, d, j) from replica j:
    if v != current_view: ignore
    add to log[n].commits
    if |log[n].commits| >= 2f + 1 and prepared(v, n, d):
        # Committed certificate complete
        if all entries < n are committed:
            execute(m)
            send REPLY(v, t, c, i, result) to client
```

### View Change Protocol

**Timeout trigger**:

```
on request timeout (no progress):
    view_change_timeout++
    broadcast VIEW-CHANGE(v+1, n, C, P, i) where:
        n = last stable checkpoint sequence number
        C = checkpoint certificate (2f+1 checkpoint messages)
        P = set of prepared certificates for messages after n
```

**VIEW-CHANGE**:

```
{
  type: "VIEW-CHANGE",
  view: v,         # New view number
  sequence: n,     # Checkpoint sequence
  checkpoints: C,  # Checkpoint certificate
  prepared: P,     # Set of prepared certificates
  replica: i
} signature(replica_i)
```

**New primary (p' = v mod |R|)**:

```
on receive 2f VIEW-CHANGE messages for view v (plus its own, giving 2f+1):
    V = set of valid view-change messages

    # Compute O: set of requests to re-propose
    O = {}
    for seq in max_checkpoint_seq(V) to max_seq(V):
        if exists prepared certificate for seq in V:
            O[seq] = request from certificate
        else:
            O[seq] = null-request  # No-op

    broadcast NEW-VIEW(v, V, O)

    # Re-run protocol for requests in O
    for seq, request in O:
        if request != null:
            send PRE-PREPARE(v, seq, hash(request), request)
```

**NEW-VIEW**:

```
{
  type: "NEW-VIEW",
  view: v,
  view_changes: V,  # 2f+1 view-change messages
  pre_prepares: O   # Set of pre-prepare messages
} signature(primary)
```
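The new primary's computation of O is plain bookkeeping over the prepared certificates carried in the view-change messages. A Python sketch under assumed message shapes (dicts with `sequence`, `view`, `prepared`, and `request` keys are illustrative, not PBFT's wire format):

```python
from typing import Dict, List, Optional

NULL_REQUEST = None  # stand-in for the no-op request


def compute_new_view_proposals(view_changes: List[dict]) -> Dict[int, Optional[dict]]:
    """Compute O for NEW-VIEW: for every sequence number after the highest
    stable checkpoint, re-propose the request from a prepared certificate
    if one exists in any view-change message, otherwise a no-op."""
    # Highest stable checkpoint sequence reported in any VIEW-CHANGE (min-s)
    min_s = max(vc["sequence"] for vc in view_changes)

    # Collect prepared certificates, keeping the one from the highest view
    # when the same sequence number appears more than once.
    prepared: Dict[int, tuple] = {}
    for vc in view_changes:
        for cert in vc["prepared"]:
            seq = cert["sequence"]
            if seq not in prepared or cert["view"] > prepared[seq][0]:
                prepared[seq] = (cert["view"], cert["request"])

    # Highest sequence number covered by any prepared certificate (max-s)
    max_s = max(prepared.keys(), default=min_s)

    proposals: Dict[int, Optional[dict]] = {}
    for seq in range(min_s + 1, max_s + 1):
        proposals[seq] = prepared[seq][1] if seq in prepared else NULL_REQUEST
    return proposals
```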
### Checkpointing

Periodic stable checkpoints to garbage-collect logs:

```
every K requests:
    state_hash = hash(state_machine_state)
    broadcast CHECKPOINT(n, state_hash, i)

on receive 2f+1 CHECKPOINT for (n, d):
    if all digests match:
        create stable checkpoint
        h = n  # Move low water mark
        garbage_collect(entries < n)
```

## HotStuff Protocol

Linear message complexity BFT using threshold signatures.

### Key Innovation

- **Three voting phases**: prepare → pre-commit → commit, followed by a decide step
- **Pipelining**: Next proposal starts before the current one finishes
- **Threshold signatures**: O(n) total messages instead of O(n²)

### Message Flow

```
Phase 1 (Prepare):
    Leader: broadcast PREPARE(v, node)
    Replicas: sign and send partial signature to leader
    Leader: aggregate into prepare certificate QC

Phase 2 (Pre-commit):
    Leader: broadcast PRE-COMMIT(v, QC_prepare)
    Replicas: sign and send partial signature
    Leader: aggregate into pre-commit certificate

Phase 3 (Commit):
    Leader: broadcast COMMIT(v, QC_precommit)
    Replicas: sign and send partial signature
    Leader: aggregate into commit certificate

Phase 4 (Decide):
    Leader: broadcast DECIDE(v, QC_commit)
    Replicas: execute and commit
```

### Pipelining

```
Block k:   [prepare]     [pre-commit]  [commit]      [decide]
Block k+1:               [prepare]     [pre-commit]  [commit]      [decide]
Block k+2:                             [prepare]     [pre-commit]  [commit]      [decide]
```

Each phase of block k+1 piggybacks on messages for block k.

## Protocol Comparison Matrix

| Feature | Paxos | Raft | PBFT | HotStuff |
|---------|-------|------|------|----------|
| Fault model | Crash | Crash | Byzantine | Byzantine |
| Fault tolerance | f with 2f+1 | f with 2f+1 | f with 3f+1 | f with 3f+1 |
| Message complexity | O(n) | O(n) | O(n²) | O(n) |
| Leader required | No (helps) | Yes | Yes | Yes |
| Phases | 2 | 2 | 3 | 3 |
| View change | Complex | Simple | Complex | Simple |
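The fault-tolerance row translates directly into cluster and quorum sizes. A small helper, assuming simple majority quorums for the crash-fault protocols and 2f+1 quorums out of 3f+1 replicas for the Byzantine ones:

```python
def crash_fault_sizes(f: int) -> tuple:
    """Crash model (Paxos, Raft): n = 2f+1 replicas, majority quorum of f+1."""
    n = 2 * f + 1
    return n, f + 1


def byzantine_fault_sizes(f: int) -> tuple:
    """Byzantine model (PBFT, HotStuff): n = 3f+1 replicas, quorum of 2f+1."""
    n = 3 * f + 1
    return n, 2 * f + 1


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f"f={f}: crash={crash_fault_sizes(f)}, byzantine={byzantine_fault_sizes(f)}")
```

For f = 1 this gives 3 replicas with a quorum of 2 for Paxos/Raft, and 4 replicas with a quorum of 3 for PBFT/HotStuff.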