
Consensus Protocols - Detailed Reference

Complete specifications and implementation details for major consensus protocols.

Paxos Complete Specification

Proposal Numbers

Proposal numbers must be:

  • Unique: No two proposers use the same number
  • Totally ordered: Any two can be compared

Implementation: (round_number, proposer_id) where proposer_id breaks ties.
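A minimal sketch of this scheme in Python, assuming each proposer knows its own integer id; tuple comparison gives the total order and proposer_id breaks ties between equal rounds (names here are illustrative):

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class ProposalNumber:
    # Compares by round first, then proposer_id: totally ordered, unique
    round: int
    proposer_id: int

@dataclass
class ProposalGenerator:
    proposer_id: int
    round: int = 0

    def next(self) -> ProposalNumber:
        # Fresh number strictly greater than any this proposer issued before
        self.round += 1
        return ProposalNumber(self.round, self.proposer_id)

# Two proposers in round 3: the tie is broken by proposer id
assert ProposalNumber(3, 2) > ProposalNumber(3, 1) > ProposalNumber(2, 9)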

Single-Decree Paxos State

Proposer state:

proposal_number: int
value: any

Acceptor state (persistent):

highest_promised: int    # Highest proposal number promised
accepted_proposal: int   # Number of accepted proposal (0 if none)
accepted_value: any      # Value of accepted proposal (null if none)

Message Format

Prepare (Phase 1a):

{
  type: "PREPARE",
  proposal_number: n
}

Promise (Phase 1b):

{
  type: "PROMISE",
  proposal_number: n,
  accepted_proposal: m,    # null if nothing accepted
  accepted_value: v        # null if nothing accepted
}

Accept (Phase 2a):

{
  type: "ACCEPT",
  proposal_number: n,
  value: v
}

Accepted (Phase 2b):

{
  type: "ACCEPTED",
  proposal_number: n,
  value: v
}

Proposer Algorithm

function propose(value):
    n = generate_proposal_number()

    # Phase 1: Prepare
    promises = []
    for acceptor in acceptors:
        send PREPARE(n) to acceptor
    # collect PROMISE(n, ...) replies into promises

    wait until |promises| > |acceptors|/2 or timeout

    if timeout:
        return FAILED

    # Choose value: adopt the value of the highest-numbered
    # accepted proposal reported in any promise, if one exists
    highest = promise in promises with max accepted_proposal
    if highest.accepted_value is not null:
        value = highest.accepted_value

    # Phase 2: Accept
    accepts = []
    for acceptor in acceptors:
        send ACCEPT(n, value) to acceptor
    # collect ACCEPTED(n, ...) replies into accepts

    wait until |accepts| > |acceptors|/2 or timeout

    if timeout:
        return FAILED

    return SUCCESS(value)

Acceptor Algorithm

on receive PREPARE(n):
    if n > highest_promised:
        highest_promised = n
        persist(highest_promised)
        reply PROMISE(n, accepted_proposal, accepted_value)
    else:
        # Optionally reply NACK(highest_promised)
        ignore or reject

on receive ACCEPT(n, v):
    if n >= highest_promised:
        highest_promised = n
        accepted_proposal = n
        accepted_value = v
        persist(highest_promised, accepted_proposal, accepted_value)
        reply ACCEPTED(n, v)
    else:
        ignore or reject

Multi-Paxos Optimization

Stable leader:

# Leader election (using Paxos or other method)
leader = elect_leader()

# Leader's Phase 1 for all future instances
leader sends PREPARE(n) for instance range [i, ∞)

# For each command:
function propose_as_leader(value, instance):
    # Skip Phase 1 if already leader
    for acceptor in acceptors:
        send ACCEPT(n, value, instance) to acceptor
    wait for majority ACCEPTED
    return SUCCESS

Paxos Safety Proof Sketch

Invariant: Once a value v is chosen, no different value can ever be chosen.

Proof:

  1. v chosen → accepted by a majority M under some proposal number n
  2. Any higher proposal n' must first obtain promises from a majority M'
  3. M and M' intersect → some acceptor in M' reports an accepted proposal numbered ≥ n
  4. The proposer of n' adopts the value of the highest-numbered accepted proposal it sees; by induction on proposal number, every accepted proposal numbered ≥ n has value v
  5. Hence every proposal above n carries v, and only v can be chosen
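The majority-intersection step can be sanity-checked directly. A minimal sketch (not part of the protocol itself) enumerating all majority pairs in a 5-node cluster:

from itertools import combinations

nodes = range(5)                      # 2f+1 nodes with f = 2
majority = len(nodes) // 2 + 1        # any quorum of 3

# Every pair of majorities shares at least one node, so a new proposer's
# Phase 1 quorum always contains someone who saw any chosen value.
for q1 in combinations(nodes, majority):
    for q2 in combinations(nodes, majority):
        assert set(q1) & set(q2), "majorities must intersect"
print("all majority pairs intersect")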

Raft Complete Specification

State

All servers (persistent):

currentTerm: int      # Latest term seen
votedFor: ServerId    # Candidate voted for in current term (null if none)
log[]: LogEntry       # Log entries

All servers (volatile):

commitIndex: int      # Highest log index known to be committed
lastApplied: int      # Highest log index applied to state machine

Leader (volatile, reinitialized after election):

nextIndex[]: int      # For each server, next log index to send
matchIndex[]: int     # For each server, highest log index replicated

LogEntry:

{
  term: int,
  command: any
}

RequestVote RPC

Request:

{
  term: int,              # Candidate's term
  candidateId: ServerId,  # Candidate requesting vote
  lastLogIndex: int,      # Index of candidate's last log entry
  lastLogTerm: int        # Term of candidate's last log entry
}

Response:

{
  term: int,              # currentTerm, for candidate to update itself
  voteGranted: bool       # True if candidate received vote
}

Receiver implementation:

on receive RequestVote(term, candidateId, lastLogIndex, lastLogTerm):
    if term < currentTerm:
        return {term: currentTerm, voteGranted: false}

    if term > currentTerm:
        currentTerm = term
        votedFor = null
        convert to follower

    # Check if candidate's log is at least as up-to-date as ours
    ourLastTerm = log[len(log)-1].term if log else 0
    ourLastIndex = len(log) - 1

    logOK = (lastLogTerm > ourLastTerm) or
            (lastLogTerm == ourLastTerm and lastLogIndex >= ourLastIndex)

    if (votedFor is null or votedFor == candidateId) and logOK:
        votedFor = candidateId
        persist(currentTerm, votedFor)
        reset election timer
        return {term: currentTerm, voteGranted: true}

    return {term: currentTerm, voteGranted: false}

AppendEntries RPC

Request:

{
  term: int,              # Leader's term
  leaderId: ServerId,     # For follower to redirect clients
  prevLogIndex: int,      # Index of log entry preceding new ones
  prevLogTerm: int,       # Term of prevLogIndex entry
  entries[]: LogEntry,    # Log entries to store (empty for heartbeat)
  leaderCommit: int       # Leader's commitIndex
}

Response:

{
  term: int,              # currentTerm, for leader to update itself
  success: bool           # True if follower had matching prevLog entry
}

Receiver implementation:

on receive AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit):
    if term < currentTerm:
        return {term: currentTerm, success: false}

    reset election timer

    if term > currentTerm:
        currentTerm = term
        votedFor = null

    convert to follower

    # Check log consistency
    if prevLogIndex >= len(log) or
       (prevLogIndex >= 0 and log[prevLogIndex].term != prevLogTerm):
        return {term: currentTerm, success: false}

    # Append new entries (handling conflicts)
    for i, entry in enumerate(entries):
        index = prevLogIndex + 1 + i
        if index < len(log):
            if log[index].term != entry.term:
                # Delete conflicting entry and all following
                log = log[:index]
                log.append(entry)
        else:
            log.append(entry)

    persist(currentTerm, votedFor, log)

    # Update commit index
    if leaderCommit > commitIndex:
        commitIndex = min(leaderCommit, len(log) - 1)

    return {term: currentTerm, success: true}

Leader Behavior

on becoming leader:
    for each server:
        nextIndex[server] = len(log)
        matchIndex[server] = 0

    start sending heartbeats

on receiving client command:
    append entry to local log
    persist log
    send AppendEntries to all followers

on receiving AppendEntries response from server:
    if response.success:
        matchIndex[server] = prevLogIndex + len(entries)
        nextIndex[server] = matchIndex[server] + 1

        # Update commit index
        for N from commitIndex+1 to len(log)-1:
            if log[N].term == currentTerm and
               |{s : matchIndex[s] >= N}| > |servers|/2:
                commitIndex = N
    else:
        nextIndex[server] = max(0, nextIndex[server] - 1)  # 0-indexed log
        retry AppendEntries with lower prevLogIndex

on commitIndex update:
    while lastApplied < commitIndex:
        lastApplied++
        apply log[lastApplied].command to state machine

Election Timeout

on election timeout (follower or candidate):
    currentTerm++
    convert to candidate
    votedFor = self
    persist(currentTerm, votedFor)
    reset election timer (randomized timeout, to reduce split votes)
    votes = 1  # Vote for self

    for each server except self:
        send RequestVote(currentTerm, self, lastLogIndex, lastLogTerm)

    wait for responses or timeout:
        if received votes > |servers|/2:
            become leader
        if received AppendEntries from valid leader:
            become follower
        if timeout:
            start new election

PBFT Complete Specification

Message Types

REQUEST:

{
  type: "REQUEST",
  operation: o,           # Operation to execute
  timestamp: t,           # Client timestamp (for reply matching)
  client: c               # Client identifier
}

PRE-PREPARE:

{
  type: "PRE-PREPARE",
  view: v,                # Current view number
  sequence: n,            # Sequence number
  digest: d,              # Hash of request
  request: m              # The request message
}
signature(primary)

PREPARE:

{
  type: "PREPARE",
  view: v,
  sequence: n,
  digest: d,
  replica: i              # Sending replica
}
signature(replica_i)

COMMIT:

{
  type: "COMMIT",
  view: v,
  sequence: n,
  digest: d,
  replica: i
}
signature(replica_i)

REPLY:

{
  type: "REPLY",
  view: v,
  timestamp: t,
  client: c,
  replica: i,
  result: r               # Execution result
}
signature(replica_i)

Replica State

view: int                       # Current view
sequence: int                   # Last assigned sequence number (primary)
log[]: {request, prepares, commits, state}  # Log of requests
prepared_certificates: {}       # Prepared certificates (pre-prepare + 2f matching prepares)
committed_certificates: {}      # Committed certificates (2f+1 matching commits)
h: int                          # Low water mark
H: int                          # High water mark (h + L)

Normal Operation Protocol

Primary (replica p = v mod |R|, where |R| is the number of replicas):

on receive REQUEST(m) from client:
    if not primary for current view:
        forward to primary
        return

    n = assign_sequence_number()
    d = hash(m)

    broadcast PRE-PREPARE(v, n, d, m) to all replicas
    add to log

All replicas:

on receive PRE-PREPARE(v, n, d, m) from primary:
    if v != current_view:
        ignore
    if already accepted pre-prepare for (v, n) with different digest:
        ignore
    if not in_view_as_backup(v):
        ignore
    if not h < n <= H:
        ignore  # Outside sequence window

    # Valid pre-prepare
    add to log
    broadcast PREPARE(v, n, d, i) to all replicas

on receive PREPARE(v, n, d, j) from replica j:
    if v != current_view:
        ignore

    add to log[n].prepares

    if |log[n].prepares| >= 2f and not already_prepared(v, n, d):
        # Prepared certificate complete
        mark as prepared
        broadcast COMMIT(v, n, d, i) to all replicas

on receive COMMIT(v, n, d, j) from replica j:
    if v != current_view:
        ignore

    add to log[n].commits

    if |log[n].commits| >= 2f + 1 and prepared(v, n, d):
        # Committed certificate complete
        if all requests with sequence < n have executed:
            execute(m)
            send REPLY(v, t, c, i, result) to client

View Change Protocol

Timeout trigger:

on request timeout (no progress in view v):
    stop accepting messages for view v (except CHECKPOINT, VIEW-CHANGE, NEW-VIEW)
    broadcast VIEW-CHANGE(v+1, n, C, P, i)

    where:
      n = last stable checkpoint sequence number
      C = checkpoint certificate (2f+1 checkpoint messages)
      P = set of prepared certificates for messages after n

VIEW-CHANGE:

{
  type: "VIEW-CHANGE",
  view: v,                      # New view number
  sequence: n,                  # Checkpoint sequence
  checkpoints: C,               # Checkpoint certificate
  prepared: P,                  # Set of prepared certificates
  replica: i
}
signature(replica_i)

New primary (p' = v mod |R|):

on receive 2f VIEW-CHANGE messages for view v from other replicas:
    V = set of 2f+1 valid view-change messages (including its own)

    # Compute O: set of requests to re-propose
    O = {}
    for seq from max_checkpoint_seq(V) + 1 to max_seq(V):
        if exists prepared certificate for seq in V:
            O[seq] = request from certificate
        else:
            O[seq] = null-request  # No-op

    broadcast NEW-VIEW(v, V, O)

    # Re-run protocol for requests in O
    for seq, request in O:
        if request != null:
            send PRE-PREPARE(v, seq, hash(request), request)

NEW-VIEW:

{
  type: "NEW-VIEW",
  view: v,
  view_changes: V,              # 2f+1 view-change messages
  pre_prepares: O               # Set of pre-prepare messages
}
signature(primary)

Checkpointing

Periodic stable checkpoints to garbage collect logs:

every K requests:
    state_hash = hash(state_machine_state)
    broadcast CHECKPOINT(n, state_hash, i)

on receive 2f+1 CHECKPOINT for (n, d):
    if all digests match:
        create stable checkpoint
        h = n  # Move low water mark
        garbage_collect(entries < n)

HotStuff Protocol

Linear complexity BFT using threshold signatures.

Key Innovation

  • Three voting phases (prepare → pre-commit → commit), followed by a decide step
  • Pipelining: the next proposal starts before the current one finishes
  • Threshold signatures: O(n) total messages instead of O(n²); each replica sends one partial signature and the leader aggregates 2f+1 into a quorum certificate (QC), as sketched below
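
A minimal sketch of QC formation, with partial signatures stubbed out as plain tuples. A real deployment would use a threshold scheme such as BLS; partial_sign, leader_collect, and QuorumCert are illustrative names, not from the HotStuff spec:

from dataclasses import dataclass

F = 1                               # tolerate f Byzantine faults, n = 3f + 1 = 4

def partial_sign(replica_id: int, digest: str) -> tuple:
    # Stand-in for a threshold partial signature (illustrative only)
    return (replica_id, digest)

@dataclass
class QuorumCert:
    phase: str          # "prepare", "pre-commit", or "commit"
    digest: str         # hash of the proposed node
    signatures: list    # 2f+1 partials; one aggregate signature in practice

def leader_collect(phase: str, digest: str, votes: list) -> "QuorumCert | None":
    # Leader aggregates matching partial signatures into a QC once 2f+1 arrive
    matching = [v for v in votes if v[1] == digest]
    if len(matching) >= 2 * F + 1:
        return QuorumCert(phase, digest, matching)
    return None         # not enough votes yet

votes = [partial_sign(i, "blockhash") for i in range(3)]   # 2f+1 = 3 replicas vote
assert leader_collect("prepare", "blockhash", votes) is not None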

Message Flow

Phase 1 (Prepare):
  Leader: broadcast PREPARE(v, node)
  Replicas: sign and send partial signature to leader
  Leader: aggregate into prepare certificate QC

Phase 2 (Pre-commit):
  Leader: broadcast PRE-COMMIT(v, QC_prepare)
  Replicas: sign and send partial signature
  Leader: aggregate into pre-commit certificate

Phase 3 (Commit):
  Leader: broadcast COMMIT(v, QC_precommit)
  Replicas: sign and send partial signature
  Leader: aggregate into commit certificate

Phase 4 (Decide):
  Leader: broadcast DECIDE(v, QC_commit)
  Replicas: execute and commit

Pipelining

Block k:   [prepare] [pre-commit] [commit] [decide]
Block k+1:          [prepare] [pre-commit] [commit] [decide]
Block k+2:                   [prepare] [pre-commit] [commit] [decide]

Each phase of block k+1 piggybacks on messages for block k.
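A minimal sketch of that schedule (illustrative bookkeeping only, not protocol logic), assuming phase p of block k runs in round k + p, so one round's message batch serves several in-flight blocks:

# Map round -> (block, phase) pairs that share that round's messages
PHASES = ["prepare", "pre-commit", "commit", "decide"]

def schedule(num_blocks: int, num_rounds: int) -> dict:
    rounds = {}
    for k in range(num_blocks):
        for p, phase in enumerate(PHASES):
            rounds.setdefault(k + p, []).append((k, phase))
    return {r: v for r, v in rounds.items() if r < num_rounds}

for r, work in schedule(4, 4).items():
    print(r, work)
# Round 3 serves four blocks at once:
# block 0 decide, block 1 commit, block 2 pre-commit, block 3 prepare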

Protocol Comparison Matrix

Feature              Paxos        Raft         PBFT         HotStuff
-------------------  -----------  -----------  -----------  -----------
Fault model          Crash        Crash        Byzantine    Byzantine
Fault tolerance      f with 2f+1  f with 2f+1  f with 3f+1  f with 3f+1
Message complexity   O(n)         O(n)         O(n²)        O(n)
Leader required      No (helps)   Yes          Yes          Yes
Phases               2            2            3            3
View change          Complex      Simple       Complex      Simple
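
The fault-tolerance column follows from quorum arithmetic; a minimal sketch of the minimum cluster sizes:

def min_cluster_size(f: int, byzantine: bool) -> int:
    # Crash quorums need only intersect (n = 2f+1); Byzantine quorums
    # must intersect in at least one honest node (n = 3f+1)
    return 3 * f + 1 if byzantine else 2 * f + 1

# Tolerating one fault: 3 nodes for Paxos/Raft, 4 for PBFT/HotStuff
assert min_cluster_size(1, byzantine=False) == 3
assert min_cluster_size(1, byzantine=True) == 4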