# Consensus Protocols - Detailed Reference
Complete specifications and implementation details for major consensus protocols.
## Paxos Complete Specification
### Proposal Numbers
Proposal numbers must be:
- **Unique**: No two proposers use the same number
- **Totally ordered**: Any two can be compared
**Implementation**: `(round_number, proposer_id)` where proposer_id breaks ties.
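A minimal Python sketch of this scheme, relying on lexicographic tuple comparison (all names here are illustrative):
```python
# Proposal numbers as (round_number, proposer_id) tuples: tuples compare by
# round first, with the unique proposer_id breaking ties, so numbers from
# different proposers are totally ordered and never equal.
class ProposalNumbers:
    def __init__(self, proposer_id):
        self.proposer_id = proposer_id
        self.round = 0

    def observe(self, number):
        # Track the highest round seen from any peer so our next number wins.
        self.round = max(self.round, number[0])

    def next(self):
        self.round += 1
        return (self.round, self.proposer_id)

gen = ProposalNumbers(proposer_id=2)
assert gen.next() == (1, 2)
assert (1, 2) < (1, 3) < (2, 1)   # ties broken by proposer_id; rounds dominate
```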
### Single-Decree Paxos State
**Proposer state**:
```
proposal_number: int
value: any
```
**Acceptor state (persistent)**:
```
highest_promised: int # Highest proposal number promised
accepted_proposal: int # Number of accepted proposal (0 if none)
accepted_value: any # Value of accepted proposal (null if none)
```
### Message Format
**Prepare** (Phase 1a):
```
{
  type: "PREPARE",
  proposal_number: n
}
```
**Promise** (Phase 1b):
```
{
  type: "PROMISE",
  proposal_number: n,
  accepted_proposal: m,  # null if nothing accepted
  accepted_value: v      # null if nothing accepted
}
```
**Accept** (Phase 2a):
```
{
  type: "ACCEPT",
  proposal_number: n,
  value: v
}
```
**Accepted** (Phase 2b):
```
{
  type: "ACCEPTED",
  proposal_number: n,
  value: v
}
```
### Proposer Algorithm
```
function propose(value):
    n = generate_proposal_number()

    # Phase 1: Prepare
    promises = []
    for acceptor in acceptors:
        send PREPARE(n) to acceptor          # PROMISE replies are collected into promises
    wait until |promises| > |acceptors|/2 or timeout
    if timeout:
        return FAILED

    # Choose value: adopt the value of the highest-numbered accepted proposal, if any
    accepted = [p for p in promises if p.accepted_value is not null]
    if accepted is not empty:
        value = max(accepted, key=p.accepted_proposal).accepted_value

    # Phase 2: Accept
    accepts = []
    for acceptor in acceptors:
        send ACCEPT(n, value) to acceptor    # ACCEPTED replies are collected into accepts
    wait until |accepts| > |acceptors|/2 or timeout
    if timeout:
        return FAILED
    return SUCCESS(value)
```
### Acceptor Algorithm
```
on receive PREPARE(n):
    if n > highest_promised:
        highest_promised = n
        persist(highest_promised)
        reply PROMISE(n, accepted_proposal, accepted_value)
    else:
        # Optionally reply NACK(highest_promised)
        ignore or reject

on receive ACCEPT(n, v):
    if n >= highest_promised:
        highest_promised = n
        accepted_proposal = n
        accepted_value = v
        persist(highest_promised, accepted_proposal, accepted_value)
        reply ACCEPTED(n, v)
    else:
        ignore or reject
```
### Multi-Paxos Optimization
**Stable leader**:
```
# Leader election (using Paxos or other method)
leader = elect_leader()

# Leader's Phase 1 covers all future instances
leader sends PREPARE(n) for instance range [i, ∞)

# For each command:
function propose_as_leader(value, instance):
    # Skip Phase 1 if already leader
    for acceptor in acceptors:
        send ACCEPT(n, value, instance) to acceptor
    wait for majority ACCEPTED
    return SUCCESS
```
### Paxos Safety Proof Sketch
**Invariant**: Once a value v is chosen for an instance, no other value can be chosen for that instance.
**Proof sketch** (steps 3-4 are made concrete in the sketch below):
1. v chosen → accepted by a majority of acceptors under some proposal number n
2. Any higher-numbered proposal n' must first obtain promises from a majority
3. Majorities intersect → at least one acceptor that accepted v answers the prepare for n'
4. The proposer of n' adopts the value of the highest-numbered accepted proposal it sees, which (by induction) is v
5. Therefore every proposal numbered above n carries v, and only v can ever be chosen
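A minimal Python sketch of the two load-bearing facts, assuming five acceptors (names and data shapes are illustrative): any two majorities intersect, and Phase 1 adopts the highest-numbered accepted value.
```python
# Quorum intersection: with 5 acceptors, any two majorities (size 3) overlap,
# so a later proposer's prepare quorum always contains an acceptor that has
# already accepted the chosen value.
from itertools import combinations

acceptors = {"A", "B", "C", "D", "E"}
majorities = [set(q) for q in combinations(acceptors, 3)]
assert all(q1 & q2 for q1 in majorities for q2 in majorities)

# Phase 1 value adoption: take the value of the highest-numbered accepted
# proposal reported in the promises; only propose our own value if none exists.
def choose_value(promises, own_value):
    accepted = [p for p in promises if p["accepted_proposal"] is not None]
    if not accepted:
        return own_value
    return max(accepted, key=lambda p: p["accepted_proposal"])["accepted_value"]
```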
## Raft Complete Specification
### State
**All servers (persistent)**:
```
currentTerm: int # Latest term seen
votedFor: ServerId # Candidate voted for in current term (null if none)
log[]: LogEntry # Log entries
```
**All servers (volatile)**:
```
commitIndex: int # Highest log index known to be committed
lastApplied: int # Highest log index applied to state machine
```
**Leader (volatile, reinitialized after election)**:
```
nextIndex[]: int # For each server, next log index to send
matchIndex[]: int # For each server, highest log index replicated
```
**LogEntry**:
```
{
  term: int,
  command: any
}
```
### RequestVote RPC
**Request**:
```
{
  term: int,              # Candidate's term
  candidateId: ServerId,  # Candidate requesting vote
  lastLogIndex: int,      # Index of candidate's last log entry
  lastLogTerm: int        # Term of candidate's last log entry
}
```
**Response**:
```
{
  term: int,         # currentTerm, for candidate to update itself
  voteGranted: bool  # True if candidate received vote
}
```
**Receiver implementation**:
```
on receive RequestVote(term, candidateId, lastLogIndex, lastLogTerm):
    if term < currentTerm:
        return {term: currentTerm, voteGranted: false}
    if term > currentTerm:
        currentTerm = term
        votedFor = null
        convert to follower

    # Check if candidate's log is at least as up-to-date as ours
    ourLastTerm = log[len(log)-1].term if log else 0
    ourLastIndex = len(log) - 1
    logOK = (lastLogTerm > ourLastTerm) or
            (lastLogTerm == ourLastTerm and lastLogIndex >= ourLastIndex)

    if (votedFor is null or votedFor == candidateId) and logOK:
        votedFor = candidateId
        persist(currentTerm, votedFor)
        reset election timer
        return {term: currentTerm, voteGranted: true}
    return {term: currentTerm, voteGranted: false}
```
### AppendEntries RPC
**Request**:
```
{
  term: int,            # Leader's term
  leaderId: ServerId,   # For follower to redirect clients
  prevLogIndex: int,    # Index of log entry preceding new ones
  prevLogTerm: int,     # Term of prevLogIndex entry
  entries[]: LogEntry,  # Log entries to store (empty for heartbeat)
  leaderCommit: int     # Leader's commitIndex
}
```
**Response**:
```
{
  term: int,     # currentTerm, for leader to update itself
  success: bool  # True if follower had matching prevLog entry
}
```
**Receiver implementation**:
```
on receive AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit):
    if term < currentTerm:
        return {term: currentTerm, success: false}
    reset election timer
    if term > currentTerm:
        currentTerm = term
        votedFor = null
    convert to follower  # sender is the legitimate leader for this term

    # Check log consistency
    if prevLogIndex >= len(log) or
       (prevLogIndex >= 0 and log[prevLogIndex].term != prevLogTerm):
        return {term: currentTerm, success: false}

    # Append new entries (handling conflicts)
    for i, entry in enumerate(entries):
        index = prevLogIndex + 1 + i
        if index < len(log):
            if log[index].term != entry.term:
                # Delete conflicting entry and all following
                log = log[:index]
                log.append(entry)
        else:
            log.append(entry)
    persist(currentTerm, votedFor, log)

    # Update commit index
    if leaderCommit > commitIndex:
        commitIndex = min(leaderCommit, len(log) - 1)
    return {term: currentTerm, success: true}
```
### Leader Behavior
```
on becoming leader:
    for each server:
        nextIndex[server] = len(log)
        matchIndex[server] = 0
    start sending heartbeats

on receiving client command:
    append entry to local log
    persist log
    send AppendEntries to all followers

on receiving AppendEntries response from server:
    if response.success:
        matchIndex[server] = prevLogIndex + len(entries)
        nextIndex[server] = matchIndex[server] + 1
        # Update commit index (the leader counts itself: matchIndex[leader] = len(log) - 1)
        for N from commitIndex+1 to len(log)-1:
            if log[N].term == currentTerm and
               |{s : matchIndex[s] >= N}| > |servers|/2:
                commitIndex = N
    else:
        nextIndex[server] = max(0, nextIndex[server] - 1)
        retry AppendEntries with lower prevLogIndex

on commitIndex update:
    while lastApplied < commitIndex:
        lastApplied++
        apply log[lastApplied].command to state machine
```
### Election Timeout
```
on election timeout (follower or candidate):
    currentTerm++
    convert to candidate
    votedFor = self
    persist(currentTerm, votedFor)
    reset election timer
    votes = 1  # Vote for self
    for each server except self:
        send RequestVote(currentTerm, self, lastLogIndex, lastLogTerm)
    wait for responses or timeout:
        if received votes > |servers|/2:
            become leader
        if received AppendEntries from valid leader:
            become follower
        if timeout:
            start new election
```
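The pseudocode above leaves the timer itself abstract; Raft draws each election timeout uniformly at random from a fixed interval (the paper suggests roughly 150-300 ms) so that split votes are rare. A minimal sketch, with illustrative values:
```python
import random

# Illustrative bounds: the interval should be much larger than the broadcast
# time and much smaller than the expected time between server failures.
ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms():
    # A fresh random value is drawn each time the timer is reset, so two
    # followers rarely time out and become candidates at the same instant.
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)
```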
## PBFT Complete Specification
### Message Types
**REQUEST**:
```
{
  type: "REQUEST",
  operation: o,  # Operation to execute
  timestamp: t,  # Client timestamp (for reply matching)
  client: c      # Client identifier
}
```
**PRE-PREPARE**:
```
{
  type: "PRE-PREPARE",
  view: v,      # Current view number
  sequence: n,  # Sequence number
  digest: d,    # Hash of request
  request: m    # The request message
}
signature(primary)
```
**PREPARE**:
```
{
  type: "PREPARE",
  view: v,
  sequence: n,
  digest: d,
  replica: i  # Sending replica
}
signature(replica_i)
```
**COMMIT**:
```
{
  type: "COMMIT",
  view: v,
  sequence: n,
  digest: d,
  replica: i
}
signature(replica_i)
```
**REPLY**:
```
{
  type: "REPLY",
  view: v,
  timestamp: t,
  client: c,
  replica: i,
  result: r  # Execution result
}
signature(replica_i)
```
### Replica State
```
view: int # Current view
sequence: int # Last assigned sequence number (primary)
log[]: {request, prepares, commits, state} # Log of requests
prepared_certificates: {} # Prepared certificates (2f+1 prepares)
committed_certificates: {} # Committed certificates (2f+1 commits)
h: int # Low water mark
H: int # High water mark (h + L)
```
### Normal Operation Protocol
**Primary (replica p = v mod |R|, where |R| is the number of replicas)**:
```
on receive REQUEST(m) from client:
    if not primary for current view:
        forward to primary
        return
    n = assign_sequence_number()
    d = hash(m)
    broadcast PRE-PREPARE(v, n, d, m) to all replicas
    add to log
```
**All replicas**:
```
on receive PRE-PREPARE(v, n, d, m) from primary:
    if v != current_view:
        ignore
    if already accepted pre-prepare for (v, n) with different digest:
        ignore
    if not in_view_as_backup(v):
        ignore
    if not h < n <= H:
        ignore  # Outside sequence window
    # Valid pre-prepare
    add to log
    broadcast PREPARE(v, n, d, i) to all replicas

on receive PREPARE(v, n, d, j) from replica j:
    if v != current_view:
        ignore
    add to log[n].prepares
    if |log[n].prepares| >= 2f and not already_prepared(v, n, d):
        # Prepared certificate complete
        mark as prepared
        broadcast COMMIT(v, n, d, i) to all replicas

on receive COMMIT(v, n, d, j) from replica j:
    if v != current_view:
        ignore
    add to log[n].commits
    if |log[n].commits| >= 2f + 1 and prepared(v, n, d):
        # Committed certificate complete
        if all requests with sequence number < n have been executed:
            execute(m)
            send REPLY(v, t, c, i, result) to client
```
### View Change Protocol
**Timeout trigger**:
```
on request timeout (no progress in view v):
    stop accepting messages other than CHECKPOINT, VIEW-CHANGE, and NEW-VIEW
    broadcast VIEW-CHANGE(v+1, n, C, P, i)
    where:
        n = last stable checkpoint sequence number
        C = checkpoint certificate (2f+1 checkpoint messages)
        P = set of prepared certificates for messages after n
```
**VIEW-CHANGE**:
```
{
  type: "VIEW-CHANGE",
  view: v,         # New view number
  sequence: n,     # Checkpoint sequence
  checkpoints: C,  # Checkpoint certificate
  prepared: P,     # Set of prepared certificates
  replica: i
}
signature(replica_i)
```
**New primary (p' = v mod |R|, for the new view v)**:
```
on receive 2f VIEW-CHANGE messages for view v:
    V = set of valid view-change messages
    # Compute O: set of requests to re-propose
    O = {}
    for seq in max_checkpoint_seq(V) to max_seq(V):
        if exists prepared certificate for seq in V:
            O[seq] = request from certificate
        else:
            O[seq] = null-request  # No-op
    broadcast NEW-VIEW(v, V, O)
    # Re-run protocol for requests in O
    for seq, request in O:
        if request != null:
            send PRE-PREPARE(v, seq, hash(request), request)
```
**NEW-VIEW**:
```
{
  type: "NEW-VIEW",
  view: v,
  view_changes: V,  # 2f+1 view-change messages
  pre_prepares: O   # Set of pre-prepare messages
}
signature(primary)
```
### Checkpointing
Periodic stable checkpoints to garbage collect logs:
```
every K requests:
    state_hash = hash(state_machine_state)
    broadcast CHECKPOINT(n, state_hash, i)

on receive 2f+1 CHECKPOINT messages for (n, d):
    if all digests match:
        create stable checkpoint
        h = n  # Move low water mark
        garbage_collect(entries < n)
```
## HotStuff Protocol
Linear complexity BFT using threshold signatures.
### Key Innovation
- **Three voting phases**: prepare → pre-commit → commit, followed by a final decide step
- **Pipelining**: the next proposal starts before the current one finishes
- **Threshold signatures**: replicas vote to the leader, who aggregates the votes, giving O(n) messages per phase instead of PBFT's O(n²) (see the sketch below)
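A rough sketch of the leader-side vote collection (not a real threshold-signature implementation: the list of per-replica signatures below stands in for the single constant-size aggregate the actual protocol produces):
```python
# Leader collects n - f = 2f + 1 matching votes for (view, node) and forms a
# quorum certificate (QC). Real HotStuff combines the partial signatures into
# one threshold signature; a sorted list of pairs stands in for it here.
def form_qc(view, node_hash, votes, n, f):
    """votes: iterable of (replica_id, partial_signature) on (view, node_hash)."""
    unique = dict(votes)          # at most one vote counted per replica
    if len(unique) < n - f:       # quorum not yet reached
        return None
    return {"view": view, "node": node_hash, "sigs": sorted(unique.items())}
```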
### Message Flow
```
Phase 1 (Prepare):
    Leader:   broadcast PREPARE(v, node)
    Replicas: sign and send partial signature to leader
    Leader:   aggregate into prepare certificate QC

Phase 2 (Pre-commit):
    Leader:   broadcast PRE-COMMIT(v, QC_prepare)
    Replicas: sign and send partial signature
    Leader:   aggregate into pre-commit certificate

Phase 3 (Commit):
    Leader:   broadcast COMMIT(v, QC_precommit)
    Replicas: sign and send partial signature
    Leader:   aggregate into commit certificate

Phase 4 (Decide):
    Leader:   broadcast DECIDE(v, QC_commit)
    Replicas: execute and commit
```
### Pipelining
```
Block k:    [prepare] [pre-commit] [commit]     [decide]
Block k+1:            [prepare]    [pre-commit] [commit]     [decide]
Block k+2:                         [prepare]    [pre-commit] [commit] [decide]
```
In chained HotStuff, the messages for block k+1 double as the next phase for block k, so each view's single round of voting advances several in-flight blocks at once.
## Protocol Comparison Matrix
| Feature | Paxos | Raft | PBFT | HotStuff |
|---------|-------|------|------|----------|
| Fault model | Crash | Crash | Byzantine | Byzantine |
| Fault tolerance | f with 2f+1 | f with 2f+1 | f with 3f+1 | f with 3f+1 |
| Message complexity | O(n) | O(n) | O(n²) | O(n) |
| Leader required | No (helps) | Yes | Yes | Yes |
| Phases | 2 | 2 | 3 | 3 |
| View change | Complex | Simple | Complex | Simple |
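The fault-tolerance row follows directly from the required quorum sizes; a small worked example (helper names are illustrative):
```python
# Minimum cluster size and quorum size to tolerate f faulty nodes.
def crash_fault_tolerance(f):
    n = 2 * f + 1            # Paxos/Raft: any two majorities share >= 1 node
    return n, f + 1          # quorum = simple majority

def byzantine_fault_tolerance(f):
    n = 3 * f + 1            # PBFT/HotStuff: any two quorums share >= f + 1 nodes,
    return n, 2 * f + 1      # so every intersection contains an honest node

assert crash_fault_tolerance(2) == (5, 3)       # 5 servers tolerate 2 crashes
assert byzantine_fault_tolerance(1) == (4, 3)   # 4 replicas tolerate 1 Byzantine fault
```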