---
name: distributed-systems
description: This skill should be used when designing or implementing distributed systems, understanding consensus protocols (Paxos, Raft, PBFT, Nakamoto, PnyxDB), analyzing CAP theorem trade-offs, implementing logical clocks (Lamport, Vector, ITC), or building fault-tolerant architectures. Provides comprehensive knowledge of consensus algorithms, Byzantine fault tolerance, adversarial oracle protocols, replication strategies, causality tracking, and distributed system design principles.
---

# Distributed Systems

This skill provides deep knowledge of distributed systems design, consensus protocols, fault tolerance, and the fundamental trade-offs in building reliable distributed architectures.

## When to Use This Skill

- Designing distributed databases or storage systems
- Implementing consensus protocols (Raft, Paxos, PBFT, Nakamoto, PnyxDB)
- Analyzing system trade-offs using CAP theorem
- Building fault-tolerant or Byzantine fault-tolerant systems
- Understanding replication and consistency models
- Implementing causality tracking with logical clocks
- Building blockchain consensus mechanisms
- Designing decentralized oracle systems
- Understanding adversarial attack vectors in distributed systems

## CAP Theorem

### The Fundamental Trade-off

The CAP theorem, introduced by Eric Brewer in 2000, states that a distributed data store cannot simultaneously provide more than two of:

1. **Consistency (C)**: Every read receives the most recent write or an error
2. **Availability (A)**: Every request receives a non-error response (without guarantee of most recent data)
3. **Partition Tolerance (P)**: System continues operating despite network partitions

### Why P is Non-Negotiable

In any distributed system over a network:
- Network partitions **will** occur (cable cuts, router failures, congestion)
- A system that isn't partition-tolerant isn't truly distributed
- The real choice is between **CP** and **AP** during partitions

### System Classifications

#### CP Systems (Consistency + Partition Tolerance)

**Behavior during partition**: Refuses some requests to maintain consistency.

**Examples**:
- MongoDB (with majority write concern)
- HBase
- Zookeeper
- etcd

**Use when**:
- Correctness is paramount (financial systems)
- Stale reads are unacceptable
- Brief unavailability is tolerable

#### AP Systems (Availability + Partition Tolerance)

**Behavior during partition**: Continues serving requests, may return stale data.

**Examples**:
- Cassandra
- DynamoDB
- CouchDB
- Riak

**Use when**:
- High availability is critical
- Eventual consistency is acceptable
- Shopping carts, social media feeds

#### CA Systems

**Theoretical only**: Cannot exist in distributed systems because partitions are inevitable.

Single-node databases are technically CA but aren't distributed.

### PACELC Extension

PACELC extends CAP to address normal operation:

> If there is a **P**artition, choose between **A**vailability and **C**onsistency.
> **E**lse (normal operation), choose between **L**atency and **C**onsistency.

| System | P: A or C | E: L or C |
|--------|-----------|-----------|
| DynamoDB | A | L |
| Cassandra | A | L |
| MongoDB | C | C |
| PNUTS | C | L |

## Consistency Models

### Strong Consistency

Every read returns the most recent write. Achieved through:
- Single leader with synchronous replication
- Consensus protocols (Paxos, Raft)

**Trade-off**: Higher latency, lower availability during failures.

### Eventual Consistency

If no new updates, all replicas eventually converge to the same state.

**Variants**:
- **Causal consistency**: Preserves the order of causally related operations
- **Read-your-writes**: Clients see their own writes
- **Monotonic reads**: A client never sees older data after it has seen newer data
- **Session consistency**: Consistency within a session

### Linearizability

Operations appear instantaneous at some point between invocation and response.

**Provides**:
- Single-object operations appear atomic
- Real-time ordering guarantees
- Foundation for distributed locks, leader election

### Serializability

Transactions appear to execute in some serial order.

**Note**: Linearizability ≠ Serializability
- Linearizability: Single-operation recency guarantee
- Serializability: Multi-operation isolation guarantee

## Consensus Protocols

### The Consensus Problem

Getting distributed nodes to agree on a single value despite failures.

**Requirements**:
1. **Agreement**: All correct nodes decide on the same value
2. **Validity**: Decided value was proposed by some node
3. **Termination**: All correct nodes eventually decide

### Paxos

Developed by Leslie Lamport (first described in 1989, published in 1998), Paxos is the foundational consensus algorithm.

#### Roles

- **Proposers**: Propose values
- **Acceptors**: Vote on proposals
- **Learners**: Learn decided values

#### Basic Protocol (Single-Decree)

**Phase 1a: Prepare**
```
Proposer → Acceptors: PREPARE(n)
- n is unique proposal number
```

**Phase 1b: Promise**
```
Acceptor → Proposer: PROMISE(n, accepted_proposal)
- If n > highest_seen: promise to ignore lower proposals
- Return previously accepted proposal if any
```

**Phase 2a: Accept**
```
Proposer → Acceptors: ACCEPT(n, v)
- v = value from highest accepted proposal, or proposer's own value
```

**Phase 2b: Accepted**
```
Acceptor → Learners: ACCEPTED(n, v)
- If n >= highest_promised: accept the proposal
```

**Decision**: Value is decided when majority of acceptors accept it.
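
To make the two acceptor rules concrete, here is a minimal single-decree acceptor sketch in Python (illustrative only: message transport, durable storage, and learner notification are omitted, and all names are hypothetical):

```python
# Minimal single-decree Paxos acceptor state (sketch).
class Acceptor:
    def __init__(self):
        self.promised_n = None   # highest proposal number promised
        self.accepted_n = None   # number of the accepted proposal, if any
        self.accepted_v = None   # accepted value, if any

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals lower than n."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", n, self.accepted_n, self.accepted_v)
        return ("NACK", self.promised_n)

    def on_accept(self, n, v):
        """Phase 2b: accept unless a higher proposal was already promised."""
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("ACCEPTED", n, v)
        return ("NACK", self.promised_n)
```

The proposer side would collect PROMISE responses from a majority and, if any carry an accepted value, adopt the one with the highest accepted proposal number before sending ACCEPT.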

#### Multi-Paxos

Optimization for sequences of values:
- Elect stable leader
- Skip Phase 1 for subsequent proposals
- Significantly reduces message complexity

#### Strengths and Weaknesses

**Strengths**:
- Proven correct
- Tolerates f failures with 2f+1 nodes
- Foundation for many systems

**Weaknesses**:
- Complex to implement correctly
- Leader election left unspecified by the basic protocol
- Performance requires Multi-Paxos optimizations

### Raft

Designed by Diego Ongaro and John Ousterhout (2013) with understandability as an explicit goal.

#### Key Design Principles

1. **Decomposition**: Separates leader election, log replication, safety
2. **State reduction**: Minimizes states to consider
3. **Strong leader**: All writes through leader

#### Server States

- **Leader**: Handles all client requests, replicates log
- **Follower**: Passive, responds to leader and candidates
- **Candidate**: Trying to become leader

#### Leader Election

```
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, increments term, votes for self
3. Requests votes from other servers
4. Wins with majority votes → becomes Leader
5. Loses (another leader) → becomes Follower
6. Timeout → starts new election
```

**Safety**: Only candidates with up-to-date logs can win.
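
A minimal sketch of the candidate side of this loop, assuming a `request_vote(peer, term, last_log_index, last_log_term)` helper that performs the RPC and returns whether the vote was granted (all names are hypothetical; persistence, parallel RPCs, and term checks on responses are omitted):

```python
import random

# Hypothetical sketch of a Raft candidate starting an election.
def start_election(node, peers, request_vote):
    node.current_term += 1              # step 2: increment term
    node.voted_for = node.node_id       # ...and vote for self
    node.state = "candidate"
    votes = 1                           # own vote

    for peer in peers:                  # step 3: request votes
        if request_vote(peer,
                        term=node.current_term,
                        last_log_index=node.last_log_index,
                        last_log_term=node.last_log_term):
            votes += 1

    cluster_size = len(peers) + 1
    if votes > cluster_size // 2:       # step 4: strict majority wins
        node.state = "leader"
    else:                               # steps 5-6: back off and retry later
        node.election_timeout = random.uniform(0.150, 0.300)
```

Randomizing the retry timeout is what makes repeated split votes unlikely.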

#### Log Replication

```
1. Client sends command to Leader
2. Leader appends to local log
3. Leader sends AppendEntries to Followers
4. On majority acknowledgment: entry is committed
5. Leader applies to state machine, responds to client
6. Followers apply committed entries
```

#### Log Matching Property

If two logs contain an entry with the same index and term:
- Entries are identical
- All preceding entries are identical
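
This property is maintained by the consistency check a follower runs on every AppendEntries call. A minimal sketch under assumed field names (`prev_log_index` and `prev_log_term` describe the entry immediately preceding the new ones, 1-based; `log` is a list of `(term, command)` pairs):

```python
# Hypothetical follower-side AppendEntries consistency check (sketch).
def append_entries(log, prev_log_index, prev_log_term, entries):
    # Reject if the log has no entry at prev_log_index with a matching term.
    if prev_log_index > 0:
        if len(log) < prev_log_index or log[prev_log_index - 1][0] != prev_log_term:
            return False  # leader will retry with an earlier prev_log_index

    # Delete conflicting entries, then append the new ones.
    insert_at = prev_log_index
    for i, entry in enumerate(entries):
        idx = insert_at + i
        if idx < len(log) and log[idx][0] != entry[0]:
            del log[idx:]            # conflict: truncate the divergent suffix
        if idx >= len(log):
            log.append(entry)
    return True
```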

#### Term

Logical clock that increases with each election:
- Detects stale leaders
- Resolves conflicts
- Included in all messages

#### Comparison with Paxos

| Aspect | Paxos | Raft |
|--------|-------|------|
| Understandability | Complex | Designed for clarity |
| Leader | Optional (Multi-Paxos) | Required |
| Log gaps | Allowed | Not allowed |
| Membership changes | Complex | Joint consensus |
| Implementations | Many variants | Consistent |

### PBFT (Practical Byzantine Fault Tolerance)

Developed by Castro and Liskov (1999) to tolerate Byzantine faults.

#### Byzantine Faults

Nodes can behave arbitrarily:
- Crash
- Send incorrect messages
- Collude maliciously
- Act inconsistently to different nodes

#### Fault Tolerance

Tolerates f Byzantine faults with **3f+1** nodes.

**Why 3f+1?**
- A quorum can only wait for n - f replies, since f nodes may never respond
- Up to f of those replies may come from Byzantine nodes
- Honest replies must still outnumber the faulty ones: n - 2f > f, hence n ≥ 3f + 1

#### Protocol Phases

**Normal Operation** (leader is honest):

```
1. REQUEST: Client → Primary (leader)
2. PRE-PREPARE: Primary → All replicas
   - Primary assigns sequence number
3. PREPARE: Each replica → All replicas
   - Validates pre-prepare
4. COMMIT: Each replica → All replicas
   - After receiving 2f+1 prepares
5. REPLY: Each replica → Client
   - After receiving 2f+1 commits
```

**Client waits for f+1 matching replies**.
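
The quorum sizes follow directly from f; a tiny helper makes the arithmetic explicit (a sketch, not tied to any particular PBFT implementation):

```python
# Quorum arithmetic for PBFT-style BFT tolerating f Byzantine faults.
def pbft_quorums(f: int) -> dict:
    n = 3 * f + 1                            # minimum cluster size
    return {
        "nodes": n,
        "prepare_commit_quorum": 2 * f + 1,  # matching PREPAREs / COMMITs needed
        "client_reply_quorum": f + 1,        # matching replies the client waits for
    }

print(pbft_quorums(1))
# {'nodes': 4, 'prepare_commit_quorum': 3, 'client_reply_quorum': 2}
```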

#### View Change

When the primary appears faulty:
1. Replicas time out waiting for the primary
2. Broadcast VIEW-CHANGE with prepared certificates
3. New primary collects 2f+1 view-changes
4. Broadcasts NEW-VIEW with proof
5. System resumes with new primary

#### Message Complexity

- **Normal case**: O(n²) messages per request
- **View change**: O(n³) messages

**Scalability challenge**: Quadratic messaging limits cluster size.

#### Optimizations

- **Speculative execution**: Execute before commit
- **Batching**: Group multiple requests
- **MACs**: Use MACs instead of digital signatures
- **Threshold signatures**: Reduce signature overhead

### Modern BFT Variants

#### HotStuff (2019)

- Linear message complexity O(n)
- Used in LibraBFT (Diem), other blockchains
- Three-phase protocol with threshold signatures

#### Tendermint

- Blockchain-focused BFT
- Integrated with Cosmos SDK
- Immediate finality

#### QBFT (Quorum BFT)

- Enterprise-focused (ConsenSys/JPMorgan)
- Enhanced IBFT for Ethereum-based systems

### Nakamoto Consensus

The consensus mechanism powering Bitcoin, introduced by Satoshi Nakamoto (2008).

#### Core Innovation

Combines three elements:
1. **Proof-of-Work (PoW)**: Cryptographic puzzle for block creation
2. **Longest Chain Rule**: Fork resolution by accumulated work
3. **Probabilistic Finality**: Security increases with confirmations

#### How It Works

```
1. Transactions broadcast to network
2. Miners collect transactions into blocks
3. Miners race to solve PoW puzzle:
   - Find nonce such that Hash(block_header) < target
   - Difficulty adjusts to maintain ~10 min block time
4. First miner to solve broadcasts block
5. Other nodes verify and append to longest chain
6. Miner receives block reward + transaction fees
```
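
A toy illustration of the search in step 3, using double SHA-256 as Bitcoin does (header serialization, difficulty encoding, and networking are omitted; the target used here is arbitrary and deliberately easy):

```python
import hashlib

# Toy proof-of-work search: find a nonce so that H(H(header || nonce)) < target.
def mine(header: bytes, target: int, max_nonce: int = 2**32):
    for nonce in range(max_nonce):
        digest = hashlib.sha256(
            hashlib.sha256(header + nonce.to_bytes(4, "little")).digest()
        ).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce              # valid block found
    return None                       # exhausted; real miners change the header and retry

# Deliberately easy target so the toy search finishes in a few thousand tries.
print(mine(b"example-block-header", target=2**245))
```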

#### Longest Chain Rule

When forks occur:
```
Chain A: [genesis] → [1] → [2] → [3]
Chain B: [genesis] → [1] → [2'] → [3'] → [4']

Nodes follow Chain B (more accumulated work)
Chain A blocks become "orphaned"
```

**Note**: The rule is really "most accumulated work", not "most blocks": a chain with fewer but harder blocks wins.

#### Security Model

**Honest Majority Assumption**: Protocol secure if honest mining power > 50%.

Formal analysis (Ren 2019):
```
Safe if: g²α > β

Where:
α = honest mining rate
β = adversarial mining rate
g = growth rate accounting for network delay
Δ = maximum network delay
```

**Implications**:
- Larger block interval → more security margin
- Higher network delay → a larger honest majority is needed
- 10-minute block time provides safety margin for global network

#### Probabilistic Finality

There is no instant finality; deeper blocks are exponentially harder to reverse:

| Confirmations | Attack Probability (30% attacker) |
|---------------|-----------------------------------|
| 1 | ~50% |
| 3 | ~12% |
| 6 | ~0.2% |
| 12 | ~0.003% |

**Convention**: 6 confirmations (~1 hour) considered "final" for Bitcoin.

#### Attacks

**51% Attack**: Attacker with majority hashrate can:
- Double-spend transactions
- Prevent confirmations
- NOT: steal funds, change consensus rules, create invalid transactions

**Selfish Mining**: Strategic block withholding to waste honest miners' work.
- Profitable with < 50% hashrate under certain conditions
- Mitigated by network propagation improvements

**Long-Range Attacks**: Not applicable to PoW (unlike PoS).

#### Trade-offs vs Traditional BFT

| Aspect | Nakamoto | Classical BFT |
|--------|----------|---------------|
| Finality | Probabilistic | Immediate |
| Throughput | Low (~7 TPS) | Higher |
| Participants | Permissionless | Permissioned |
| Energy | High (PoW) | Low |
| Fault tolerance | < 50% hashrate | < 1/3 of nodes |
| Scalability | Global | Limited nodes |

### PnyxDB: Leaderless Democratic BFT

Developed by Bonniot, Neumann, and Taïani (2019) for consortia applications.

#### Key Innovation: Conditional Endorsements

Unlike leader-based BFT, PnyxDB uses **leaderless quorums** with conditional endorsements:
- Endorsements track conflicts between transactions
- If transactions commute (no conflicting operations), quorums are built independently
- Non-commuting transactions are handled via the Byzantine Veto Procedure (BVP)

#### Transaction Lifecycle

```
1. Client broadcasts transaction to endorsers
2. Endorsers evaluate against application-defined policies
3. If no conflicts: endorser sends acknowledgment
4. If conflicts detected: conditional endorsement specifying
   which transactions must NOT be committed for this one to be valid
5. Transaction commits when quorum of valid endorsements collected
6. BVP resolves conflicting transactions
```

#### Byzantine Veto Procedure (BVP)

Ensures termination even when transactions conflict:
- Transactions have deadlines
- Conflicting endorsements trigger a resolution loop
- The protocol guarantees the loop exits once the deadline passes
- At most f Byzantine nodes tolerated with n endorsers

#### Application-Level Voting

Unique feature: nodes can endorse or reject transactions based on **application-defined policies** without compromising consistency.

Use cases:
- Consortium governance decisions
- Policy-based access control
- Democratic decision making

#### Performance

Compared to BFT-SMaRt and Tendermint:
- **11x faster** commit latencies
- **< 5 seconds** in worldwide geo-distributed deployment
- Tested with **180 nodes**

#### Implementation

- Written in Go (requires Go 1.11+)
- Uses gossip broadcast for message propagation
- Web-of-trust node authentication
- Scales to hundreds/thousands of nodes

## Replication Strategies

### Single-Leader Replication

```
Clients → Leader → Followers
```

**Pros**: Simple, strong consistency possible
**Cons**: Leader bottleneck, failover complexity

#### Synchronous vs Asynchronous

| Type | Durability | Latency | Availability |
|------|------------|---------|--------------|
| Synchronous | Guaranteed | High | Lower |
| Asynchronous | At risk | Low | Higher |
| Semi-synchronous | Balanced | Medium | Medium |

### Multi-Leader Replication

Multiple nodes accept writes, replicate to each other.

**Use cases**:
- Multi-datacenter deployment
- Clients with offline operation

**Challenges**:
- Write conflicts
- Conflict resolution complexity

#### Conflict Resolution

- **Last-write-wins (LWW)**: Timestamp-based, may lose data
- **Application-specific**: Custom merge logic
- **CRDTs**: Mathematically guaranteed convergence

### Leaderless Replication

Any node can accept reads and writes.

**Examples**: Dynamo, Cassandra, Riak

#### Quorum Reads/Writes

```
n = total replicas
w = write quorum (nodes that must acknowledge write)
r = read quorum (nodes that must respond to read)

For strong consistency: w + r > n
```

**Common configurations**:
- n=3, w=2, r=2: Tolerates 1 failure
- n=5, w=3, r=3: Tolerates 2 failures
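
A small helper that checks whether a given (n, w, r) configuration yields overlapping quorums, and how many node failures each side can absorb (a sketch; names are illustrative):

```python
# Check strict-quorum properties of an (n, w, r) replication configuration.
def quorum_properties(n: int, w: int, r: int) -> dict:
    return {
        "read_write_overlap": w + r > n,     # every read quorum meets every write quorum
        "no_conflicting_writes": 2 * w > n,  # any two write quorums intersect
        "writes_survive_failures": n - w,    # failures tolerated while still writing
        "reads_survive_failures": n - r,     # failures tolerated while still reading
    }

print(quorum_properties(3, 2, 2))
# {'read_write_overlap': True, 'no_conflicting_writes': True,
#  'writes_survive_failures': 1, 'reads_survive_failures': 1}
```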

#### Sloppy Quorums and Hinted Handoff

During partitions:
- Write to available nodes (even if not home replicas)
- "Hints" stored for unavailable nodes
- Hints replayed when nodes recover

## Failure Modes

### Crash Failures

Node stops responding. Simplest failure model.

**Detection**: Heartbeats, timeouts
**Tolerance**: 2f+1 nodes for f failures (Paxos, Raft)

### Byzantine Failures

Arbitrary behavior, including malicious behavior.

**Detection**: Difficult without redundancy
**Tolerance**: 3f+1 nodes for f failures (PBFT)

### Network Partitions

Nodes can't communicate with some other nodes.

**Impact**: Forces CP vs AP choice
**Recovery**: Reconciliation after partition heals

### Split Brain

Multiple nodes believe they are the leader.

**Prevention**:
- Fencing (STONITH: Shoot The Other Node In The Head)
- Quorum-based leader election
- Lease-based leadership

## Design Patterns

### State Machine Replication

Replicate a deterministic state machine across nodes:
1. All replicas start in same state
2. Apply same commands in same order
3. All reach same final state

**Requires**: Total order broadcast (consensus)
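
A minimal sketch of the replica side: given the same committed command log, every replica derives the same state (the consensus layer that produces the ordered log is assumed, not shown; the key-value commands are illustrative):

```python
# Each replica applies the same totally ordered command log to a deterministic
# state machine, so all replicas converge to the same state.
def apply_log(commands):
    state = {}                               # the replicated key-value state
    for op, key, value in commands:          # commands arrive in consensus order
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None)]
# Any replica applying this log ends up with the same state.
assert apply_log(log) == {"y": 2}
```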

### Chain Replication

```
Head → Node2 → Node3 → ... → Tail
```

- Writes enter at head, propagate down chain
- Reads served by tail (strongly consistent)
- Simple, high throughput

### Primary-Backup

Primary handles all operations, synchronously replicates to backups.

**Failover**: Backup promoted to primary on failure.

### Quorum Systems

Intersecting sets ensure consistency:
- Any read quorum intersects any write quorum
- Guarantees reads see latest write

## Balancing Trade-offs

### Identifying Critical Requirements

1. **Correctness requirements**
   - Is data loss acceptable?
   - Can operations be reordered?
   - Are conflicts resolvable?

2. **Availability requirements**
   - What's acceptable downtime?
   - Geographic distribution needs?
   - Partition recovery strategy?

3. **Performance requirements**
   - Latency targets?
   - Throughput needs?
   - Consistency cost tolerance?

### Vulnerability Mitigation by Protocol

#### Paxos/Raft (Crash Fault Tolerant)

**Vulnerabilities**:
- Leader failure causes brief unavailability
- Split-brain without proper fencing
- Slow follower impacts commit latency (sync replication)

**Mitigations**:
- Fast leader election (pre-voting)
- Quorum-based fencing
- Flexible quorum configurations
- Learner nodes for read scaling

#### PBFT (Byzantine Fault Tolerant)

**Vulnerabilities**:
- O(n²) messages limit scalability
- View change is expensive
- Requires 3f+1 nodes (more infrastructure)

**Mitigations**:
- Batching and pipelining
- Optimistic execution (HotStuff)
- Threshold signatures
- Hierarchical consensus for scaling

### Choosing the Right Protocol

| Scenario | Recommended | Rationale |
|----------|-------------|-----------|
| Internal infrastructure | Raft | Simple, well-understood |
| High consistency needs | Raft/Paxos | Proven correctness |
| Public/untrusted network | PBFT variant | Byzantine tolerance |
| Blockchain | HotStuff/Tendermint | Linear complexity BFT |
| Eventually consistent | Dynamo-style | High availability |
| Global distribution | Multi-leader + CRDTs | Partition tolerance |

## Implementation Considerations

### Timeouts

- **Heartbeat interval**: 100-300ms typical
- **Election timeout**: ~10x the heartbeat interval, randomized to avoid split votes
- **Request timeout**: Application-dependent

### Persistence

What must be persisted before acknowledgment:
- **Raft**: Current term, voted-for, log entries
- **PBFT**: View number, prepared/committed certificates

### Membership Changes

Dynamic cluster membership:
- **Raft**: Joint consensus (old + new config)
- **Paxos**: α-reconfiguration
- **PBFT**: View change with new configuration

### Testing

- **Jepsen**: Distributed systems testing framework
- **Chaos engineering**: Intentional failure injection
- **Formal verification**: TLA+, Coq proofs

## Adversarial Oracle Protocols

Oracles bridge on-chain smart contracts with off-chain data, but introduce trust assumptions into trustless systems.

### The Oracle Problem

**Definition**: The security, authenticity, and trust conflict between third-party oracles and the trustless execution of smart contracts.

**Core Challenge**: Blockchains cannot verify correctness of external data. Oracles become:
- Single points of failure
- Targets for manipulation
- Trust assumptions in "trustless" systems

### Attack Vectors

#### Price Oracle Manipulation

**Flash Loan Attacks**:
```
1. Borrow large amount via flash loan (no collateral)
2. Manipulate price on DEX (large trade)
3. Oracle reads manipulated price
4. Smart contract executes with wrong price
5. Profit from arbitrage/liquidation
6. Repay flash loan in same transaction
```

**Notable Example**: Harvest Finance ($30M+ loss, 2020)

#### Data Source Attacks

- **Compromised API**: Single data source manipulation
- **Front-running**: Oracle updates exploited before they land on-chain
- **Liveness attacks**: Preventing oracle updates
- **Bribery**: Incentivizing oracle operators to lie

#### Economic Attacks

**Cost of Corruption Analysis**:
```
If oracle controls value V:
- Attack profit: V
- Attack cost: oracle stake + reputation
- Rational to attack if: profit > cost
```

**Implication**: Oracles must have stake > value they secure.

### Decentralized Oracle Networks (DONs)

#### Chainlink Model

**Multi-layer Security**:
```
1. Multiple independent data sources
2. Multiple independent node operators
3. Aggregation (median, weighted average)
4. Reputation system
5. Cryptoeconomic incentives (staking)
```

**Data Aggregation**:
```
Nodes: [Oracle₁: $100, Oracle₂: $101, Oracle₃: $150, Oracle₄: $100]
Median: $100.50
Outlier (Oracle₃) has minimal impact
```
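
A tiny sketch of median aggregation, showing why one manipulated report barely moves the aggregated value (illustrative only; oracle names are made up):

```python
import statistics

# Median aggregation of independent oracle reports: the manipulated
# report (150.0) has almost no effect on the result.
reports = {"oracle_1": 100.0, "oracle_2": 101.0, "oracle_3": 150.0, "oracle_4": 100.0}
aggregated = statistics.median(reports.values())
print(aggregated)  # 100.5
```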

#### Reputation and Staking

```
Node reputation based on:
- Historical accuracy
- Response time
- Uptime
- Stake amount

Job assignment weighted by reputation
Slashing for misbehavior
```

### Oracle Design Patterns

#### Time-Weighted Average Price (TWAP)

Resist single-block manipulation:
```
TWAP = Σ(price_i × duration_i) / total_duration

Example over 1 hour:
- 30 min at $100: 30 × 100 = 3000
- 20 min at $101: 20 × 101 = 2020
- 10 min at $150 (manipulation): 10 × 150 = 1500
TWAP = 6520 / 60 = $108.67 (vs $150 spot)
```
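
The same calculation as a small Python helper (a sketch; `samples` is a hypothetical list of (price, duration_in_minutes) pairs):

```python
# Time-weighted average price over a list of (price, duration) samples.
def twap(samples):
    total_duration = sum(duration for _, duration in samples)
    weighted = sum(price * duration for price, duration in samples)
    return weighted / total_duration

# The example above: a 10-minute manipulation to $150 moves TWAP far less than spot.
samples = [(100, 30), (101, 20), (150, 10)]
print(round(twap(samples), 2))  # 108.67
```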

#### Commit-Reveal Schemes

Prevents front-running of oracle updates:
```
Phase 1 (Commit):
- Oracle commits: hash(price || salt)
- Cannot be read by others

Phase 2 (Reveal):
- Oracle reveals: price, salt
- Contract verifies hash matches
- All oracles reveal simultaneously
```
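
A minimal off-chain illustration of the commit/verify steps using SHA-256 (a sketch; a real scheme would run the verification in the contract and bind the commitment to the oracle's address and round):

```python
import hashlib
import secrets

def commit(price: str, salt: bytes) -> str:
    """Phase 1: publish only the hash of price || salt."""
    return hashlib.sha256(price.encode() + salt).hexdigest()

def verify(commitment: str, price: str, salt: bytes) -> bool:
    """Phase 2: check the revealed price and salt against the commitment."""
    return commit(price, salt) == commitment

salt = secrets.token_bytes(16)
c = commit("100.50", salt)            # broadcast during the commit phase
assert verify(c, "100.50", salt)      # accepted at reveal
assert not verify(c, "150.00", salt)  # a different price fails verification
```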

#### Schelling Points

Game-theoretic oracle coordination:
```
1. Multiple oracles submit answers
2. Consensus answer determined
3. Oracles matching consensus rewarded
4. Outliers penalized

Assumption: Honest answer is "obvious" Schelling point
```

### Trusted Execution Environments (TEEs)

Hardware-based oracle security:
```
TEE (Intel SGX, ARM TrustZone):
- Isolated execution environment
- Code attestation
- Protected memory
- External data fetching inside enclave
```

**Benefits**:
- Verifiable computation
- Protected from host machine
- Cryptographic proofs of execution

**Limitations**:
- Hardware trust assumption
- Side-channel attacks possible
- Intel SGX vulnerabilities discovered

### Oracle Types by Data Source

| Type | Source | Trust Model | Use Case |
|------|--------|-------------|----------|
| Price feeds | Exchanges | Multiple sources | DeFi |
| Randomness | VRF/DRAND | Cryptographic | Gaming, NFTs |
| Event outcomes | Manual report | Reputation | Prediction markets |
| Cross-chain | Other blockchains | Bridge security | Interoperability |
| Computation | Off-chain compute | Verifiable | Complex logic |

### Defense Mechanisms

1. **Diversification**: Multiple independent oracles
2. **Economic security**: Stake > protected value
3. **Time delays**: Allow dispute periods
4. **Circuit breakers**: Pause on anomalous data
5. **TWAP**: Resist flash manipulation
6. **Commit-reveal**: Prevent front-running
7. **Reputation**: Long-term incentives

### Hybrid Approaches

**Optimistic Oracles**:
```
1. Oracle posts answer + bond
2. Dispute window (e.g., 2 hours)
3. If disputed: escalate to arbitration
4. If not disputed: answer accepted
5. Incorrect oracle loses bond
```

**Examples**: UMA Protocol, Optimistic Oracle

## Causality and Logical Clocks

Physical clocks cannot reliably order events in distributed systems due to clock drift and synchronization issues. Logical clocks provide ordering based on causality.

### The Happened-Before Relation

Defined by Leslie Lamport (1978):

Event a **happened-before** event b (a → b) if:
1. a and b are in the same process, and a comes before b
2. a is a send event and b is the corresponding receive
3. There exists c such that a → c and c → b (transitivity)

If neither a → b nor b → a, events are **concurrent** (a || b).

### Lamport Clocks

Simple scalar timestamps providing partial ordering.

**Rules**:
```
1. Each process maintains counter C
2. Before each event: C = C + 1
3. Send message m with timestamp C
4. On receive: C = max(C, message_timestamp) + 1
```
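
These rules fit in a few lines of Python (a sketch with hypothetical method names):

```python
# Minimal Lamport clock: a single counter per process.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Called before each local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, message_time: int) -> int:
        """Merge the sender's timestamp on receipt."""
        self.time = max(self.time, message_time) + 1
        return self.time
```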

**Properties**:
- If a → b, then C(a) < C(b)
- **Limitation**: C(a) < C(b) does NOT imply a → b
- Cannot detect concurrent events

**Use cases**:
- Total ordering with tie-breaker (process ID)
- Distributed snapshots
- Simple event ordering

### Vector Clocks

Array of counters, one per process. Captures full causality.

**Structure** (for n processes):
```
VC[1..n] where VC[i] is process i's logical time
```

**Rules** (at process i):
```
1. Before each event: VC[i] = VC[i] + 1
2. Send message with full vector VC
3. On receive from j:
   for k in 1..n:
       VC[k] = max(VC[k], received_VC[k])
   VC[i] = VC[i] + 1
```

**Comparison** (for vectors V1 and V2):
```
V1 = V2  iff ∀i: V1[i] = V2[i]
V1 ≤ V2  iff ∀i: V1[i] ≤ V2[i]
V1 < V2  iff V1 ≤ V2 and V1 ≠ V2
V1 || V2 iff NOT(V1 ≤ V2) and NOT(V2 ≤ V1)  # concurrent
```
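
A compact Python sketch of the update rules and the comparison, for a fixed process count (dynamic membership is exactly what ITC, below, addresses; names are illustrative):

```python
# Vector clock for a fixed set of n processes; `i` is this process's index.
class VectorClock:
    def __init__(self, n: int, i: int):
        self.vc = [0] * n
        self.i = i

    def tick(self):
        """Before each local event or send."""
        self.vc[self.i] += 1
        return list(self.vc)          # value to attach to outgoing messages

    def receive(self, other):
        """Merge an incoming vector, then count the receive event."""
        self.vc = [max(a, b) for a, b in zip(self.vc, other)]
        self.vc[self.i] += 1

def happened_before(v1, v2):
    return all(a <= b for a, b in zip(v1, v2)) and v1 != v2

def concurrent(v1, v2):
    return v1 != v2 and not happened_before(v1, v2) and not happened_before(v2, v1)
```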

**Properties**:
- a → b iff VC(a) < VC(b)
- a || b iff VC(a) || VC(b)
- **Full causality detection**

**Trade-off**: O(n) space per event, where n = number of processes.

### Interval Tree Clocks (ITC)

Developed by Almeida, Baquero, and Fonte (2008) for dynamic systems.

**Problem with Vector Clocks**:
- Static: size fixed to max number of processes
- ID retirement requires global coordination
- Unsuitable for high-churn systems (P2P)

**ITC Solution**:
- Binary tree structure for ID space
- Dynamic ID allocation and deallocation
- Localized fork/join operations

**Core Operations**:

```
fork(id): Split ID into two children
  - Parent retains left half
  - New process gets right half

join(id1, id2): Merge two IDs
  - Combine ID trees
  - Localized operation, no global coordination

event(id, stamp): Increment logical clock
peek(id, stamp): Read without increment
```

**ID Space Representation**:
```
      1          # Full ID space
     / \
    0   1        # After one fork
   / \
  0   1          # After another fork (left child)
```

**Stamp (Clock) Representation**:
- Tree structure mirrors ID space
- Each node has base value + optional children
- Efficient representation of sparse vectors

**Example**:
```
Initial:       id=(1), stamp=0
Fork:          id1=(1,0), stamp1=0
               id2=(0,1), stamp2=0
Event at id1:  stamp1=(0,(1,0))
Join id1+id2:  id=(1), stamp=max of both
```

**Advantages over Vector Clocks**:
- Constant-size representation possible
- Dynamic membership without global state
- Efficient ID garbage collection
- Causality preserved across reconfigurations

**Use cases**:
- Peer-to-peer systems
- Mobile/ad-hoc networks
- Systems with frequent node join/leave

### Version Vectors

Specialization of vector clocks for tracking data versions.

**Difference from Vector Clocks**:
- Vector clocks: track all events
- Version vectors: track data updates only

**Usage in Dynamo-style systems**:
```
Client reads with version vector V1
Client writes with version vector V2
Server compares:
- If V1 < current: stale read, conflict possible
- If V1 = current: safe update
- If V1 || current: concurrent writes, need resolution
```

### Hybrid Logical Clocks (HLC)

Combines physical and logical time.

**Structure**:
```
HLC = (physical_time, logical_counter)
```

**Rules**:
```
1. On local/send event:
   pt = physical_clock()
   if pt > l:
       l = pt
       c = 0
   else:
       c = c + 1
   return (l, c)

2. On receive with timestamp (l', c'):
   pt = physical_clock()
   if pt > l and pt > l':
       l = pt
       c = 0
   elif l' > l:
       l = l'
       c = c' + 1
   elif l > l':
       c = c + 1
   else:  # l = l'
       c = max(c, c') + 1
   return (l, c)
```
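
The same rules as a compact Python class (a sketch; `physical_clock` is injected so the behavior is testable, and the names are illustrative):

```python
import time

# Hybrid Logical Clock: (l, c) where l tracks physical time and c breaks ties.
class HLC:
    def __init__(self, physical_clock=time.time):
        self.physical_clock = physical_clock
        self.l = 0.0
        self.c = 0

    def now(self):
        """Local or send event."""
        pt = self.physical_clock()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, remote):
        """Merge a remote (l', c') timestamp."""
        pt = self.physical_clock()
        rl, rc = remote
        if pt > self.l and pt > rl:
            self.l, self.c = pt, 0
        elif rl > self.l:
            self.l, self.c = rl, rc + 1
        elif self.l > rl:
            self.c += 1
        else:                      # equal l components
            self.c = max(self.c, rc) + 1
        return (self.l, self.c)
```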

**Properties**:
- Bounded drift from physical time
- Captures causality like Lamport clocks
- Timestamps comparable to wall-clock time
- Used in CockroachDB (Google Spanner achieves similar goals with hardware-synchronized TrueTime)

### Comparison of Logical Clocks

| Clock Type | Space | Causality | Concurrency | Dynamic |
|------------|-------|-----------|-------------|---------|
| Lamport | O(1) | Partial | No | Yes |
| Vector | O(n) | Full | Yes | No |
| ITC | O(log n)* | Full | Yes | Yes |
| HLC | O(1) | Partial | No | Yes |

*ITC space varies based on tree structure

### Practical Applications

**Conflict Detection** (Vector Clocks):
```
if V1 < V2:
    # v1 is ancestor of v2, no conflict
elif V1 > V2:
    # v2 is ancestor of v1, no conflict
else:  # V1 || V2
    # Concurrent updates, need conflict resolution
```

**Causal Broadcast**:
```
Deliver message m with VC only when:
1. VC[sender] = local_VC[sender] + 1  (next expected from sender)
2. ∀j ≠ sender: VC[j] ≤ local_VC[j]   (all causal deps satisfied)
```
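
A direct translation of the delivery condition (a sketch; `local_vc` is the receiver's vector, `msg_vc` the vector attached to the message, `sender` the sending process's index):

```python
# Causal broadcast delivery check: can a message from `sender`, stamped msg_vc,
# be delivered given the receiver's current vector local_vc?
def can_deliver(local_vc, msg_vc, sender):
    if msg_vc[sender] != local_vc[sender] + 1:
        return False                      # not the next message expected from sender
    return all(
        msg_vc[j] <= local_vc[j]
        for j in range(len(local_vc))
        if j != sender
    )                                     # all other causal dependencies satisfied

# Example: receiver has seen nothing; process 0 sends its first message.
assert can_deliver([0, 0], [1, 0], sender=0)
assert not can_deliver([0, 0], [1, 1], sender=0)  # depends on an unseen message from 1
```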

**Snapshot Algorithms**:
```
Consistent cut: set of events S where
   if e ∈ S and f → e, then f ∈ S
Vector clocks make this efficiently verifiable
```

## References

For detailed protocol specifications and proofs, see:
- `references/consensus-protocols.md` - Detailed protocol descriptions
- `references/consistency-models.md` - Formal consistency definitions
- `references/failure-scenarios.md` - Failure mode analysis
- `references/logical-clocks.md` - Clock algorithms and implementations
|