
Benchmark CPU Usage Optimization

This document describes the CPU optimization settings for the ORLY benchmark suite, specifically tuned for systems with limited CPU resources (6-core/12-thread and lower).

Problem Statement

The original benchmark implementation was designed for maximum throughput testing, which caused:

  • CPU saturation: 95-100% sustained CPU usage across all cores
  • System instability: Other services unable to run alongside benchmarks
  • Thermal throttling: Long benchmark runs causing CPU frequency reduction
  • Unrealistic load: Tight loops not representative of real-world relay usage

Solution: Aggressive Rate Limiting

The benchmark now implements multi-layered CPU usage controls:

1. Reduced Worker Concurrency

Default Worker Count: NumCPU() / 4 (minimum 2)

For a 6-core/12-thread system:

  • Previous: 12 workers
  • Current: 3 workers

This 4x reduction dramatically lowers:

  • Goroutine context switching overhead
  • Lock contention on shared resources
  • CPU cache thrashing
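
A minimal sketch of that NumCPU()/4 default (illustrative; not the exact code in main.go):

package main

import (
    "fmt"
    "runtime"
)

// defaultWorkers implements the documented default: a quarter of the
// logical CPUs, but never fewer than two.
func defaultWorkers() int {
    w := runtime.NumCPU() / 4
    if w < 2 {
        w = 2
    }
    return w
}

func main() {
    fmt.Println("workers:", defaultWorkers()) // prints 3 on a 12-thread system
}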

2. Per-Operation Delays

All benchmark operations now include mandatory delays to prevent CPU saturation:

Operation Type      Delay   Rationale
Event writes        500µs   Simulates network latency and client pacing
Queries             1ms     Queries are CPU-intensive and need more spacing
Concurrent writes   500µs   Balanced for mixed workloads
Burst writes        500µs   Prevents CPU spikes during bursts
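
In context, each delay sits at the end of the worker's loop body. A sketch of a paced write worker follows; the events channel and saveEvent function are placeholders, not the benchmark's actual API:

import (
    "log"
    "time"
)

const eventDelay = 500 * time.Microsecond

// writeWorker drains the events channel, pausing after every save so a
// single worker cannot saturate a core.
func writeWorker(events <-chan []byte, saveEvent func([]byte) error) {
    for ev := range events {
        if err := saveEvent(ev); err != nil {
            log.Printf("save failed: %v", err)
            continue
        }
        time.Sleep(eventDelay)
    }
}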

3. Implementation Locations

Main Benchmark (Badger backend)

Peak Throughput Test (main.go:471-473):

const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // After each event save

Burst Pattern Test (main.go:599-600):

const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // In worker loop

Query Test (main.go:899):

time.Sleep(1 * time.Millisecond) // After each query

Concurrent Query/Store (main.go:900, 1068):

time.Sleep(1 * time.Millisecond)  // Readers
time.Sleep(500 * time.Microsecond) // Writers

BenchmarkAdapter (DGraph/Neo4j backends)

Peak Throughput (benchmark_adapter.go:58):

const eventDelay = 500 * time.Microsecond

Burst Pattern (benchmark_adapter.go:142):

const eventDelay = 500 * time.Microsecond

Expected CPU Usage

Before Optimization

  • Workers: 12 (on 12-thread system)
  • Delays: None or minimal
  • CPU Usage: 95-100% sustained
  • System Impact: Severe - other processes starved

After Optimization

  • Workers: 3 (on 12-thread system)
  • Delays: 500µs-1ms per operation
  • Expected CPU Usage: 40-60% average, 70% peak
  • System Impact: Minimal - plenty of headroom for other processes

Performance Impact

Throughput Reduction

The aggressive rate limiting will reduce benchmark throughput:

Before (unrealistic, CPU-bound):

  • ~50,000 events/second with 12 workers

After (realistic, rate-limited):

  • ~5,000-10,000 events/second with 3 workers
  • More representative of real-world relay load
  • Network latency and client pacing simulated

For intuition: the 500µs post-write pause alone caps each worker at about 2,000 events/second (1 / 500µs), before any database time is added.

Latency Accuracy

Improved: With lower CPU contention, latency measurements are more accurate:

  • Less queueing delay in database operations
  • More consistent response times
  • Better P95/P99 metric reliability
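
For reference, P95/P99 values come from sorting the recorded latencies and indexing into the slice. A sketch using the nearest-rank method (not the benchmark's actual metrics code):

import (
    "sort"
    "time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples,
// computed on a sorted copy so the caller's slice is left untouched.
func percentile(samples []time.Duration, p float64) time.Duration {
    if len(samples) == 0 {
        return 0
    }
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    return sorted[int(float64(len(sorted)-1)*p/100)]
}

// p95 := percentile(latencies, 95)
// p99 := percentile(latencies, 99)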

Tuning Guide

If you need to adjust CPU usage further:

Further Reduce CPU (< 40%)

  1. Reduce workers:

    ./benchmark --workers 2  # Below the default (3 on a 12-thread system)
    
  2. Increase delays in code:

    // Change from 500µs to 1ms for writes
    const eventDelay = 1 * time.Millisecond
    
    // Change from 1ms to 2ms for queries
    time.Sleep(2 * time.Millisecond)
    
  3. Reduce event count:

    ./benchmark --events 5000  # Shorter test runs
    

Increase CPU (for faster testing)

  1. Increase workers:

    ./benchmark --workers 6  # More concurrency
    
  2. Decrease delays in code:

    // Change from 500µs to 100µs
    const eventDelay = 100 * time.Microsecond
    
    // Change from 1ms to 500µs
    time.Sleep(500 * time.Microsecond)
    

Monitoring CPU Usage

Real-time Monitoring

# Terminal 1: Run benchmark
cd cmd/benchmark
./benchmark --workers 3 --events 10000

# Terminal 2: Monitor CPU
watch -n 1 'ps aux | grep benchmark | grep -v grep | awk "{print \$3\" %CPU\"}"'
# Or use htop (install if needed: sudo apt install htop),
# filtered to the benchmark process:
htop -p "$(pgrep -d, -f benchmark)"

System-wide CPU Usage

# Check overall system load
mpstat 1

# Or with sar
sar -u 1

Docker Compose Considerations

When running the full benchmark suite in Docker Compose:

Resource Limits

The compose file should limit CPU allocation:

services:
  benchmark-runner:
    deploy:
      resources:
        limits:
          cpus: '4'  # Limit to 4 CPU cores

Sequential vs Parallel

The current implementation runs benchmarks sequentially to avoid overwhelming the system; each relay is tested one at a time (see the sketch after this list), ensuring:

  • Consistent baseline for comparisons
  • No CPU competition between tests
  • Reliable latency measurements
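
In outline (runBenchmark is a hypothetical stand-in for the suite's per-relay run):

// One relay at a time; no parallel fan-out across backends.
for _, relay := range []string{"badger", "dgraph", "neo4j"} {
    if err := runBenchmark(relay); err != nil {
        log.Printf("%s: benchmark failed: %v", relay, err)
    }
}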

Best Practices

  1. Always monitor CPU during first run to verify settings work for your system
  2. Close other applications during benchmarking for consistent results
  3. Use consistent worker counts across test runs for fair comparisons
  4. Document your settings if you modify delay constants
  5. Test with small event counts first (--events 1000) to verify CPU usage

Realistic Workload Simulation

The delays aren't just for CPU management; they also simulate real-world conditions:

  • 500µs write delay: Typical network round-trip time for local clients
  • 1ms query delay: Client thinking time between queries
  • 3 workers: Simulates 3 concurrent users/clients
  • Burst patterns: Models social media posting patterns (busy hours vs quiet periods)

This makes benchmark results more applicable to production relay deployment planning.
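
As an illustration of the burst model (the durations and the writeOne helper are hypothetical):

import "time"

// burstLoop alternates a busy period, still paced at 500µs per write,
// with a quiet period, mimicking peak and off-peak client activity.
func burstLoop(writeOne func(), busy, quiet time.Duration) {
    for {
        deadline := time.Now().Add(busy)
        for time.Now().Before(deadline) {
            writeOne()
            time.Sleep(500 * time.Microsecond)
        }
        time.Sleep(quiet)
    }
}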

System Requirements

Minimum

  • 4 logical CPUs (2 physical cores with hyperthreading)
  • 8GB RAM
  • SSD storage for database

Recommended

  • 6+ CPU cores
  • 16GB RAM
  • NVMe SSD

For Full Suite (Docker Compose)

  • 8+ CPU cores (allows multiple relays + benchmark runner)
  • 32GB RAM (Neo4j, DGraph are memory-hungry)
  • Fast SSD with 100GB+ free space

Conclusion

These aggressive CPU optimizations ensure the benchmark suite:

  • Runs reliably on modest hardware
  • Doesn't interfere with other system processes
  • Produces realistic, production-relevant metrics
  • Completes without thermal throttling
  • Allows fair comparison across different relay implementations

The trade-off is longer test duration, but the results are far more valuable for actual relay deployment planning.