# Benchmark CPU Usage Optimization

This document describes the CPU optimization settings for the ORLY benchmark suite, specifically tuned for systems with limited CPU resources (6-core/12-thread and lower).

## Problem Statement
The original benchmark implementation was designed for maximum throughput testing, which caused:
- CPU saturation: 95-100% sustained CPU usage across all cores
- System instability: Other services unable to run alongside benchmarks
- Thermal throttling: Long benchmark runs causing CPU frequency reduction
- Unrealistic load: Tight loops not representative of real-world relay usage
## Solution: Aggressive Rate Limiting

The benchmark now implements multi-layered CPU usage controls:

### 1. Reduced Worker Concurrency

Default worker count: `NumCPU() / 4` (minimum 2); see the sketch at the end of this subsection.

For a 6-core/12-thread system:

- Previous: 12 workers
- Current: 3 workers

This 4x reduction dramatically lowers:

- Goroutine context-switching overhead
- Lock contention on shared resources
- CPU cache thrashing
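
A minimal sketch of how this default can be derived from `runtime.NumCPU()` (illustrative only; the benchmark's actual flag handling may differ):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultWorkers returns NumCPU()/4, but never fewer than 2 workers.
func defaultWorkers() int {
	w := runtime.NumCPU() / 4
	if w < 2 {
		w = 2
	}
	return w
}

func main() {
	// On a 6-core/12-thread machine, NumCPU() reports 12, so this prints 3.
	fmt.Println("workers:", defaultWorkers())
}
```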
### 2. Per-Operation Delays

All benchmark operations now include mandatory delays to prevent CPU saturation:

| Operation Type | Delay | Rationale |
|---|---|---|
| Event writes | 500µs | Simulates network latency and client pacing |
| Queries | 1ms | Queries are CPU-intensive, need more spacing |
| Concurrent writes | 500µs | Balanced for mixed workloads |
| Burst writes | 500µs | Prevents CPU spikes during bursts |
### 3. Implementation Locations

#### Main Benchmark (Badger backend)

Peak Throughput Test (`main.go:471-473`):

```go
const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // After each event save
```

Burst Pattern Test (`main.go:599-600`):

```go
const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // In worker loop
```

Query Test (`main.go:899`):

```go
time.Sleep(1 * time.Millisecond) // After each query
```

Concurrent Query/Store (`main.go:900, 1068`):

```go
time.Sleep(1 * time.Millisecond)   // Readers
time.Sleep(500 * time.Microsecond) // Writers
```

#### BenchmarkAdapter (DGraph/Neo4j backends)

Peak Throughput (`benchmark_adapter.go:58`):

```go
const eventDelay = 500 * time.Microsecond
```

Burst Pattern (`benchmark_adapter.go:142`):

```go
const eventDelay = 500 * time.Microsecond
```
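
Taken together, these snippets follow one pattern: a small pool of workers, each sleeping after every operation. The sketch below is an illustrative reconstruction of that pattern, not the benchmark's actual code; `Event`, `saveEvent`, and `runThrottledWrites` are hypothetical names.

```go
package main

import (
	"sync"
	"time"
)

// Event is a stand-in for the benchmark's event type.
type Event struct{ ID string }

// runThrottledWrites fans events out to a small worker pool and sleeps
// after every save, so each worker is capped at roughly 2,000 ops/s.
func runThrottledWrites(workers int, events []Event, saveEvent func(Event) error) {
	const eventDelay = 500 * time.Microsecond

	jobs := make(chan Event)
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range jobs {
				_ = saveEvent(ev)      // write to the backend
				time.Sleep(eventDelay) // mandatory per-operation delay
			}
		}()
	}

	for _, ev := range events {
		jobs <- ev
	}
	close(jobs)
	wg.Wait()
}
```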
## Expected CPU Usage

### Before Optimization

- Workers: 12 (on a 12-thread system)
- Delays: none or minimal
- CPU usage: 95-100% sustained
- System impact: severe; other processes starved

### After Optimization

- Workers: 3 (on a 12-thread system)
- Delays: 500µs-1ms per operation
- Expected CPU usage: 40-60% average, 70% peak
- System impact: minimal; plenty of headroom for other processes
## Performance Impact

### Throughput Reduction

The aggressive rate limiting will reduce benchmark throughput:

Before (unrealistic, CPU-bound):

- ~50,000 events/second with 12 workers

After (realistic, rate-limited):

- ~5,000-10,000 events/second with 3 workers (see the quick calculation below)
- More representative of real-world relay load
- Network latency and client pacing simulated
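
As a rough sanity check, assuming the delay applies to every event on every worker, the sleeps alone bound write throughput at roughly `workers / delay`, before any backend latency is counted. A tiny illustrative calculation:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	workers := 3
	delay := 500 * time.Microsecond

	// Each worker completes at most one event per sleep interval,
	// so the sleep-imposed ceiling is workers / delay.
	ceiling := float64(workers) / delay.Seconds()
	fmt.Printf("sleep-imposed ceiling: %.0f events/second\n", ceiling) // 6000
}
```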
### Latency Accuracy

Improved: with lower CPU contention, latency measurements are more accurate:

- Less queueing delay in database operations
- More consistent response times
- Better P95/P99 metric reliability
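
For context, P95/P99 are percentiles over the recorded per-operation latencies. A minimal sketch of a nearest-rank percentile calculation (the benchmark's own metric code may differ):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the latency samples
// using the nearest-rank method on a sorted copy.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	rank := int(math.Ceil(float64(len(sorted))*p/100.0)) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	samples := []time.Duration{
		800 * time.Microsecond, 900 * time.Microsecond,
		1 * time.Millisecond, 2 * time.Millisecond, 5 * time.Millisecond,
	}
	fmt.Println("P95:", percentile(samples, 95), "P99:", percentile(samples, 99))
}
```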
## Tuning Guide

If you need to adjust CPU usage further:

### Further Reduce CPU (< 40%)

- Reduce workers:

  ```bash
  ./benchmark --workers 2   # Half of default
  ```

- Increase delays in code:

  ```go
  // Change from 500µs to 1ms for writes
  const eventDelay = 1 * time.Millisecond

  // Change from 1ms to 2ms for queries
  time.Sleep(2 * time.Millisecond)
  ```

- Reduce event count:

  ```bash
  ./benchmark --events 5000   # Shorter test runs
  ```

### Increase CPU (for faster testing)

- Increase workers:

  ```bash
  ./benchmark --workers 6   # More concurrency
  ```

- Decrease delays in code:

  ```go
  // Change from 500µs to 100µs
  const eventDelay = 100 * time.Microsecond

  // Change from 1ms to 500µs
  time.Sleep(500 * time.Microsecond)
  ```
## Monitoring CPU Usage

### Real-time Monitoring

```bash
# Terminal 1: Run benchmark
cd cmd/benchmark
./benchmark --workers 3 --events 10000

# Terminal 2: Monitor CPU
watch -n 1 'ps aux | grep benchmark | grep -v grep | awk "{print \$3\" %CPU\"}"'
```
### With htop (recommended)

```bash
# Install htop if needed
sudo apt install htop

# Run htop showing only the benchmark process(es); pgrep -d, joins PIDs with commas
htop -p "$(pgrep -d, -f benchmark)"
```
### System-wide CPU Usage

```bash
# Check overall system load
mpstat 1

# Or with sar
sar -u 1
```
## Docker Compose Considerations

When running the full benchmark suite in Docker Compose:

### Resource Limits

The compose file should limit CPU allocation:

```yaml
services:
  benchmark-runner:
    deploy:
      resources:
        limits:
          cpus: '4'   # Limit to 4 CPU cores
```
### Sequential vs Parallel

The current implementation runs benchmarks sequentially to avoid overwhelming the system. Each relay is tested one at a time (see the sketch after the list below), ensuring:
- Consistent baseline for comparisons
- No CPU competition between tests
- Reliable latency measurements
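
A minimal sketch of this sequential flow (backend names match the sections above; `runBenchmark` is a hypothetical placeholder for running one relay's full test):

```go
package main

import (
	"fmt"
	"time"
)

// runBenchmark stands in for executing the full test suite against one relay.
func runBenchmark(backend string) {
	fmt.Println("benchmarking", backend)
	time.Sleep(100 * time.Millisecond) // placeholder for the real run
}

func main() {
	// Relays are tested one at a time, never concurrently, so each run
	// gets the whole CPU budget and a comparable baseline.
	for _, backend := range []string{"badger", "dgraph", "neo4j"} {
		runBenchmark(backend)
	}
}
```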
## Best Practices

- Always monitor CPU during the first run to verify the settings work for your system
- Close other applications during benchmarking for consistent results
- Use consistent worker counts across test runs for fair comparisons
- Document your settings if you modify delay constants
- Test with small event counts first (--events 1000) to verify CPU usage
## Realistic Workload Simulation

The delays aren't just for CPU management; they also simulate real-world conditions:
- 500µs write delay: Typical network round-trip time for local clients
- 1ms query delay: Client thinking time between queries
- 3 workers: Simulates 3 concurrent users/clients
- Burst patterns: model social media posting patterns (busy hours vs. quiet periods; sketched below)
This makes benchmark results more applicable to production relay deployment planning.
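
For intuition, a burst workload of this kind can be modeled as alternating busy and quiet phases. The sketch below is illustrative only and is not the benchmark's actual burst implementation:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		eventDelay  = 500 * time.Microsecond // pacing inside a burst
		burstSize   = 200                    // events per busy phase
		quietPeriod = 2 * time.Second        // idle gap between bursts
		bursts      = 3
	)

	for b := 0; b < bursts; b++ {
		// Busy phase: a burst of paced writes.
		for i := 0; i < burstSize; i++ {
			// saveEvent(...) would go here in the real benchmark.
			time.Sleep(eventDelay)
		}
		fmt.Printf("burst %d done, idling for %s\n", b+1, quietPeriod)
		// Quiet phase: no work, CPU drops back to idle.
		time.Sleep(quietPeriod)
	}
}
```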
## System Requirements

### Minimum

- 4 logical CPUs (2 physical cores with hyperthreading)
- 8GB RAM
- SSD storage for the database

### Recommended

- 6+ CPU cores
- 16GB RAM
- NVMe SSD

### For Full Suite (Docker Compose)

- 8+ CPU cores (allows multiple relays plus the benchmark runner)
- 32GB RAM (Neo4j and DGraph are memory-hungry)
- Fast SSD with 100GB+ free space
## Conclusion
These aggressive CPU optimizations ensure the benchmark suite:
- ✅ Runs reliably on modest hardware
- ✅ Doesn't interfere with other system processes
- ✅ Produces realistic, production-relevant metrics
- ✅ Completes without thermal throttling
- ✅ Allows fair comparison across different relay implementations
The trade-off is longer test duration, but the results are far more valuable for actual relay deployment planning.