# Benchmark CPU Usage Optimization

This document describes the CPU optimization settings for the ORLY benchmark suite, specifically tuned for systems with limited CPU resources (6-core/12-thread and lower).

## Problem Statement

The original benchmark implementation was designed for maximum throughput testing, which caused:

- **CPU saturation**: 95-100% sustained CPU usage across all cores
- **System instability**: Other services unable to run alongside benchmarks
- **Thermal throttling**: Long benchmark runs causing CPU frequency reduction
- **Unrealistic load**: Tight loops not representative of real-world relay usage

## Solution: Aggressive Rate Limiting

The benchmark now implements multi-layered CPU usage controls:

### 1. Reduced Worker Concurrency

**Default Worker Count**: `NumCPU() / 4` (minimum 2)

For a 6-core/12-thread system:

- Previous: 12 workers
- **Current: 3 workers**

This 4x reduction dramatically lowers:

- Goroutine context switching overhead
- Lock contention on shared resources
- CPU cache thrashing

### 2. Per-Operation Delays

All benchmark operations now include mandatory delays to prevent CPU saturation; the sketch after the table shows how these delays combine with the reduced worker pool:

| Operation Type | Delay | Rationale |
|----------------|-------|-----------|
| Event writes | 500µs | Simulates network latency and client pacing |
| Queries | 1ms | Queries are CPU-intensive and need more spacing |
| Concurrent writes | 500µs | Balanced for mixed workloads |
| Burst writes | 500µs | Prevents CPU spikes during bursts |
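To make the interaction of these two controls concrete, here is a minimal runnable sketch of a reduced worker pool paced by the write delay from the table above. The `defaultWorkers` helper and the integer stand-in for events are illustrative assumptions, not the actual code in main.go.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// defaultWorkers mirrors the documented default: NumCPU() / 4, minimum 2.
func defaultWorkers() int {
	n := runtime.NumCPU() / 4
	if n < 2 {
		n = 2
	}
	return n
}

func main() {
	const eventDelay = 500 * time.Microsecond // write-path delay from the table above

	events := make(chan int)
	var wg sync.WaitGroup

	workers := defaultWorkers()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ev := range events {
				_ = ev                 // stand-in for the real event save
				time.Sleep(eventDelay) // per-operation pacing keeps the loop off the CPU
			}
		}()
	}

	for i := 0; i < 100; i++ {
		events <- i
	}
	close(events)
	wg.Wait()
	fmt.Printf("processed 100 events with %d workers\n", workers)
}
```

Because each worker sleeps after every operation, no single goroutine can spin at 100%, and total CPU usage scales with the worker count rather than with loop speed.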
### 3. Implementation Locations

#### Main Benchmark (Badger backend)

**Peak Throughput Test** ([main.go:471-473](main.go#L471-L473)):

```go
const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // After each event save
```

**Burst Pattern Test** ([main.go:599-600](main.go#L599-L600)):

```go
const eventDelay = 500 * time.Microsecond
time.Sleep(eventDelay) // In worker loop
```

**Query Test** ([main.go:899](main.go#L899)):

```go
time.Sleep(1 * time.Millisecond) // After each query
```

**Concurrent Query/Store** ([main.go:900, 1068](main.go#L900)):

```go
time.Sleep(1 * time.Millisecond)   // Readers
time.Sleep(500 * time.Microsecond) // Writers
```

#### BenchmarkAdapter (DGraph/Neo4j backends)

**Peak Throughput** ([benchmark_adapter.go:58](benchmark_adapter.go#L58)):

```go
const eventDelay = 500 * time.Microsecond
```

**Burst Pattern** ([benchmark_adapter.go:142](benchmark_adapter.go#L142)):

```go
const eventDelay = 500 * time.Microsecond
```

## Expected CPU Usage

### Before Optimization

- **Workers**: 12 (on a 12-thread system)
- **Delays**: None or minimal
- **CPU Usage**: 95-100% sustained
- **System Impact**: Severe; other processes starved

### After Optimization

- **Workers**: 3 (on a 12-thread system)
- **Delays**: 500µs-1ms per operation
- **Expected CPU Usage**: 40-60% average, 70% peak
- **System Impact**: Minimal; plenty of headroom for other processes

## Performance Impact

### Throughput Reduction

The aggressive rate limiting reduces benchmark throughput:

**Before** (unrealistic, CPU-bound):

- ~50,000 events/second with 12 workers

**After** (realistic, rate-limited):

- ~5,000-10,000 events/second with 3 workers
- More representative of real-world relay load
- Network latency and client pacing simulated

### Latency Accuracy

**Improved**: With lower CPU contention, latency measurements are more accurate:

- Less queueing delay in database operations
- More consistent response times
- Better P95/P99 metric reliability

## Tuning Guide

If you need to adjust CPU usage further:

### Further Reduce CPU (< 40%)

1. **Reduce workers**:

   ```bash
   ./benchmark --workers 2  # Half of default
   ```

2. **Increase delays** in code:

   ```go
   // Change from 500µs to 1ms for writes
   const eventDelay = 1 * time.Millisecond

   // Change from 1ms to 2ms for queries
   time.Sleep(2 * time.Millisecond)
   ```

3. **Reduce event count**:

   ```bash
   ./benchmark --events 5000  # Shorter test runs
   ```

### Increase CPU (for faster testing)

1. **Increase workers**:

   ```bash
   ./benchmark --workers 6  # More concurrency
   ```

2. **Decrease delays** in code:

   ```go
   // Change from 500µs to 100µs
   const eventDelay = 100 * time.Microsecond

   // Change from 1ms to 500µs
   time.Sleep(500 * time.Microsecond)
   ```

## Monitoring CPU Usage

### Real-time Monitoring

```bash
# Terminal 1: Run benchmark
cd cmd/benchmark
./benchmark --workers 3 --events 10000

# Terminal 2: Monitor CPU
watch -n 1 'ps aux | grep benchmark | grep -v grep | awk "{print \$3\" %CPU\"}"'
```

### With htop (recommended)

```bash
# Install htop if needed
sudo apt install htop

# Run htop and filter for the benchmark process
htop -p $(pgrep -f benchmark)
```

### System-wide CPU Usage

```bash
# Check overall system load
mpstat 1

# Or with sar
sar -u 1
```

## Docker Compose Considerations

When running the full benchmark suite in Docker Compose:

### Resource Limits

The compose file should limit CPU allocation:

```yaml
services:
  benchmark-runner:
    deploy:
      resources:
        limits:
          cpus: '4'  # Limit to 4 CPU cores
```

### Sequential vs Parallel

The current implementation runs benchmarks **sequentially** to avoid overwhelming the system. Each relay is tested one at a time, ensuring:

- Consistent baseline for comparisons
- No CPU competition between tests
- Reliable latency measurements

## Best Practices

1. **Always monitor CPU during the first run** to verify the settings work for your system
2. **Close other applications** during benchmarking for consistent results
3. **Use consistent worker counts** across test runs for fair comparisons
4. **Document your settings** if you modify delay constants
5. **Test with small event counts first** (`--events 1000`) to verify CPU usage

## Realistic Workload Simulation

The delays aren't just for CPU management; they simulate real-world conditions:

- **500µs write delay**: Typical network round-trip time for local clients
- **1ms query delay**: Client thinking time between queries
- **3 workers**: Simulates 3 concurrent users/clients
- **Burst patterns**: Models social media posting patterns (busy hours vs quiet periods)

This makes benchmark results more applicable to production relay deployment planning.

## System Requirements

### Minimum

- 4 CPU cores (2 physical cores with hyperthreading)
- 8GB RAM
- SSD storage for the database

### Recommended

- 6+ CPU cores
- 16GB RAM
- NVMe SSD

### For Full Suite (Docker Compose)

- 8+ CPU cores (allows multiple relays plus the benchmark runner)
- 32GB RAM (Neo4j and DGraph are memory-hungry)
- Fast SSD with 100GB+ free space

## Conclusion

These aggressive CPU optimizations ensure the benchmark suite:

- ✅ Runs reliably on modest hardware
- ✅ Doesn't interfere with other system processes
- ✅ Produces realistic, production-relevant metrics
- ✅ Completes without thermal throttling
- ✅ Allows fair comparison across different relay implementations

The trade-off is longer test duration, but the results are far more valuable for actual relay deployment planning.
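As a closing illustration of the burst-pattern pacing described under Realistic Workload Simulation, here is a minimal sketch of a busy/quiet write cadence. The `burstSize` and `quietPeriod` values are illustrative assumptions, not constants from main.go; only the 500µs per-event delay comes from the tables above.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		eventDelay  = 500 * time.Microsecond // per-event pacing inside a burst (documented delay)
		quietPeriod = 200 * time.Millisecond // assumption: idle gap between bursts
		burstSize   = 50                     // assumption: events per burst
		bursts      = 3
	)

	sent := 0
	for b := 0; b < bursts; b++ {
		for i := 0; i < burstSize; i++ {
			sent++                 // stand-in for the real event write
			time.Sleep(eventDelay) // keeps CPU flat even inside the burst
		}
		time.Sleep(quietPeriod) // quiet period between bursts models off-peak hours
	}
	fmt.Printf("sent %d events across %d bursts\n", sent, bursts)
}
```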