optimizing badger cache, won a 10-15% improvement in most benchmarks

2025-11-16 15:07:36 +00:00
parent 9bb3a7e057
commit 95bcf85ad7
72 changed files with 8158 additions and 4048 deletions
--- a/cmd/benchmark/CACHE_TUNING_ANALYSIS.md
+++ b/cmd/benchmark/CACHE_TUNING_ANALYSIS.md
@@ -0,0 +1,97 @@
+# Badger Cache Tuning Analysis
+
+## Problem Identified
+
+From benchmark run `run_20251116_092759`, the Badger block cache showed critical performance issues:
+
+### Cache Metrics (Round 1):
+```
+Block cache might be too small. Metrics:
+- hit: 151,469
+- miss: 307,989
+- hit-ratio: 0.33 (33%)
+- keys-added: 226,912
+- keys-evicted: 226,893 (99.99% eviction rate!)
+- Cache life expectancy: 2 seconds (90th percentile)
+```
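
The derived figures in the log above can be re-computed from the raw counters. The following is a minimal Go sketch, assuming the hit ratio is hits / (hits + misses) and the eviction rate is keys-evicted / keys-added. The function names are illustrative and are not part of Badger or ORLY.

```go
package main

import "fmt"

// hitRatio returns hits / (hits + misses), or 0 when there were no lookups.
func hitRatio(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

// evictionRate returns evicted / added, or 0 when nothing was added.
func evictionRate(added, evicted uint64) float64 {
	if added == 0 {
		return 0
	}
	return float64(evicted) / float64(added)
}

func main() {
	// Counters copied from the Round 1 metrics above.
	fmt.Printf("hit ratio: %.2f\n", hitRatio(151469, 307989))               // hit ratio: 0.33
	fmt.Printf("eviction rate: %.2f%%\n", 100*evictionRate(226912, 226893)) // eviction rate: 99.99%
}
```

An eviction rate this close to 100% means almost every block that entered the cache was pushed out again before the run ended, which is consistent with the 2-second life expectancy in the log.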
+
+### Performance Impact:
+- **Burst Pattern Latency**: 9.35ms avg (vs 3.61ms for khatru-sqlite)
+- **P95 Latency**: 34.48ms (vs 8.59ms for khatru-sqlite)
+- **Cache hit ratio**: Only 33% - causing constant disk I/O
+
+## Root Cause
+
+The benchmark container was using **default Badger cache sizes** (much smaller than the code defaults):
+- Block cache: ~64 MB (Badger default)
+- Index cache: ~32 MB (Badger default)
+
+The code ships larger defaults (1024 MB block cache / 512 MB index cache), but they were not set in the Docker container, so the benchmark ran with the smaller Badger defaults.
+
+## Cache Size Calculation
+
+Based on benchmark workload analysis:
+
+### Block Cache Requirements:
+- Total cost added: 12.44 TB during test
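+- With ~227K keys added and nearly all of them evicted again, the cache needs to hold roughly 100-200K blocks in memory at once
+- At ~10-20 KB per block on average, that works out to 1-4 GB; sizing for the upper half of that range: **2-4 GB needed**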
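
The block cache estimate is a product of block count and average block size. As a rough sketch (decimal units, with a helper name chosen here for illustration), the corners of the ranges quoted above can be evaluated like this:

```go
package main

import "fmt"

// cacheGB estimates the memory (in decimal GB) needed to keep `blocks`
// blocks of `avgBlockKB` kilobytes each resident in the block cache.
func cacheGB(blocks, avgBlockKB float64) float64 {
	return blocks * avgBlockKB / 1e6 // 1 GB = 1e6 KB
}

func main() {
	// Evaluate every combination of the quoted bounds:
	// 100-200K blocks, 10-20 KB per block.
	for _, blocks := range []float64{100_000, 200_000} {
		for _, kb := range []float64{10, 20} {
			fmt.Printf("%.0f blocks x %.0f KB = %.1f GB\n", blocks, kb, cacheGB(blocks, kb))
		}
	}
}
```

The four corners come out at 1, 2, 2 and 4 GB, so the 2-4 GB figure covers everything except the most optimistic case of few, small blocks.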
+
+### Index Cache Requirements:
+- For 200K+ keys with metadata
+- Efficient index lookups during queries
+- **1-2 GB needed**
+
+## Solution
+
+Updated `Dockerfile.next-orly` with optimized cache settings:
+
+```dockerfile
+# 2 GB block cache
+ENV ORLY_DB_BLOCK_CACHE_MB=2048
+# 1 GB index cache
+ENV ORLY_DB_INDEX_CACHE_MB=1024
+```
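
To illustrate how such variables could reach the database, here is a hedged Go sketch of a helper that converts a size in MB from the environment into bytes, falling back to a default when the variable is missing or invalid. `cacheBytes` is a hypothetical name and ORLY's actual configuration loading may differ. The resulting byte counts are the kind of values that recent Badger versions accept through `Options.WithBlockCacheSize` and `Options.WithIndexCacheSize`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// cacheBytes reads a cache size in MB via lookup and returns it in bytes.
// It falls back to defaultMB when the variable is unset, not a number, or
// not positive. lookup has the same shape as os.LookupEnv, which keeps the
// helper easy to test.
func cacheBytes(lookup func(string) (string, bool), name string, defaultMB int64) int64 {
	mb := defaultMB
	if raw, ok := lookup(name); ok {
		if v, err := strconv.ParseInt(raw, 10, 64); err == nil && v > 0 {
			mb = v
		}
	}
	return mb << 20 // MB -> bytes
}

func main() {
	// Fallbacks mirror the code defaults quoted in this document
	// (1024 MB block cache, 512 MB index cache).
	block := cacheBytes(os.LookupEnv, "ORLY_DB_BLOCK_CACHE_MB", 1024)
	index := cacheBytes(os.LookupEnv, "ORLY_DB_INDEX_CACHE_MB", 512)
	fmt.Printf("block cache: %d bytes, index cache: %d bytes\n", block, index)
}
```

Falling back to an explicit default on a bad value is one possible design choice; failing fast at startup would make a misconfigured container easier to spot.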
+
+### Expected Improvements:
+- **Cache hit ratio**: Target 85-95% (up from 33%)
+- **Burst pattern latency**: Target <5ms avg (down from 9.35ms)
+- **P95 latency**: Target <15ms (down from 34.48ms)
+- **Query latency**: Significant reduction due to cached index lookups
+
+## Testing Strategy
+
+1. Rebuild Docker image with new cache settings
+2. Run full benchmark suite
+3. Compare metrics:
+   - Cache hit ratio
+   - Average/P95/P99 latencies
+   - Throughput under burst patterns
+   - Memory usage
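
The comparison step can be automated against the targets listed under Expected Improvements. The sketch below is an assumption about how such a check might look: the `Result` type and `missedTargets` function are hypothetical, and the thresholds are the lower bounds of the stated targets (hit ratio at least 85%, burst average under 5 ms, P95 under 15 ms).

```go
package main

import "fmt"

// Result holds the headline metrics of one benchmark run.
type Result struct {
	HitRatio   float64 // fraction between 0 and 1
	BurstAvgMs float64 // average burst-pattern latency in milliseconds
	P95Ms      float64 // 95th percentile latency in milliseconds
}

// missedTargets lists which of the stated targets a run fails to meet.
func missedTargets(r Result) []string {
	var missed []string
	if r.HitRatio < 0.85 {
		missed = append(missed, "hit ratio")
	}
	if r.BurstAvgMs >= 5 {
		missed = append(missed, "burst avg")
	}
	if r.P95Ms >= 15 {
		missed = append(missed, "p95")
	}
	return missed
}

func main() {
	// Baseline figures from run_20251116_092759.
	baseline := Result{HitRatio: 0.33, BurstAvgMs: 9.35, P95Ms: 34.48}
	fmt.Println(missedTargets(baseline)) // [hit ratio burst avg p95]
}
```

Running the same check on the post-change results gives a pass/fail answer per metric instead of a manual comparison.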
+
+## Memory Budget
+
+With these settings, the relay will use approximately:
+- Block cache: 2 GB
+- Index cache: 1 GB
+- Badger internal structures: ~200 MB
+- Go runtime: ~200 MB
+- **Total**: ~3.4 GB
+
+This is reasonable for a high-performance relay and well within modern server capabilities.
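
The total is a sum of the listed components. A small Go sketch of that arithmetic, where the two 200 MB entries are the rough estimates from the list above rather than measured values:

```go
package main

import "fmt"

// component is one line of the memory budget, in MB.
type component struct {
	name string
	mb   int
}

// totalMB adds up all components of the budget.
func totalMB(parts []component) int {
	sum := 0
	for _, p := range parts {
		sum += p.mb
	}
	return sum
}

func main() {
	budget := []component{
		{"block cache", 2048},
		{"index cache", 1024},
		{"Badger internal structures", 200}, // rough estimate from the text
		{"Go runtime", 200},                 // rough estimate from the text
	}
	total := totalMB(budget)
	fmt.Printf("total: %d MB (~%.1f GB)\n", total, float64(total)/1024) // total: 3472 MB (~3.4 GB)
}
```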
+
+## Alternative Configurations
+
+For constrained environments:
+
+### Medium (1.5 GB cache total):
+```
+ORLY_DB_BLOCK_CACHE_MB=1024
+ORLY_DB_INDEX_CACHE_MB=512
+```
+
+### Minimal (512 MB cache total):
+```
+ORLY_DB_BLOCK_CACHE_MB=384
+ORLY_DB_INDEX_CACHE_MB=128
+```
+
+Note: Smaller caches will result in lower hit ratios and higher latencies.
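
Choosing between these configurations can be sketched as a simple fit check. The example below is hypothetical: the profile names, the `pick` function, and the idea of adding the ~400 MB of Badger and Go runtime overhead from the Memory Budget section to each cache total are assumptions made here for illustration.

```go
package main

import "fmt"

// Profile is one of the cache configurations listed in this document.
type Profile struct {
	Name    string
	BlockMB int
	IndexMB int
}

// profiles is ordered from largest to smallest.
var profiles = []Profile{
	{"full", 2048, 1024},
	{"medium", 1024, 512},
	{"minimal", 384, 128},
}

// overheadMB combines the ~200 MB Badger and ~200 MB Go runtime estimates
// from the Memory Budget section; treat it as a rough assumption.
const overheadMB = 400

// pick returns the largest profile whose caches plus overhead fit in
// availableMB; ok is false when even the smallest profile does not fit.
func pick(availableMB int) (p Profile, ok bool) {
	for _, c := range profiles {
		if c.BlockMB+c.IndexMB+overheadMB <= availableMB {
			return c, true
		}
	}
	return Profile{}, false
}

func main() {
	for _, mem := range []int{4096, 2048, 1024, 512} {
		if p, ok := pick(mem); ok {
			fmt.Printf("%d MB -> %s (block=%d, index=%d)\n", mem, p.Name, p.BlockMB, p.IndexMB)
		} else {
			fmt.Printf("%d MB -> no listed profile fits\n", mem)
		}
	}
}
```

Under these assumptions a 512 MB container cannot hold even the minimal profile once overhead is counted, which is why the cache totals in the headings should not be read as total process memory.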