Decompose handle-event.go into DDD domain services (v0.36.15)
Some checks failed
Go / build-and-release (push) Has been cancelled
Major refactoring of event handling into clean, testable domain services:

- Add pkg/event/validation: JSON hex validation, signature verification, timestamp bounds, NIP-70 protected tag validation
- Add pkg/event/authorization: Policy and ACL authorization decisions, auth challenge handling, access level determination
- Add pkg/event/routing: Event router registry with ephemeral and delete handlers, kind-based dispatch
- Add pkg/event/processing: Event persistence, delivery to subscribers, and post-save hooks (ACL reconfig, sync, relay groups)
- Reduce handle-event.go from 783 to 296 lines (62% reduction)
- Add comprehensive unit tests for all new domain services
- Refactor database tests to use shared TestMain setup
- Fix blossom URL test expectations (missing "/" separator)
- Add go-memory-optimization skill and analysis documentation
- Update DDD_ANALYSIS.md to reflect completed decomposition

Files modified:

- app/handle-event.go: Slim orchestrator using domain services
- app/server.go: Service initialization and interface wrappers
- app/handle-event-types.go: Shared types (OkHelper, result types)
- pkg/event/validation/*: New validation service package
- pkg/event/authorization/*: New authorization service package
- pkg/event/routing/*: New routing service package
- pkg/event/processing/*: New processing service package
- pkg/database/*_test.go: Refactored to shared TestMain
- pkg/blossom/http_test.go: Fixed URL format expectations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/MEMORY_OPTIMIZATION_ANALYSIS.md (new file, 366 lines)
@@ -0,0 +1,366 @@
# ORLY Relay Memory Optimization Analysis

This document analyzes ORLY's current memory optimization patterns against Go best practices for high-performance systems. The analysis covers buffer management, caching strategies, and allocation patterns, and identifies optimization opportunities.

## Executive Summary

ORLY implements several sophisticated memory optimization strategies:

- **Compact event storage** achieving ~87% space savings via serial references
- **Two-level caching** for serial lookups and query results
- **ZSTD compression** for the query cache with LRU eviction
- **Atomic operations** for lock-free statistics tracking
- **Pre-allocation patterns** for slice capacity management

However, several opportunities exist to further reduce GC pressure:

- Implement `sync.Pool` for frequently allocated buffers
- Use fixed-size arrays for cryptographic values
- Pool `bytes.Buffer` instances in hot paths
- Optimize escape behavior in serialization code

---

## Current Memory Patterns

### 1. Compact Event Storage

**Location**: `pkg/database/compact_event.go`

ORLY's most significant memory optimization is the compact binary format for event storage:

```
Original event: 32 (ID) + 32 (pubkey) + 32*4 (tags) = 192+ bytes
Compact format: 5 (pubkey serial) + 5*4 (tag serials) = 25 bytes
Savings: ~87% per event
```

**Key techniques:**

- 5-byte serial references replace 32-byte IDs/pubkeys
- Varint encoding for variable-length integers (CreatedAt, tag counts)
- Type flags for efficient deserialization
- Separate `SerialEventId` index for ID reconstruction

**Assessment**: Excellent storage optimization. This dramatically reduces database size and I/O costs.
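To make the serial-plus-varint idea concrete, here is a minimal sketch. The `writeSerial40` and `encodeHeader` helpers and the field order are assumptions for illustration only, not ORLY's actual compact wire format:

```go
// Illustrative only - requires "bytes" and "encoding/binary".
// writeSerial40 writes the low 40 bits of a serial as 5 big-endian bytes.
func writeSerial40(buf *bytes.Buffer, serial uint64) {
    var b [5]byte
    b[0] = byte(serial >> 32)
    binary.BigEndian.PutUint32(b[1:], uint32(serial))
    buf.Write(b[:])
}

// encodeHeader sketches how a compact record might combine a pubkey serial
// with a varint-encoded CreatedAt timestamp.
func encodeHeader(pubkeySerial uint64, createdAt int64) []byte {
    buf := new(bytes.Buffer)
    writeSerial40(buf, pubkeySerial) // 5 bytes instead of a 32-byte pubkey
    var tmp [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(tmp[:], uint64(createdAt)) // varints keep small values small
    buf.Write(tmp[:n])
    return buf.Bytes()
}
```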
### 2. Serial Cache System

**Location**: `pkg/database/serial_cache.go`

Two-way lookup cache for serial ↔ ID/pubkey mappings:

```go
type SerialCache struct {
    pubkeyBySerial      map[uint64][]byte // For decoding
    serialByPubkeyHash  map[string]uint64 // For encoding
    eventIdBySerial     map[uint64][]byte // For decoding
    serialByEventIdHash map[string]uint64 // For encoding
}
```

**Memory footprint:**

- Pubkey cache: 100k entries × 32 bytes ≈ 3.2 MB
- Event ID cache: 500k entries × 32 bytes ≈ 16 MB
- Total: ~19-20 MB overhead

**Strengths:**

- Fine-grained `RWMutex` locking per direction/type
- Configurable cache limits
- Defensive copying prevents external mutations

**Improvement opportunity:** The eviction strategy (clear 50% when full) is simple but not LRU. Consider ring buffers or generational caching for better hit rates.
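One possible shape for a generational variant is sketched below; the `genCache` type and its fields are illustrative, not existing ORLY code. The idea is to keep two maps, promote entries on hit, and drop the old generation wholesale when the young one fills, which preferentially evicts cold entries without per-entry LRU bookkeeping:

```go
// Sketch of generational eviction for a serial lookup cache (requires "sync").
type genCache struct {
    mu       sync.Mutex
    young    map[uint64][]byte // recently used entries
    old      map[uint64][]byte // previous generation, discarded wholesale
    maxYoung int
}

func (c *genCache) Get(serial uint64) ([]byte, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if v, ok := c.young[serial]; ok {
        return v, true
    }
    if v, ok := c.old[serial]; ok {
        c.put(serial, v) // promote on hit so hot entries survive rotation
        return v, true
    }
    return nil, false
}

func (c *genCache) Put(serial uint64, v []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.put(serial, v)
}

func (c *genCache) put(serial uint64, v []byte) {
    if len(c.young) >= c.maxYoung {
        c.old = c.young // rotate: the old generation is dropped
        c.young = make(map[uint64][]byte, c.maxYoung)
    }
    c.young[serial] = v
}
```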
### 3. Query Cache with ZSTD Compression

**Location**: `pkg/database/querycache/event_cache.go`

```go
type EventCache struct {
    entries map[string]*EventCacheEntry
    lruList *list.List
    encoder *zstd.Encoder // Reused encoder (level 9)
    decoder *zstd.Decoder // Reused decoder
    maxSize int64         // Default 512MB compressed
}
```

**Strengths:**

- ZSTD level 9 compression (best ratio)
- Encoder/decoder reuse avoids repeated initialization
- LRU eviction with proper size tracking
- Background cleanup of expired entries
- Tracks compression ratio with exponential moving average

**Memory pattern:** Stores compressed data in cache, decompresses on-demand. This trades CPU for memory.
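The compress-on-store, decompress-on-hit round trip with reused coders then reduces to a pair of thin wrappers. This is a sketch assuming the `github.com/klauspost/compress/zstd` package; the `compressEntry`/`decompressEntry` wrappers, the level option, and the error handling are illustrative, not the cache's actual code:

```go
// Sketch: one shared encoder/decoder pair, reused for all cache entries.
var (
    enc, _ = zstd.NewWriter(nil, zstd.WithEncoderLevel(zstd.SpeedBestCompression))
    dec, _ = zstd.NewReader(nil)
)

// compressEntry runs once when a query result is cached.
func compressEntry(serialized []byte) []byte {
    return enc.EncodeAll(serialized, nil)
}

// decompressEntry runs on each cache hit; DecodeAll can be called concurrently.
func decompressEntry(compressed []byte) ([]byte, error) {
    return dec.DecodeAll(compressed, nil)
}
```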
### 4. Buffer Allocation Patterns

**Current approach:** Uses `new(bytes.Buffer)` throughout serialization code:

```go
// pkg/database/save-event.go, compact_event.go, serial_cache.go
buf := new(bytes.Buffer)
// ... encode data
return buf.Bytes()
```

**Assessment:** Each call allocates a new buffer on the heap. For high-throughput scenarios (thousands of events/second), this creates significant GC pressure.

---

## Optimization Opportunities

### 1. Implement sync.Pool for Buffer Reuse

**Priority: High**

Currently, ORLY creates new `bytes.Buffer` instances for every serialization operation. A buffer pool would amortize allocation costs:

```go
// Recommended implementation
var bufferPool = sync.Pool{
    New: func() interface{} {
        return bytes.NewBuffer(make([]byte, 0, 4096))
    },
}

func getBuffer() *bytes.Buffer {
    return bufferPool.Get().(*bytes.Buffer)
}

func putBuffer(buf *bytes.Buffer) {
    buf.Reset()
    bufferPool.Put(buf)
}
```

**Impact areas:**

- `pkg/database/compact_event.go` - MarshalCompactEvent, encodeCompactTag
- `pkg/database/save-event.go` - index key generation
- `pkg/database/serial_cache.go` - GetEventIdBySerial, StoreEventIdSerial

**Expected benefit:** 50-80% reduction in buffer allocations on hot paths.
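Call sites would borrow a buffer, encode into it, and copy the result out before returning the buffer to the pool. The sketch below uses the `getBuffer`/`putBuffer` helpers above; the `marshalWithPool` wrapper is illustrative, not existing code, and the copy before return matters because the pooled buffer's backing array is reused:

```go
func marshalWithPool(encode func(*bytes.Buffer)) []byte {
    buf := getBuffer()
    defer putBuffer(buf)

    encode(buf)

    // Copy out: the pooled buffer is reset and reused after putBuffer,
    // so callers must not retain buf.Bytes() directly.
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out
}
```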
### 2. Fixed-Size Array Types for Cryptographic Values

**Priority: Medium**

The external nostr library uses `[]byte` slices for IDs, pubkeys, and signatures. However, these values always have fixed sizes:

| Type | Size | Current | Recommended |
|------|------|---------|-------------|
| Event ID | 32 bytes | `[]byte` | `[32]byte` |
| Pubkey | 32 bytes | `[]byte` | `[32]byte` |
| Signature | 64 bytes | `[]byte` | `[64]byte` |

Internal types like `Uint40` already follow this pattern, using a struct wrapper:

```go
// Current (pkg/database/indexes/types/uint40.go)
type Uint40 struct{ value uint64 }

// Already efficient - no slice allocation
```

For cryptographic values, consider wrapper types:

```go
type EventID [32]byte
type Pubkey [32]byte
type Signature [64]byte

func (id EventID) IsZero() bool { return id == EventID{} }
func (id EventID) Hex() string  { return hex.Enc(id[:]) }
```

**Benefit:** Stack allocation for local variables and efficient zero-value comparison.

### 3. Pre-allocated Slice Patterns

**Current usage is good:**

```go
// pkg/database/save-event.go:51-54
sers = make(types.Uint40s, 0, len(idxs)*100) // Estimate 100 serials per index

// pkg/database/compact_event.go:283
ev.Tags = tag.NewSWithCap(int(nTags)) // Pre-allocate tag slice
```

**Improvement:** Apply the same pattern consistently to:

- `Uint40s.Union/Intersection/Difference` methods (currently use `append` without capacity hints; see the sketch below)
- Query result accumulation in `query-events.go`
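As an illustration of the capacity-hint idea for set operations, here is a sketch over plain `uint64` serials. ORLY's `types.Uint40s` wraps a struct, so the real method would differ; the point is sizing the result to the smaller input up front:

```go
// Sketch with plain uint64 serials; illustrative only.
func intersect(a, b []uint64) []uint64 {
    small, large := a, b
    if len(b) < len(a) {
        small, large = b, a
    }
    seen := make(map[uint64]struct{}, len(small))
    for _, s := range small {
        seen[s] = struct{}{}
    }
    out := make([]uint64, 0, len(small)) // capacity hint: result cannot exceed the smaller set
    for _, s := range large {
        if _, ok := seen[s]; ok {
            out = append(out, s)
            delete(seen, s) // avoid duplicate output values
        }
    }
    return out
}
```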
### 4. Escape Analysis Optimization

**Priority: Medium**

Several patterns cause unnecessary heap escapes. Check with:

```bash
go build -gcflags="-m -m" ./pkg/database/...
```

**Common escape causes in the codebase:**

```go
// compact_event.go:224 - Small slice escapes
buf := make([]byte, 5) // Could be [5]byte on stack

// compact_event.go:335 - Single-byte slice escapes
typeBuf := make([]byte, 1) // Could be var typeBuf [1]byte
```

**Fix:**

```go
func readUint40(r io.Reader) (value uint64, err error) {
    var buf [5]byte // Stack-allocated
    if _, err = io.ReadFull(r, buf[:]); err != nil {
        return 0, err
    }
    // ...
}
```

### 5. Atomic Bytes Wrapper Optimization

**Location**: `pkg/utils/atomic/bytes.go`

The current implementation copies on both Load and Store:

```go
func (x *Bytes) Load() (b []byte) {
    vb := x.v.Load().([]byte)
    b = make([]byte, len(vb)) // Allocation on every Load
    copy(b, vb)
    return
}
```

This is safe but expensive for high-frequency access. Consider:

- A read-copy-update (RCU) pattern for read-heavy workloads (see the sketch below)
- `sync.RWMutex` with direct access for controlled use cases
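For the read-heavy case, one RCU-style shape is to publish immutable snapshots behind `atomic.Pointer` (Go 1.19+, `sync/atomic`) so readers skip the copy entirely. This is a sketch of the pattern, not the existing `Bytes` API; the `readMostlyBytes` name is illustrative:

```go
// readMostlyBytes publishes immutable snapshots: writers replace, readers never copy.
type readMostlyBytes struct {
    p atomic.Pointer[[]byte]
}

func (x *readMostlyBytes) Load() []byte {
    if b := x.p.Load(); b != nil {
        return *b // callers must treat the returned slice as read-only
    }
    return nil
}

func (x *readMostlyBytes) Store(b []byte) {
    // Copy once on Store so the published slice can never be mutated by the caller.
    snapshot := make([]byte, len(b))
    copy(snapshot, b)
    x.p.Store(&snapshot)
}
```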
### 6. Goroutine Management

**Current patterns:**

- Worker goroutines for message processing (`app/listener.go`)
- Background cleanup goroutines (`querycache/event_cache.go`)
- Pinger goroutines per connection (`app/handle-websocket.go`)

**Assessment:** Good use of bounded channels and `sync.WaitGroup` for lifecycle management.

**Improvement:** Consider a worker pool for subscription handlers to limit peak goroutine count:

```go
type WorkerPool struct {
    jobs    chan func()
    workers int
    wg      sync.WaitGroup
}
```
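A minimal completion of that shape might look like the following sketch; the constructor and method names, and how subscription handlers enqueue jobs, are assumptions:

```go
func NewWorkerPool(workers, queueSize int) *WorkerPool {
    p := &WorkerPool{jobs: make(chan func(), queueSize), workers: workers}
    p.wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer p.wg.Done()
            for job := range p.jobs { // drain until Close
                job()
            }
        }()
    }
    return p
}

// Submit blocks when the queue is full, bounding peak memory and goroutine count.
func (p *WorkerPool) Submit(job func()) { p.jobs <- job }

func (p *WorkerPool) Close() {
    close(p.jobs)
    p.wg.Wait()
}
```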
---

## Memory Budget Analysis

### Runtime Memory Breakdown

| Component | Estimated Size | Notes |
|-----------|----------------|-------|
| Serial Cache (pubkeys) | 3.2 MB | 100k × 32 bytes |
| Serial Cache (event IDs) | 16 MB | 500k × 32 bytes |
| Query Cache | 512 MB | Configurable, compressed |
| Per-connection state | ~10 KB | Channels, buffers, maps |
| Badger DB caches | Variable | Controlled by Badger config |

### GC Tuning Recommendations

For a relay handling 1000+ events/second:

```go
// main.go or init
import "runtime/debug"

func init() {
    // More aggressive GC to limit heap growth
    debug.SetGCPercent(50) // GC at 50% heap growth (default 100)

    // Set soft memory limit based on available RAM
    debug.SetMemoryLimit(2 << 30) // 2GB limit
}
```

Or via environment:

```bash
GOGC=50 GOMEMLIMIT=2GiB ./orly
```

---

## Profiling Commands

### Heap Profile

```bash
# Enable pprof (already supported)
ORLY_PPROF_HTTP=true ./orly

# Capture heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Analyze allocations
go tool pprof -alloc_space heap.prof
go tool pprof -inuse_space heap.prof
```

### Escape Analysis

```bash
# Check which variables escape to heap
go build -gcflags="-m -m" ./pkg/database/... 2>&1 | grep "escapes to heap"
```

### Allocation Benchmarks

Add to existing benchmarks:

```go
func BenchmarkCompactMarshal(b *testing.B) {
    b.ReportAllocs()
    ev := createTestEvent()
    resolver := &testResolver{}

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        data, _ := MarshalCompactEvent(ev, resolver)
        _ = data
    }
}
```

---

## Implementation Priority

1. **High Priority (Immediate Impact)**
   - Implement `sync.Pool` for `bytes.Buffer` in serialization paths
   - Replace small `make([]byte, n)` with fixed arrays in decode functions

2. **Medium Priority (Significant Improvement)**
   - Add pre-allocation hints to set operation methods
   - Optimize escape behavior in compact event encoding
   - Consider a worker pool for subscription handlers

3. **Low Priority (Refinement)**
   - LRU-based serial cache eviction
   - Fixed-size types for cryptographic values (requires nostr library changes)
   - RCU pattern for atomic bytes in high-frequency paths

---

## Conclusion

ORLY demonstrates thoughtful memory optimization in its storage layer, particularly the compact event format achieving ~87% space savings. The dual-cache architecture (serial cache + query cache) balances memory usage with lookup performance.

The primary opportunity for improvement is in the serialization hot path, where buffer pooling could significantly reduce GC pressure. The recommended `sync.Pool` implementation would have immediate benefits for high-throughput deployments without requiring architectural changes.

Secondary improvements around escape analysis and fixed-size types would provide incremental gains and should be prioritized based on profiling data from production workloads.