# ORLY Relay Memory Optimization Analysis

This document analyzes ORLY's current memory optimization patterns against Go best practices for high-performance systems. The analysis covers buffer management, caching strategies, and allocation patterns, and identifies optimization opportunities.

## Executive Summary

ORLY implements several sophisticated memory optimization strategies:

- **Compact event storage** achieving ~87% space savings via serial references
- **Two-level caching** for serial lookups and query results
- **ZSTD compression** for query cache with LRU eviction
- **Atomic operations** for lock-free statistics tracking
- **Pre-allocation patterns** for slice capacity management

However, several opportunities exist to further reduce GC pressure:

- Implement `sync.Pool` for frequently allocated buffers
- Use fixed-size arrays for cryptographic values
- Pool `bytes.Buffer` instances in hot paths
- Optimize escape behavior in serialization code
---

## Current Memory Patterns

### 1. Compact Event Storage

**Location**: `pkg/database/compact_event.go`

ORLY's most significant memory optimization is the compact binary format for event storage:

```
Original event: 32 (ID) + 32 (pubkey) + 32*4 (tags) = 192+ bytes
Compact format: 5 (pubkey serial) + 5*4 (tag serials) = 25 bytes
Savings: ~87% compression per event
```

**Key techniques:**
- 5-byte serial references replace 32-byte IDs/pubkeys
- Varint encoding for variable-length integers (CreatedAt, tag counts)
- Type flags for efficient deserialization
- Separate `SerialEventId` index for ID reconstruction

**Assessment**: Excellent storage optimization. This dramatically reduces database size and I/O costs.
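
To make the layout concrete, here is a minimal sketch of how a 5-byte serial plus varint fields could be written; the helper names (`writeSerial5`, `writeVarint`) and the byte order are illustrative and do not mirror the actual `compact_event.go` API.

```go
import (
	"bytes"
	"encoding/binary"
)

// writeSerial5 writes the low 40 bits of a serial as 5 big-endian bytes,
// replacing a full 32-byte ID or pubkey with a database-local reference.
func writeSerial5(buf *bytes.Buffer, serial uint64) {
	b := [5]byte{
		byte(serial >> 32), byte(serial >> 24), byte(serial >> 16),
		byte(serial >> 8), byte(serial),
	}
	buf.Write(b[:])
}

// writeVarint appends a uvarint-encoded value, e.g. CreatedAt or a tag count.
func writeVarint(buf *bytes.Buffer, v uint64) {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	buf.Write(tmp[:n])
}
```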

### 2. Serial Cache System

**Location**: `pkg/database/serial_cache.go`

Two-way lookup cache for serial ↔ ID/pubkey mappings:

```go
type SerialCache struct {
	pubkeyBySerial      map[uint64][]byte // For decoding
	serialByPubkeyHash  map[string]uint64 // For encoding
	eventIdBySerial     map[uint64][]byte // For decoding
	serialByEventIdHash map[string]uint64 // For encoding
}
```

**Memory footprint:**
- Pubkey cache: 100k entries × 32 bytes ≈ 3.2 MB
- Event ID cache: 500k entries × 32 bytes ≈ 16 MB
- Total: ~19-20 MB overhead (value bytes only, excluding map bookkeeping)

**Strengths:**
- Fine-grained `RWMutex` locking per direction/type
- Configurable cache limits
- Defensive copying prevents external mutations

**Improvement opportunity:** The eviction strategy (clear 50% of entries when full) is simple but not LRU. Consider ring buffers or generational caching for better hit rates.
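
As a rough illustration of the generational idea, the sketch below keeps two map generations and drops the older one wholesale when the newer one fills; the type and method names are invented for this example and are not part of the existing `SerialCache`.

```go
import "sync"

// generationalCache answers lookups from the young generation first, then
// the old one. When the young map reaches its limit, the old generation is
// discarded and the young one takes its place, so recently stored entries
// survive at least one rotation.
type generationalCache struct {
	mu    sync.RWMutex
	young map[uint64][]byte
	old   map[uint64][]byte
	limit int
}

func (c *generationalCache) get(serial uint64) (val []byte, ok bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if val, ok = c.young[serial]; ok {
		return val, true
	}
	val, ok = c.old[serial]
	return val, ok
}

func (c *generationalCache) put(serial uint64, val []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.young) >= c.limit {
		c.old = c.young // rotate: the previous generation becomes evictable
		c.young = make(map[uint64][]byte, c.limit)
	}
	c.young[serial] = val
}
```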

### 3. Query Cache with ZSTD Compression

**Location**: `pkg/database/querycache/event_cache.go`

```go
type EventCache struct {
	entries map[string]*EventCacheEntry
	lruList *list.List
	encoder *zstd.Encoder // Reused encoder (level 9)
	decoder *zstd.Decoder // Reused decoder
	maxSize int64         // Default 512MB compressed
}
```

**Strengths:**
- ZSTD level 9 compression (best ratio)
- Encoder/decoder reuse avoids repeated initialization
- LRU eviction with proper size tracking
- Background cleanup of expired entries
- Tracks compression ratio with an exponential moving average

**Memory pattern:** Stores compressed data in the cache and decompresses on demand, trading CPU for memory.
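
The reuse pattern looks roughly like the sketch below, assuming the `github.com/klauspost/compress/zstd` package; mapping "level 9" through `EncoderLevelFromZstd` is an assumption about how the cache is configured rather than a confirmed detail.

```go
import "github.com/klauspost/compress/zstd"

// newCodec builds one encoder/decoder pair that is shared by all cache
// operations, avoiding per-call initialization.
func newCodec() (enc *zstd.Encoder, dec *zstd.Decoder, err error) {
	if enc, err = zstd.NewWriter(nil,
		zstd.WithEncoderLevel(zstd.EncoderLevelFromZstd(9))); err != nil {
		return nil, nil, err
	}
	if dec, err = zstd.NewReader(nil); err != nil {
		return nil, nil, err
	}
	return enc, dec, nil
}

// Whole-value compress on write, decompress on read: this is the
// CPU-for-memory trade described above.
func compress(enc *zstd.Encoder, raw []byte) []byte {
	return enc.EncodeAll(raw, nil)
}

func decompress(dec *zstd.Decoder, compressed []byte) ([]byte, error) {
	return dec.DecodeAll(compressed, nil)
}
```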

### 4. Buffer Allocation Patterns

**Current approach:** Uses `new(bytes.Buffer)` throughout the serialization code:

```go
// pkg/database/save-event.go, compact_event.go, serial_cache.go
buf := new(bytes.Buffer)
// ... encode data
return buf.Bytes()
```

**Assessment:** Each call allocates a new buffer on the heap. For high-throughput scenarios (thousands of events per second), this creates significant GC pressure.

---

## Optimization Opportunities

### 1. Implement sync.Pool for Buffer Reuse

**Priority: High**

Currently, ORLY creates new `bytes.Buffer` instances for every serialization operation. A buffer pool would amortize allocation costs:

```go
// Recommended implementation
var bufferPool = sync.Pool{
	New: func() interface{} {
		return bytes.NewBuffer(make([]byte, 0, 4096))
	},
}

func getBuffer() *bytes.Buffer {
	return bufferPool.Get().(*bytes.Buffer)
}

func putBuffer(buf *bytes.Buffer) {
	buf.Reset()
	bufferPool.Put(buf)
}
```

**Impact areas:**
- `pkg/database/compact_event.go` - MarshalCompactEvent, encodeCompactTag
- `pkg/database/save-event.go` - index key generation
- `pkg/database/serial_cache.go` - GetEventIdBySerial, StoreEventIdSerial

**Expected benefit:** 50-80% reduction in buffer allocations on hot paths.
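
In the impact areas above, a caller would borrow a buffer, encode into it, and copy the result out before returning the buffer to the pool. The sketch below shows that shape; `marshalWithPool` and its body are hypothetical stand-ins for the real encoding logic.

```go
func marshalWithPool(payload []byte) []byte {
	buf := getBuffer()
	defer putBuffer(buf) // Reset and return to the pool when done

	buf.Write(payload) // stand-in for the real encoding steps

	// Copy out: the pooled buffer's backing array will be reused by the
	// next borrower, so the result must not alias buf.Bytes().
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}
```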

### 2. Fixed-Size Array Types for Cryptographic Values

**Priority: Medium**

The external nostr library uses `[]byte` slices for IDs, pubkeys, and signatures. However, these values always have fixed sizes:

| Type | Size | Current | Recommended |
|------|------|---------|-------------|
| Event ID | 32 bytes | `[]byte` | `[32]byte` |
| Pubkey | 32 bytes | `[]byte` | `[32]byte` |
| Signature | 64 bytes | `[]byte` | `[64]byte` |

Internal types like `Uint40` already avoid slice allocation by wrapping a fixed-size value in a struct:

```go
// Current (pkg/database/indexes/types/uint40.go)
type Uint40 struct{ value uint64 }

// Already efficient - no slice allocation
```

For cryptographic values, consider wrapper types:

```go
type EventID [32]byte
type Pubkey [32]byte
type Signature [64]byte

func (id EventID) IsZero() bool { return id == EventID{} }
func (id EventID) Hex() string  { return hex.Enc(id[:]) }
```

**Benefit:** Stack allocation for local variables and efficient zero-value comparison.
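
Bridging from the library's `[]byte` values into such types would need a length-checked conversion; the helper below is a hypothetical illustration of that boundary, not an existing function.

```go
import "fmt"

// eventIDFromSlice copies a 32-byte slice into the fixed-size type, so the
// result is independent of the source slice's backing array.
func eventIDFromSlice(b []byte) (EventID, error) {
	var id EventID
	if len(b) != len(id) {
		return id, fmt.Errorf("event id must be %d bytes, got %d", len(id), len(b))
	}
	copy(id[:], b)
	return id, nil
}
```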

### 3. Pre-allocated Slice Patterns

**Current usage is good:**

```go
// pkg/database/save-event.go:51-54
sers = make(types.Uint40s, 0, len(idxs)*100) // Estimate 100 serials per index

// pkg/database/compact_event.go:283
ev.Tags = tag.NewSWithCap(int(nTags)) // Pre-allocate tag slice
```

**Improvement:** Apply the same pattern consistently to (see the sketch after this list):
- `Uint40s.Union/Intersection/Difference` methods (currently use `append` without capacity hints)
- Query result accumulation in `query-events.go`
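
For the set operations, the result size has a known upper bound, so a capacity hint removes reallocation entirely. The function below is an illustrative intersection over plain `uint64` slices (assumed sorted ascending), not a drop-in replacement for the `Uint40s` methods.

```go
// intersectWithHint pre-sizes the result to the smaller input, the upper
// bound on an intersection, so append never grows the backing array.
func intersectWithHint(a, b []uint64) []uint64 {
	capHint := len(a)
	if len(b) < capHint {
		capHint = len(b)
	}
	out := make([]uint64, 0, capHint)
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}
```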

### 4. Escape Analysis Optimization

**Priority: Medium**

Several patterns cause unnecessary heap escapes. Check with:

```bash
go build -gcflags="-m -m" ./pkg/database/...
```

**Common escape causes in the codebase:**

```go
// compact_event.go:224 - Small slice escapes
buf := make([]byte, 5) // Could be [5]byte on the stack

// compact_event.go:335 - Single-byte slice escapes
typeBuf := make([]byte, 1) // Could be var typeBuf [1]byte
```

**Fix:**

```go
func readUint40(r io.Reader) (value uint64, err error) {
	var buf [5]byte // Stack-allocated
	if _, err = io.ReadFull(r, buf[:]); err != nil {
		return 0, err
	}
	// ... decode buf[:] into value and return
}
```

### 5. Atomic Bytes Wrapper Optimization

**Location**: `pkg/utils/atomic/bytes.go`

The current implementation copies on both Load and Store:

```go
func (x *Bytes) Load() (b []byte) {
	vb := x.v.Load().([]byte)
	b = make([]byte, len(vb)) // Allocation on every Load
	copy(b, vb)
	return
}
```

This is safe but expensive for high-frequency access. Consider (see the sketch after this list):
- A read-copy-update (RCU) pattern for read-heavy workloads
- `sync.RWMutex` with direct access for controlled use cases
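
In the RCU-style variant, readers share the stored slice without copying and must treat it as immutable, while writers publish a fresh copy. The sketch below illustrates the trade-off; it is not the existing `atomic.Bytes` type.

```go
import "sync/atomic"

// rcuBytes shares the underlying slice with readers; the contract is that
// callers never mutate what Load returns.
type rcuBytes struct {
	v atomic.Value // holds []byte
}

// Load returns the current slice without copying. Safe only while callers
// treat the result as read-only.
func (x *rcuBytes) Load() []byte {
	b, _ := x.v.Load().([]byte)
	return b
}

// Store copies the input once and then publishes it atomically, so later
// mutation of src by the caller cannot race with concurrent readers.
func (x *rcuBytes) Store(src []byte) {
	cp := make([]byte, len(src))
	copy(cp, src)
	x.v.Store(cp)
}
```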

### 6. Goroutine Management

**Current patterns:**
- Worker goroutines for message processing (`app/listener.go`)
- Background cleanup goroutines (`querycache/event_cache.go`)
- Pinger goroutines per connection (`app/handle-websocket.go`)

**Assessment:** Good use of bounded channels and `sync.WaitGroup` for lifecycle management.

**Improvement:** Consider a worker pool for subscription handlers to limit peak goroutine count:

```go
type WorkerPool struct {
	jobs    chan func()
	workers int
	wg      sync.WaitGroup
}
```
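
A minimal sketch of how such a pool could run, assuming jobs are self-contained closures; these methods are illustrative, not an existing type in the codebase.

```go
// Start launches a fixed number of workers that drain the jobs channel.
func (p *WorkerPool) Start() {
	for i := 0; i < p.workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for job := range p.jobs {
				job()
			}
		}()
	}
}

// Submit blocks when the queue is full, which bounds the number of
// in-flight subscription handlers instead of spawning one goroutine each.
func (p *WorkerPool) Submit(job func()) {
	p.jobs <- job
}

// Stop closes the queue and waits for in-flight jobs to finish.
func (p *WorkerPool) Stop() {
	close(p.jobs)
	p.wg.Wait()
}
```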

---

## Memory Budget Analysis

### Runtime Memory Breakdown

| Component | Estimated Size | Notes |
|-----------|---------------|-------|
| Serial Cache (pubkeys) | 3.2 MB | 100k × 32 bytes |
| Serial Cache (event IDs) | 16 MB | 500k × 32 bytes |
| Query Cache | 512 MB | Configurable, compressed |
| Per-connection state | ~10 KB | Channels, buffers, maps |
| Badger DB caches | Variable | Controlled by Badger config |

### GC Tuning Recommendations

For a relay handling 1000+ events/second:

```go
// main.go or init
import "runtime/debug"

func init() {
	// More aggressive GC to limit heap growth
	debug.SetGCPercent(50) // GC at 50% heap growth (default 100)

	// Set a soft memory limit based on available RAM
	debug.SetMemoryLimit(2 << 30) // 2 GiB limit
}
```

Or via environment variables:

```bash
GOGC=50 GOMEMLIMIT=2GiB ./orly
```

---

## Profiling Commands

### Heap Profile

```bash
# Enable pprof (already supported)
ORLY_PPROF_HTTP=true ./orly

# Capture a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Analyze allocations from a saved profile
go tool pprof -alloc_space heap.prof
go tool pprof -inuse_space heap.prof
```

### Escape Analysis

```bash
# Check which variables escape to heap
go build -gcflags="-m -m" ./pkg/database/... 2>&1 | grep "escapes to heap"
```

### Allocation Benchmarks

Add to existing benchmarks:

```go
func BenchmarkCompactMarshal(b *testing.B) {
	b.ReportAllocs()
	ev := createTestEvent()
	resolver := &testResolver{}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		data, _ := MarshalCompactEvent(ev, resolver)
		_ = data
	}
}
```
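
To run it with allocation counts reported, the standard `go test` flags suffice; the package path here assumes the benchmark lives alongside the database code.

```bash
go test -run='^$' -bench=BenchmarkCompactMarshal -benchmem ./pkg/database/
```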

---

## Implementation Priority

1. **High Priority (Immediate Impact)**
   - Implement `sync.Pool` for `bytes.Buffer` in serialization paths
   - Replace small `make([]byte, n)` calls with fixed-size arrays in decode functions

2. **Medium Priority (Significant Improvement)**
   - Add pre-allocation hints to set operation methods
   - Optimize escape behavior in compact event encoding
   - Consider a worker pool for subscription handlers

3. **Low Priority (Refinement)**
   - LRU-based serial cache eviction
   - Fixed-size types for cryptographic values (requires nostr library changes)
   - RCU pattern for atomic bytes in high-frequency paths

---

## Conclusion

ORLY demonstrates thoughtful memory optimization in its storage layer, particularly the compact event format achieving ~87% space savings. The dual-cache architecture (serial cache + query cache) balances memory usage with lookup performance.

The primary opportunity for improvement is in the serialization hot path, where buffer pooling could significantly reduce GC pressure. The recommended `sync.Pool` implementation would deliver immediate benefits for high-throughput deployments without requiring architectural changes.

Secondary improvements around escape analysis and fixed-size types would provide incremental gains and should be prioritized based on profiling data from production workloads.