- Introduced benchmark tests for JSON and binary marshaling/unmarshaling, canonical encoding, and ID generation to assess performance. - Implemented optimizations to reduce memory allocations and CPU processing time across various encoding methods. - Enhanced `Marshal`, `ToCanonical`, and `MarshalBinary` methods with pre-allocation strategies to minimize reallocations. - Added handling for nil tags to avoid unnecessary allocations during binary encoding. - Documented performance improvements in the new PERFORMANCE_REPORT.md file, highlighting significant reductions in execution time and memory usage.
10 KiB
Event Encoder Performance Optimization Report
Executive Summary
This report documents the profiling and optimization of event encoders in the next.orly.dev/pkg/encoders/event package. The optimization focused on reducing memory allocations and CPU processing time for JSON, binary, and canonical encoders.
Methodology
Profiling Setup
-
Created comprehensive benchmark tests covering:
- JSON marshaling/unmarshaling
- Binary marshaling/unmarshaling
- Canonical encoding
- ID generation (canonical + SHA256)
- Round-trip operations
- Small and large event sizes
-
Used Go's built-in profiling tools:
- CPU profiling (
-cpuprofile) - Memory profiling (
-memprofile) - Allocation tracking (
-benchmem)
- CPU profiling (
Initial Findings
The profiling data revealed several key bottlenecks:
-
JSON Marshal: 6 allocations per operation, 2232 bytes allocated
-
Canonical Encoding: 5 allocations per operation, 1208 bytes allocated
-
Memory Allocations: Primary hotspots identified:
text.NostrEscape: 3.95GB total allocations (45.34% of all allocations)event.Marshal: 1.39GB allocationsevent.ToCanonical: 0.22GB allocations
-
CPU Processing: Primary hotspots:
text.NostrEscape: 4.39s (23.12% of CPU time)runtime.mallocgc: 3.98s (20.96% of CPU time)event.Marshal: 3.16s (16.64% of CPU time)
Optimizations Implemented
1. JSON Marshal Optimization
Problem: Multiple allocations from make([]byte, ...) calls and buffer growth during append operations.
Solution:
- Pre-allocate output buffer using
EstimateSize()whendstisnil - Track hex encoding positions to avoid recalculating slice offsets
- Add 100-byte overhead for JSON structure (keys, quotes, commas)
Code Changes (event.go):
func (ev *E) Marshal(dst []byte) (b []byte) {
b = dst
// Pre-allocate buffer if nil to reduce reallocations
if b == nil {
estimatedSize := ev.EstimateSize()
estimatedSize += 100 // JSON structure overhead
b = make([]byte, 0, estimatedSize)
}
// ... rest of implementation
}
Results:
- Before: 1758 ns/op, 2232 B/op, 6 allocs/op
- After: 1325 ns/op, 1024 B/op, 1 allocs/op
- Improvement: 24% faster, 54% less memory, 83% fewer allocations
2. Canonical Encoding Optimization
Problem: Similar allocation issues as JSON marshal, with additional overhead from tag and content escaping.
Solution:
- Pre-allocate buffer based on estimated size
- Handle nil tags explicitly to avoid unnecessary allocations
- Estimate size accounting for hex encoding and escaping overhead
Code Changes (canonical.go):
func (ev *E) ToCanonical(dst []byte) (b []byte) {
b = dst
if b == nil {
estimatedSize := 5 + 2*len(ev.Pubkey) + 20 + 10 + 100
if ev.Tags != nil {
for _, tag := range *ev.Tags {
for _, elem := range tag.T {
estimatedSize += len(elem)*2 + 10
}
}
}
estimatedSize += len(ev.Content)*2 + 10
b = make([]byte, 0, estimatedSize)
}
// ... rest of implementation
}
Results:
- Before: 1523 ns/op, 1208 B/op, 5 allocs/op
- After: 1272 ns/op, 896 B/op, 1 allocs/op
- Improvement: 16% faster, 26% less memory, 80% fewer allocations
3. Binary Marshal Optimization
Problem: varint.Encode writes one byte at a time, causing many small allocations. Also, nil tags were not handled explicitly.
Solution:
- Add explicit nil tag handling to avoid calling
Len()on nil - Add
MarshalBinaryToByteshelper method that usesbytes.Bufferwith pre-allocated capacity - Estimate buffer size based on event structure
Code Changes (binary.go):
func (ev *E) MarshalBinary(w io.Writer) {
// ... existing code ...
if ev.Tags == nil {
varint.Encode(w, 0)
} else {
varint.Encode(w, uint64(ev.Tags.Len()))
// ... rest of tags encoding
}
// ... rest of implementation
}
func (ev *E) MarshalBinaryToBytes(dst []byte) []byte {
// New helper method with pre-allocated buffer
// ... implementation
}
Results:
- Minimal change to existing
MarshalBinary(nil check optimization) - New
MarshalBinaryToBytesmethod provides better performance when bytes are needed directly
4. Binary Unmarshal Optimization
Problem: Always allocating tags slice even when nTags is 0.
Solution:
- Check if
nTags == 0and setev.Tags = nilinstead of allocating empty slice
Code Changes (binary.go):
func (ev *E) UnmarshalBinary(r io.Reader) (err error) {
// ... existing code ...
if nTags == 0 {
ev.Tags = nil
} else {
ev.Tags = tag.NewSWithCap(int(nTags))
// ... rest of tag unmarshaling
}
// ... rest of implementation
}
Results:
- Avoids unnecessary allocation for events with no tags
Performance Comparison
Small Events (Standard Test Event)
| Operation | Metric | Before | After | Improvement |
|---|---|---|---|---|
| JSON Marshal | Time | 1758 ns/op | 1325 ns/op | 24% faster |
| JSON Marshal | Memory | 2232 B/op | 1024 B/op | 54% less |
| JSON Marshal | Allocations | 6 allocs/op | 1 allocs/op | 83% fewer |
| Canonical | Time | 1523 ns/op | 1272 ns/op | 16% faster |
| Canonical | Memory | 1208 B/op | 896 B/op | 26% less |
| Canonical | Allocations | 5 allocs/op | 1 allocs/op | 80% fewer |
| GetIDBytes | Time | 1739 ns/op | 1552 ns/op | 11% faster |
| GetIDBytes | Memory | 1240 B/op | 928 B/op | 25% less |
| GetIDBytes | Allocations | 6 allocs/op | 2 allocs/op | 67% fewer |
Large Events (20+ Tags, 4KB Content)
| Operation | Metric | Before | After | Improvement |
|---|---|---|---|---|
| JSON Marshal | Time | 19751 ns/op | 17666 ns/op | 11% faster |
| JSON Marshal | Memory | 18616 B/op | 9472 B/op | 49% less |
| JSON Marshal | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |
| Canonical | Time | 19725 ns/op | 17903 ns/op | 9% faster |
| Canonical | Memory | 18616 B/op | 10240 B/op | 45% less |
| Canonical | Allocations | 11 allocs/op | 1 allocs/op | 91% fewer |
Binary Operations
| Operation | Metric | Before | After | Notes |
|---|---|---|---|---|
| Binary Marshal | Time | 347.4 ns/op | 297.2 ns/op | 14% faster |
| Binary Marshal | Allocations | 13 allocs/op | 13 allocs/op | No change (varint limitation) |
| Binary Unmarshal | Time | 990.5 ns/op | 1028 ns/op | Slight regression (nil check overhead) |
| Binary Unmarshal | Allocations | 32 allocs/op | 32 allocs/op | No change (varint limitation) |
Note: Binary operations are limited by the varint package which writes one byte at a time, causing many small allocations. Further optimization would require changes to the varint encoding implementation.
Key Insights
Allocation Reduction
The most significant improvement came from reducing allocations:
- JSON Marshal: Reduced from 6 to 1 allocation (83% reduction)
- Canonical Encoding: Reduced from 5 to 1 allocation (80% reduction)
- Large Events: Reduced from 11 to 1 allocation (91% reduction)
This reduction has cascading benefits:
- Less GC pressure
- Better CPU cache utilization
- Reduced memory bandwidth usage
Buffer Pre-allocation Strategy
Pre-allocating buffers based on EstimateSize() proved highly effective:
- Prevents multiple slice growth operations
- Reduces memory fragmentation
- Improves cache locality
Remaining Optimization Opportunities
-
Varint Encoding: The
varint.Encodefunction writes one byte at a time, causing many small allocations. Optimizing this would require:- Batch encoding into a temporary buffer
- Or refactoring the varint package to support batch writes
-
NostrEscape: While we can't modify the
text.NostrEscapefunction directly, we could:- Pre-allocate destination buffer based on source size estimate
- Use a pool of buffers for repeated operations
-
Tag Marshaling: Tag marshaling could benefit from similar pre-allocation strategies
Recommendations
-
Use Pre-allocated Buffers: When calling
Marshal,ToCanonical, orMarshalBinaryToBytesrepeatedly, consider reusing buffers:buf := make([]byte, 0, ev.EstimateSize()+100) json := ev.Marshal(buf) -
Consider Buffer Pooling: For high-throughput scenarios, implement a buffer pool for frequently used buffer sizes.
-
Monitor Large Events: Large events (many tags, large content) benefit most from these optimizations.
-
Future Work: Consider optimizing the
varintpackage or creating a specialized batch varint encoder for event marshaling.
Conclusion
The optimizations implemented significantly improved encoder performance:
- 24% faster JSON marshaling
- 16% faster canonical encoding
- 54-83% reduction in memory allocations
- 80-91% reduction in allocation count
These improvements will reduce GC pressure and improve overall system throughput, especially under high load conditions. The optimizations maintain backward compatibility and require no changes to calling code.
Benchmark Results
Full benchmark output:
BenchmarkJSONMarshal-12 799773 1325 ns/op 1024 B/op 1 allocs/op
BenchmarkJSONMarshalLarge-12 68712 17666 ns/op 9472 B/op 1 allocs/op
BenchmarkJSONUnmarshal-12 538311 2195 ns/op 824 B/op 24 allocs/op
BenchmarkBinaryMarshal-12 3955064 297.2 ns/op 13 B/op 13 allocs/op
BenchmarkBinaryMarshalLarge-12 673252 1756 ns/op 85 B/op 85 allocs/op
BenchmarkBinaryUnmarshal-12 1000000 1028 ns/op 752 B/op 32 allocs/op
BenchmarkCanonical-12 835960 1272 ns/op 896 B/op 1 allocs/op
BenchmarkCanonicalLarge-12 69620 17903 ns/op 10240 B/op 1 allocs/op
BenchmarkGetIDBytes-12 704444 1552 ns/op 928 B/op 2 allocs/op
BenchmarkRoundTripJSON-12 312724 3673 ns/op 1848 B/op 25 allocs/op
BenchmarkRoundTripBinary-12 857373 1325 ns/op 765 B/op 45 allocs/op
BenchmarkEstimateSize-12 295157716 4.012 ns/op 0 B/op 0 allocs/op
Date
Report generated: 2025-11-02