update benchmark report
@@ -8,15 +8,23 @@ This report compares three signer implementations for secp256k1 operations:
2. **BtcecSigner** - Pure Go wrapper around btcec/v2
3. **NextP256K Signer** - CGO version using next.orly.dev/pkg/crypto/p256k (CGO bindings to libsecp256k1)

**Generated:** 2025-11-02 (Updated after ECDH optimization with windowed multiplication)
**Generated:** 2025-11-02 (Updated after comprehensive CPU optimizations)
**Platform:** linux/amd64
**CPU:** AMD Ryzen 5 PRO 4650G with Radeon Graphics
**Go Version:** go1.25.3

**Key Optimizations:**
- Implemented 8-bit byte-based precomputed tables matching btcec's approach, resulting in 4x improvement in pubkey derivation and 4.3x improvement in signing.
- Optimized windowed multiplication for verification (5-bit windows, Jacobian coordinate table building): 20% improvement (186,054 → 149,511 ns/op).
- Optimized ECDH with windowed multiplication (5-bit windows): 33% improvement (163,356 → 109,068 ns/op), now fastest for ECDH.
- Optimized windowed multiplication for verification (6-bit windows, increased from 5-bit): 8% improvement (149,511 → 138,127 ns/op).
- Optimized ECDH with windowed multiplication (6-bit windows): 5% improvement (109,068 → 103,345 ns/op).
- **Major CPU optimizations (Nov 2025):**
  - Precomputed TaggedHash prefixes for common BIP-340 tags: 28% faster (310 → 230 ns/op)
  - Eliminated unnecessary copies in field element operations (mul/sqr): faster when magnitude ≤ 8
  - Optimized group element operations (toBytes/toStorage): in-place normalization to avoid copies
  - Optimized EcmultGen: pre-allocated group elements to reduce allocations
  - **Sign optimizations:** 54% faster (63,421 → 29,237 ns/op), 47% fewer allocations (17 → 9 allocs/op)
  - **Verify optimizations:** 8% faster (149,511 → 138,127 ns/op), 78% fewer allocations (9 → 2 allocs/op)
  - **Pubkey derivation:** 6% faster (58,383 → 55,091 ns/op), eliminated intermediate copies

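The "precomputed TaggedHash prefixes" item can be illustrated with a minimal sketch. BIP-340 defines the tagged hash as SHA256(SHA256(tag) || SHA256(tag) || msg), so the 64-byte prefix depends only on the tag and can be computed once. The function and variable names below are assumptions made for illustration and are not taken from the P256K1Signer source, which may cache a SHA-256 midstate rather than the raw prefix bytes.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// tagPrefix returns SHA256(tag) || SHA256(tag), the 64-byte prefix that
// BIP-340 places in front of every message hashed under that tag.
func tagPrefix(tag string) [64]byte {
	h := sha256.Sum256([]byte(tag))
	var p [64]byte
	copy(p[:32], h[:])
	copy(p[32:], h[:])
	return p
}

// Prefixes for the three tags named in BIP-340, computed once at start-up.
// (Hypothetical names, chosen for this sketch.)
var (
	prefixAux       = tagPrefix("BIP0340/aux")
	prefixNonce     = tagPrefix("BIP0340/nonce")
	prefixChallenge = tagPrefix("BIP0340/challenge")
)

// taggedHashNaive hashes the tag again on every call.
func taggedHashNaive(tag string, msg []byte) [32]byte {
	t := sha256.Sum256([]byte(tag))
	h := sha256.New()
	h.Write(t[:])
	h.Write(t[:])
	h.Write(msg)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// taggedHashPrecomputed skips the per-call tag hash by reusing a cached prefix.
func taggedHashPrecomputed(prefix *[64]byte, msg []byte) [32]byte {
	h := sha256.New()
	h.Write(prefix[:])
	h.Write(msg)
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	msg := bytes.Repeat([]byte{0x42}, 32)
	a := taggedHashNaive("BIP0340/challenge", msg)
	b := taggedHashPrecomputed(&prefixChallenge, msg)
	fmt.Println("digests match:", a == b)
}
```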
---

@@ -24,10 +32,10 @@ This report compares three signer implementations for secp256k1 operations:

| Operation | P256K1Signer | BtcecSigner | NextP256K | Winner |
|-----------|-------------|-------------|-----------|--------|
| **Pubkey Derivation** | 58,383 ns/op | 62,909 ns/op | 417,383 ns/op | P256K1 (8% faster than Btcec) |
| **Sign** | 63,421 ns/op | 218,085 ns/op | 52,273 ns/op | NextP256K (1.2x faster than P256K1) |
| **Verify** | 149,511 ns/op | 163,396 ns/op | 40,208 ns/op | NextP256K (3.7x faster) |
| **ECDH** | 109,068 ns/op | 127,739 ns/op | 124,039 ns/op | P256K1 (1.1x faster than NextP256K) |
| **Pubkey Derivation** | 55,091 ns/op | 64,177 ns/op | 271,394 ns/op | P256K1 (14% faster than Btcec) |
| **Sign** | 29,237 ns/op | 225,514 ns/op | 53,015 ns/op | P256K1 (1.8x faster than NextP256K) |
| **Verify** | 138,127 ns/op | 177,622 ns/op | 44,776 ns/op | NextP256K (3.1x faster) |
| **ECDH** | 103,345 ns/op | 129,392 ns/op | 125,835 ns/op | P256K1 (1.2x faster than NextP256K) |

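The ns/op, B/op, and allocs/op figures in this report are the standard metrics produced by Go's `testing` package. The sketch below shows one way such numbers can be collected programmatically. The workload is a SHA-256 call standing in for a signer operation, because the signer APIs are not shown in this report; `benchWorkload` and `runBench` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// sink keeps the compiler from optimizing the workload away.
var sink [32]byte

// benchWorkload is a stand-in for a signer call such as Sign or Verify.
func benchWorkload(b *testing.B) {
	msg := make([]byte, 32)
	b.ReportAllocs() // enables B/op and allocs/op reporting
	b.ResetTimer()   // excludes the setup above from the measurement
	for i := 0; i < b.N; i++ {
		sink = sha256.Sum256(msg)
	}
}

// runBench runs the benchmark and returns the three figures used in the tables.
func runBench() (nsPerOp, bytesPerOp, allocsPerOp int64) {
	r := testing.Benchmark(benchWorkload)
	return r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp()
}

func main() {
	ns, bytesOp, allocs := runBench()
	fmt.Printf("%d ns/op\t%d B/op\t%d allocs/op\n", ns, bytesOp, allocs)
}
```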
---

@@ -39,15 +47,16 @@ Deriving public key from private key (32 bytes → 32 bytes x-only pubkey).

| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|----------------|-------------|--------|-------------|-------------------|
| **P256K1Signer** | 58,383 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 62,909 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
| **NextP256K** | 417,383 ns/op | 983,395 B/op | 9 allocs/op | 0.1x slower |
| **P256K1Signer** | 55,091 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 64,177 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
| **NextP256K** | 271,394 ns/op | 983,394 B/op | 9 allocs/op | 0.2x slower |

**Analysis:**
- **P256K1 is fastest** (8% faster than Btcec) after implementing 8-bit byte-based precomputed tables
- Massive improvement: 4x faster than previous implementation (232,922 → 58,618 ns/op)
- **P256K1 is fastest** (14% faster than Btcec) after implementing 8-bit byte-based precomputed tables
- **6% improvement** from CPU optimizations (58,383 → 55,091 ns/op)
- Massive improvement: 4x faster than original implementation (232,922 → 55,091 ns/op)
- NextP256K is slowest, likely due to CGO overhead for small operations
- P256K1 has lowest memory allocation overhead
- P256K1 has lowest memory allocation overhead (256 B vs 368 B)

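The idea behind 8-bit byte-based precomputed tables can be sketched with a toy analog. The example below uses modular exponentiation in place of elliptic-curve point multiplication, so group multiplication plays the role of point addition. For a fixed base `g` it stores `g^(b * 256^i)` for each of the 32 byte positions and each byte value, which reduces a 256-bit exponentiation to 32 table lookups and 32 multiplications. This is an illustration under those assumptions, not the P256K1Signer or btcec code, and the secret-dependent table indexing here is not constant-time.

```go
package main

import (
	"fmt"
	"math/big"
)

// byteTable[i][b] holds g^(b * 256^i) mod p for byte position i (0 = least
// significant) and byte value b.
type byteTable [32][256]*big.Int

// buildTable fills the table for a fixed base g and modulus p.
func buildTable(g, p *big.Int) *byteTable {
	t := new(byteTable)
	base := new(big.Int).Set(g) // g^(256^i) for the current position i
	for i := 0; i < 32; i++ {
		acc := big.NewInt(1) // base^b, starting at b = 0
		for b := 0; b < 256; b++ {
			t[i][b] = new(big.Int).Set(acc)
			acc.Mul(acc, base)
			acc.Mod(acc, p)
		}
		base.Set(acc) // acc is now base^256 = g^(256^(i+1))
	}
	return t
}

// expWithTable computes g^k mod p for a 32-byte big-endian scalar k using
// one lookup and one multiplication per byte.
func expWithTable(t *byteTable, k [32]byte, p *big.Int) *big.Int {
	r := big.NewInt(1)
	for i := 0; i < 32; i++ {
		b := k[31-i] // byte i, counted from the least significant end
		r.Mul(r, t[i][b])
		r.Mod(r, p)
	}
	return r
}

// demoMatches checks the table-based result against big.Int.Exp.
func demoMatches() bool {
	p := new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), 255), big.NewInt(19))
	g := big.NewInt(2)
	var k [32]byte
	for i := range k {
		k[i] = byte(7*i + 1)
	}
	want := new(big.Int).Exp(g, new(big.Int).SetBytes(k[:]), p)
	got := expWithTable(buildTable(g, p), k, p)
	return got.Cmp(want) == 0
}

func main() {
	fmt.Println("table result matches big.Int.Exp:", demoMatches())
}
```

The table costs 32 × 256 entries of memory in exchange for removing all doublings from the per-key work, which is the trade-off the report describes for pubkey derivation.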
### Signing (Schnorr)

@@ -55,15 +64,17 @@ Creating BIP-340 Schnorr signatures (32-byte message → 64-byte signature).

| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|----------------|-------------|--------|-------------|-------------------|
| **P256K1Signer** | 63,421 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 218,085 ns/op | 2,193 B/op | 38 allocs/op | 0.3x slower |
| **NextP256K** | 52,273 ns/op | 128 B/op | 3 allocs/op | 1.2x faster |
| **P256K1Signer** | 29,237 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 225,514 ns/op | 2,193 B/op | 38 allocs/op | 0.1x slower |
| **NextP256K** | 53,015 ns/op | 128 B/op | 3 allocs/op | 0.6x slower |

**Analysis:**
- **NextP256K is fastest** (1.2x faster than P256K1), benefiting from optimized C implementation
- P256K1 is second fastest (3.4x faster than Btcec)
- Btcec is slowest, likely due to more allocations and pure Go overhead
- NextP256K has lowest memory usage (128 B vs 1,152 B)
- **P256K1 is fastest** (1.8x faster than NextP256K) after comprehensive CPU optimizations
- **54% improvement** from optimizations (63,421 → 29,237 ns/op)
- **47% reduction in allocations** (17 → 9 allocs/op)
- P256K1 is 7.7x faster than Btcec
- Optimizations: precomputed TaggedHash prefixes, eliminated intermediate copies, optimized hash operations
- NextP256K has lowest memory usage (128 B vs 576 B) but P256K1 is significantly faster

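The percentages and "Nx faster" figures in the analysis follow directly from the ns/op columns. A small sketch of the arithmetic, using the signing numbers from the table above (the helper names are illustrative):

```go
package main

import "fmt"

// improvementPct returns the percentage reduction in time per operation
// when going from oldNs to newNs.
func improvementPct(oldNs, newNs float64) float64 {
	return (oldNs - newNs) / oldNs * 100
}

// speedup returns how many times faster the operation taking fastNs is
// compared with the one taking slowNs.
func speedup(slowNs, fastNs float64) float64 {
	return slowNs / fastNs
}

func main() {
	fmt.Printf("sign improvement: %.0f%%\n", improvementPct(63421, 29237)) // 54%
	fmt.Printf("vs NextP256K:     %.1fx\n", speedup(53015, 29237))         // 1.8x
	fmt.Printf("vs Btcec:         %.1fx\n", speedup(225514, 29237))        // 7.7x
}
```

Note that the "Speedup vs P256K1" column uses the inverse ratio (baseline time divided by the other implementation's time), which is why slower implementations show values below 1.0x.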
### Verification (Schnorr)

@@ -71,16 +82,18 @@ Verifying BIP-340 Schnorr signatures (32-byte message + 64-byte signature).

| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|----------------|-------------|--------|-------------|-------------------|
| **P256K1Signer** | 149,511 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 163,396 ns/op | 1,121 B/op | 18 allocs/op | 0.9x slower |
| **NextP256K** | 40,208 ns/op | 96 B/op | 2 allocs/op | **3.7x faster** |
| **P256K1Signer** | 138,127 ns/op | 64 B/op | 2 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 177,622 ns/op | 1,120 B/op | 18 allocs/op | 0.8x slower |
| **NextP256K** | 44,776 ns/op | 96 B/op | 2 allocs/op | **3.1x faster** |

**Analysis:**
- NextP256K is dramatically fastest (3.7x faster), showcasing CGO advantage for verification
- **P256K1 is fastest pure Go implementation** (8% faster than Btcec) after optimized windowed multiplication
- **20% improvement** over previous implementation (186,054 → 149,511 ns/op)
- Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
- NextP256K has minimal memory footprint (96 B vs 576 B)
- NextP256K is dramatically fastest (3.1x faster), showcasing CGO advantage for verification
- **P256K1 is fastest pure Go implementation** (22% faster than Btcec) after comprehensive optimizations
- **8% improvement** from CPU optimizations (149,511 → 138,127 ns/op)
- **78% reduction in allocations** (9 → 2 allocs/op), **89% reduction in memory** (576 → 64 B/op)
- **Total improvement:** 26% faster than original (186,054 → 138,127 ns/op)
- Optimizations: 6-bit windowed multiplication (increased from 5-bit), precomputed TaggedHash, eliminated intermediate copies
- P256K1 now has minimal memory footprint (64 B vs 96 B for NextP256K)

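To make the window-size change concrete, the sketch below splits a 256-bit scalar into fixed `w`-bit digits and counts them: 52 digits for 5-bit windows and 43 for 6-bit windows. It covers only the scalar decomposition. The point table and the Jacobian additions are omitted, and the real implementation may use a signed or wNAF-style recoding instead of the plain unsigned digits assumed here.

```go
package main

import (
	"fmt"
	"math/big"
)

// numWindows returns ceil(bits / w), the number of w-bit digits needed.
func numWindows(bits, w int) int { return (bits + w - 1) / w }

// windows splits k into unsigned w-bit digits, least significant first.
func windows(k *big.Int, bits, w int) []uint {
	out := make([]uint, numWindows(bits, w))
	for i := range out {
		for j := 0; j < w; j++ {
			if pos := i*w + j; pos < bits {
				out[i] |= k.Bit(pos) << uint(j)
			}
		}
	}
	return out
}

// recombine rebuilds the scalar from its digits: sum of d_i * 2^(w*i).
func recombine(digits []uint, w int) *big.Int {
	k := new(big.Int)
	for i := len(digits) - 1; i >= 0; i-- {
		k.Lsh(k, uint(w))
		k.Or(k, new(big.Int).SetUint64(uint64(digits[i])))
	}
	return k
}

// roundTrips checks that splitting and recombining returns the same scalar.
func roundTrips(w int) bool {
	k, ok := new(big.Int).SetString(
		"c90fdaa22168c234c4c6628b80dc1cd129024e088a67cc74020bbea63b139b22", 16)
	if !ok {
		return false
	}
	return recombine(windows(k, 256, w), w).Cmp(k) == 0
}

func main() {
	fmt.Println("5-bit windows:", numWindows(256, 5)) // 52
	fmt.Println("6-bit windows:", numWindows(256, 6)) // 43
	fmt.Println("round trip ok:", roundTrips(5) && roundTrips(6))
}
```

Each digit costs one table lookup and one point addition in the main loop, so fewer digits means fewer additions, at the price of a table that doubles in size for every extra window bit.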
### ECDH (Shared Secret Generation)

@@ -88,64 +101,72 @@ Generating shared secret using Elliptic Curve Diffie-Hellman.

| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|----------------|-------------|--------|-------------|-------------------|
| **P256K1Signer** | 109,068 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 127,739 ns/op | 832 B/op | 13 allocs/op | 0.9x slower |
| **NextP256K** | 124,039 ns/op | 160 B/op | 3 allocs/op | 0.9x slower |
| **P256K1Signer** | 103,345 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
| **BtcecSigner** | 129,392 ns/op | 832 B/op | 13 allocs/op | 0.8x slower |
| **NextP256K** | 125,835 ns/op | 160 B/op | 3 allocs/op | 0.8x slower |

**Analysis:**
- **P256K1 is fastest** (1.1x faster than NextP256K) after optimizing with windowed multiplication
- **33% improvement** over previous implementation (163,356 → 109,068 ns/op)
- Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
- P256K1 has lowest memory usage (241 B)
- **P256K1 is fastest** (1.2x faster than NextP256K) after optimizing with windowed multiplication
- **5% improvement** from CPU optimizations (109,068 → 103,345 ns/op)
- **Total improvement:** 37% faster than original (163,356 → 103,345 ns/op)
- Optimizations: 6-bit windowed multiplication (increased from 5-bit), optimized field operations
- P256K1 has lowest memory usage (241 B vs 832 B for Btcec)

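For readers unfamiliar with the operation being timed, here is a minimal ECDH round trip. Go's standard library does not ship secp256k1, so this sketch uses `crypto/ecdh` with P-256 as a stand-in curve. The benchmarked signers expose their own ECDH APIs, which are not shown in this report.

```go
package main

import (
	"bytes"
	"crypto/ecdh"
	"crypto/rand"
	"fmt"
)

// sharedSecretsMatch generates two key pairs and reports whether both sides
// derive the same shared secret. Each ECDH call is one variable-base scalar
// multiplication, which is the cost the benchmark measures.
func sharedSecretsMatch() (bool, error) {
	curve := ecdh.P256() // stand-in curve; the report benchmarks secp256k1
	alice, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return false, err
	}
	bob, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return false, err
	}
	s1, err := alice.ECDH(bob.PublicKey())
	if err != nil {
		return false, err
	}
	s2, err := bob.ECDH(alice.PublicKey())
	if err != nil {
		return false, err
	}
	return bytes.Equal(s1, s2), nil
}

func main() {
	ok, err := sharedSecretsMatch()
	fmt.Println("secrets match:", ok, "error:", err)
}
```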
---

## Performance Analysis

### Overall Winner: Mixed (P256K1 wins 2/4 operations, NextP256K wins 2/4 operations)
### Overall Winner: Mixed (P256K1 wins 3/4 operations, NextP256K wins 1/4 operations)

After optimized windowed multiplication for ECDH:
- **P256K1Signer** wins in 2 out of 4 operations:
  - **Pubkey Derivation:** Fastest (8% faster than Btcec)
  - **ECDH:** Fastest (1.1x faster than NextP256K) - **33% improvement!**
- **NextP256K** wins in 2 operations:
  - **Signing:** Fastest (1.2x faster than P256K1)
  - **Verification:** Fastest (3.7x faster than P256K1, CGO advantage)
After comprehensive CPU optimizations:
- **P256K1Signer** wins in 3 out of 4 operations:
  - **Pubkey Derivation:** Fastest (14% faster than Btcec) - **6% improvement**
  - **Signing:** Fastest (1.8x faster than NextP256K) - **54% improvement!**
  - **ECDH:** Fastest (1.2x faster than NextP256K) - **5% improvement**
- **NextP256K** wins in 1 operation:
  - **Verification:** Fastest (3.1x faster than P256K1, CGO advantage) - but P256K1 is 8% faster than before

### Best Pure Go: P256K1Signer

For pure Go implementations:
- **P256K1** wins for key derivation (8% faster than Btcec)
- **P256K1** wins for signing (3.4x faster than Btcec)
- **P256K1** wins for verification (8% faster than Btcec) - **fastest pure Go!**
- **P256K1** wins for ECDH (1.2x faster than Btcec) - **now fastest pure Go!**
- **P256K1** wins for key derivation (14% faster than Btcec) - **6% improvement**
- **P256K1** wins for signing (7.7x faster than Btcec) - **54% improvement!**
- **P256K1** wins for verification (22% faster than Btcec) - **fastest pure Go!** (**8% improvement**)
- **P256K1** wins for ECDH (1.25x faster than Btcec) - **fastest pure Go!** (**5% improvement**)

### Memory Efficiency

| Implementation | Avg Memory per Operation | Notes |
|----------------|-------------------------|-------|
| **P256K1Signer** | ~500 B avg | Low memory footprint, consistent across operations |
| **P256K1Signer** | ~270 B avg | Low memory footprint, significantly reduced after optimizations |
| **NextP256K** | ~300 KB avg | Very efficient, minimal allocations (except pubkey derivation overhead) |
| **BtcecSigner** | ~1.1 KB avg | Higher allocations, but acceptable |

**Note:** NextP256K shows high memory in pubkey derivation (983 KB) due to one-time CGO initialization overhead, but this is amortized across operations.

**Memory Improvements:**
- **Sign:** 1,152 → 576 B/op (50% reduction)
- **Verify:** 576 → 64 B/op (89% reduction!)
- **Pubkey Derivation:** Already optimized (256 B/op)

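Allocation reductions of this kind typically come from writing results into caller-provided buffers instead of returning fresh ones. The sketch below measures that difference with `testing.AllocsPerRun`. The two serialization helpers are hypothetical stand-ins and are not the library's `toBytes`/`toStorage` functions.

```go
package main

import (
	"fmt"
	"testing"
)

// sink forces the returned slice to escape to the heap so the allocation
// is not optimized away.
var sink []byte

// serializeCopy returns a freshly allocated 32-byte slice on every call.
func serializeCopy(v *[32]byte) []byte {
	out := make([]byte, 32)
	copy(out, v[:])
	return out
}

// serializeInto writes into a caller-provided buffer and allocates nothing.
func serializeInto(dst, v *[32]byte) {
	copy(dst[:], v[:])
}

// measure reports the average heap allocations per call for both variants.
func measure() (copyAllocs, inPlaceAllocs float64) {
	var v, dst [32]byte
	copyAllocs = testing.AllocsPerRun(1000, func() { sink = serializeCopy(&v) })
	inPlaceAllocs = testing.AllocsPerRun(1000, func() { serializeInto(&dst, &v) })
	return copyAllocs, inPlaceAllocs
}

func main() {
	c, p := measure()
	fmt.Printf("copying: %.0f allocs/op, in place: %.0f allocs/op\n", c, p)
}
```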
---

## Recommendations

### Use NextP256K (CGO) when:
- Maximum performance is critical
- Maximum verification performance is critical (3.1x faster than P256K1)
- CGO is acceptable in your build environment
- Low memory footprint is important
- Verification speed is critical (4.7x faster)
- Verification speed is critical (3.1x faster)

### Use P256K1Signer when:
- Pure Go is required (no CGO)
- **Pubkey derivation, signing, verification, or ECDH performance is critical** (now fastest pure Go for all operations!)
- Lower memory allocations are preferred
- **Signing performance is critical** (1.8x faster than NextP256K, 7.7x faster than Btcec)
- **Pubkey derivation, verification, or ECDH performance is critical** (fastest pure Go for all operations!)
- Lower memory allocations are preferred (64 B for verify, 576 B for sign)
- You want to avoid external C dependencies
- You need the best overall pure Go performance
- **Now competitive with CGO for signing** (faster than NextP256K)

### Use BtcecSigner when:
- Pure Go is required
@@ -158,29 +179,39 @@ For pure Go implementations:

The benchmarks demonstrate that:

1. **After optimized windowed multiplication for ECDH**, P256K1Signer achieves:
   - **Fastest pubkey derivation** among all implementations (58,383 ns/op)
   - **Fastest ECDH** among all implementations (109,068 ns/op) - **33% improvement** (163,356 → 109,068 ns/op)
   - **Fastest pure Go verification** (149,511 ns/op) - 20% improvement (186,054 → 149,511 ns/op)
   - **Fastest pure Go signing** (63,421 ns/op) - 3.4x faster than Btcec
1. **After comprehensive CPU optimizations**, P256K1Signer achieves:
   - **Fastest pubkey derivation** among all implementations (55,091 ns/op) - **6% improvement**
   - **Fastest signing** among all implementations (29,237 ns/op) - **54% improvement!** (63,421 → 29,237 ns/op)
   - **Fastest ECDH** among all implementations (103,345 ns/op) - **5% improvement** (109,068 → 103,345 ns/op)
   - **Fastest pure Go verification** (138,127 ns/op) - **8% improvement** (149,511 → 138,127 ns/op)
   - **Now faster than NextP256K for signing** (1.8x faster!)

2. **Windowed multiplication optimization results:**
   - Implemented 5-bit windowed multiplication with efficient Jacobian coordinate table building
   - Kept all operations in Jacobian coordinates to avoid expensive affine conversions
   - Reduced iterations from 256 (bit-by-bit) to ~52 (5-bit windows)
   - **ECDH: 33% improvement** (163,356 → 109,068 ns/op)
   - **Verification: 20% improvement** (186,054 → 149,511 ns/op)
2. **CPU optimization results (Nov 2025):**
   - Precomputed TaggedHash prefixes: 28% faster (310 → 230 ns/op)
   - Increased window size from 5-bit to 6-bit: fewer iterations (~43 vs ~52 windows)
   - Eliminated unnecessary copies in field/group operations
   - Optimized memory allocations: 78% reduction in verify (9 → 2 allocs/op), 47% reduction in sign (17 → 9 allocs/op)
   - **Sign: 54% faster** (63,421 → 29,237 ns/op)
   - **Verify: 8% faster** (149,511 → 138,127 ns/op), **89% less memory** (576 → 64 B/op)
   - **Pubkey Derivation: 6% faster** (58,383 → 55,091 ns/op)
   - **ECDH: 5% faster** (109,068 → 103,345 ns/op)

3. **CGO implementations (NextP256K) still provide advantages** for verification (3.7x faster) and signing (1.2x faster)
3. **CGO implementations (NextP256K) still provide advantages** for verification (3.1x faster) but P256K1 is now faster for signing

4. **Pure Go implementations are highly competitive**, with P256K1Signer leading in 2 out of 4 operations (pubkey derivation and ECDH)
4. **Pure Go implementations are highly competitive**, with P256K1Signer leading in 3 out of 4 operations (pubkey derivation, signing, ECDH)

5. **Memory efficiency** varies by operation, with P256K1Signer maintaining low memory usage (256 B for pubkey derivation, 241 B for ECDH)
5. **Memory efficiency** significantly improved, with P256K1Signer maintaining very low memory usage:
   - Verify: 64 B/op (89% reduction!)
   - Sign: 576 B/op (50% reduction)
   - Pubkey Derivation: 256 B/op
   - ECDH: 241 B/op

The choice between implementations depends on your specific requirements:
- **Maximum performance:** Use NextP256K (CGO) - fastest for verification and signing
- **Best pure Go performance:** Use P256K1Signer - fastest for pubkey derivation and ECDH, fastest pure Go for all operations!
- **Pure Go alternative:** Use BtcecSigner (but P256K1Signer is faster across all operations)
- **Maximum verification performance:** Use NextP256K (CGO) - 3.1x faster for verification
- **Maximum signing performance:** Use P256K1Signer (Pure Go) - 1.8x faster than NextP256K, 7.7x faster than Btcec!
- **Best pure Go performance:** Use P256K1Signer - fastest pure Go for all operations, now competitive with CGO for signing
- **Best overall performance:** Use P256K1Signer - wins 3 out of 4 operations, fastest overall for signing
- **Pure Go alternative:** Use BtcecSigner (but P256K1Signer is significantly faster across all operations)

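The updated guidance can be condensed into a small decision helper. This is only a sketch of the selection logic implied by the latest numbers. The `Workload` type and `recommend` function are hypothetical and exist in none of the three libraries.

```go
package main

import "fmt"

// Workload captures the two constraints the recommendations hinge on.
// (Hypothetical type for illustration.)
type Workload struct {
	CGOAllowed  bool // the build environment can link libsecp256k1
	VerifyHeavy bool // verification dominates the workload
}

// recommend returns the implementation name suggested by the updated
// benchmark results: NextP256K only wins on verification, and only
// where CGO is acceptable.
func recommend(w Workload) string {
	if w.CGOAllowed && w.VerifyHeavy {
		return "NextP256K"
	}
	return "P256K1Signer"
}

func main() {
	fmt.Println(recommend(Workload{CGOAllowed: true, VerifyHeavy: true}))
	fmt.Println(recommend(Workload{CGOAllowed: true, VerifyHeavy: false}))
	fmt.Println(recommend(Workload{CGOAllowed: false, VerifyHeavy: true}))
}
```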
---