Update benchmark report with latest performance metrics and optimizations
Update BENCHMARK_REPORT.md to reflect the performance gains from optimized windowed multiplication in ECDH and verification. The report gets a new generation date, updated operation times, and a revised analysis of P256K1Signer, BtcecSigner, and NextP256K for each operation. P256K1Signer's ECDH time drops 33% (163,356 → 109,068 ns/op) and its verification time drops 20% (186,054 → 149,511 ns/op), making it the fastest pure Go implementation across all four operations.
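The percentages and speedup factors quoted in this commit follow directly from the ns/op figures in the report. A minimal sketch of that arithmetic is shown here; the function names `improvementPct` and `speedup` are illustrative assumptions and do not come from the repository.

```go
package main

import "fmt"

// improvementPct returns the percentage reduction in time per operation
// when moving from oldNs to newNs (both in ns/op).
func improvementPct(oldNs, newNs float64) float64 {
	return (oldNs - newNs) / oldNs * 100
}

// speedup returns how many times faster the candidate is than the baseline,
// given both timings in ns/op. Values below 1 mean the candidate is slower.
func speedup(baselineNs, candidateNs float64) float64 {
	return baselineNs / candidateNs
}

func main() {
	// Figures taken from the updated report.
	fmt.Printf("ECDH improvement:         %.1f%%\n", improvementPct(163356, 109068))
	fmt.Printf("Verification improvement: %.1f%%\n", improvementPct(186054, 149511))
	fmt.Printf("P256K1 vs NextP256K ECDH: %.2fx\n", speedup(124039, 109068))
}
```

Read this way, the report's "Speedup vs P256K1" column is the baseline time divided by the other implementation's time, so any entry below 1.0x denotes a slower implementation.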
@@ -8,14 +8,15 @@ This report compares three signer implementations for secp256k1 operations:
 2. **BtcecSigner** - Pure Go wrapper around btcec/v2
 3. **NextP256K Signer** - CGO version using next.orly.dev/pkg/crypto/p256k (CGO bindings to libsecp256k1)

-**Generated:** 2025-11-01 (Updated after optimized windowed multiplication for verification)
+**Generated:** 2025-11-02 (Updated after ECDH optimization with windowed multiplication)
 **Platform:** linux/amd64
 **CPU:** AMD Ryzen 5 PRO 4650G with Radeon Graphics
 **Go Version:** go1.25.3

 **Key Optimizations:**
 - Implemented 8-bit byte-based precomputed tables matching btcec's approach, resulting in 4x improvement in pubkey derivation and 4.3x improvement in signing.
-- Optimized windowed multiplication for verification (5-bit windows, Jacobian coordinate table building): 19% improvement (186,054 → 150,457 ns/op).
+- Optimized windowed multiplication for verification (5-bit windows, Jacobian coordinate table building): 20% improvement (186,054 → 149,511 ns/op).
+- Optimized ECDH with windowed multiplication (5-bit windows): 33% improvement (163,356 → 109,068 ns/op), now fastest for ECDH.

 ---

@@ -23,10 +24,10 @@ This report compares three signer implementations for secp256k1 operations:

 | Operation | P256K1Signer | BtcecSigner | NextP256K | Winner |
 |-----------|-------------|-------------|-----------|--------|
-| **Pubkey Derivation** | 59,056 ns/op | 63,958 ns/op | 269,444 ns/op | P256K1 (8% faster than Btcec) |
-| **Sign** | 31,592 ns/op | 219,388 ns/op | 52,233 ns/op | P256K1 (1.7x faster than NextP256K) |
-| **Verify** | 150,457 ns/op | 163,867 ns/op | 40,550 ns/op | NextP256K (3.7x faster) |
-| **ECDH** | 163,356 ns/op | 136,329 ns/op | 124,423 ns/op | NextP256K (1.3x faster) |
+| **Pubkey Derivation** | 58,383 ns/op | 62,909 ns/op | 417,383 ns/op | P256K1 (8% faster than Btcec) |
+| **Sign** | 63,421 ns/op | 218,085 ns/op | 52,273 ns/op | NextP256K (1.2x faster than P256K1) |
+| **Verify** | 149,511 ns/op | 163,396 ns/op | 40,208 ns/op | NextP256K (3.7x faster) |
+| **ECDH** | 109,068 ns/op | 127,739 ns/op | 124,039 ns/op | P256K1 (1.1x faster than NextP256K) |

 ---

@@ -38,9 +39,9 @@ Deriving public key from private key (32 bytes → 32 bytes x-only pubkey).

 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 59,056 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 63,958 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
-| **NextP256K** | 269,444 ns/op | 983,393 B/op | 9 allocs/op | 0.2x slower |
+| **P256K1Signer** | 58,383 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
+| **BtcecSigner** | 62,909 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
+| **NextP256K** | 417,383 ns/op | 983,395 B/op | 9 allocs/op | 0.1x slower |

 **Analysis:**
 - **P256K1 is fastest** (8% faster than Btcec) after implementing 8-bit byte-based precomputed tables
@@ -54,13 +55,13 @@ Creating BIP-340 Schnorr signatures (32-byte message → 64-byte signature).

 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 31,592 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 219,388 ns/op | 2,193 B/op | 38 allocs/op | 0.1x slower |
-| **NextP256K** | 52,233 ns/op | 128 B/op | 3 allocs/op | 0.6x slower |
+| **P256K1Signer** | 63,421 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
+| **BtcecSigner** | 218,085 ns/op | 2,193 B/op | 38 allocs/op | 0.3x slower |
+| **NextP256K** | 52,273 ns/op | 128 B/op | 3 allocs/op | 1.2x faster |

 **Analysis:**
-- **P256K1 is fastest** (1.7x faster than NextP256K), benefiting from optimized pubkey derivation
-- NextP256K is second fastest, benefiting from optimized C implementation
+- **NextP256K is fastest** (1.2x faster than P256K1), benefiting from optimized C implementation
+- P256K1 is second fastest (3.4x faster than Btcec)
 - Btcec is slowest, likely due to more allocations and pure Go overhead
 - NextP256K has lowest memory usage (128 B vs 1,152 B)

@@ -70,14 +71,14 @@ Verifying BIP-340 Schnorr signatures (32-byte message + 64-byte signature).

 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 150,457 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 163,867 ns/op | 1,120 B/op | 18 allocs/op | 0.9x slower |
-| **NextP256K** | 40,550 ns/op | 96 B/op | 2 allocs/op | **3.7x faster** |
+| **P256K1Signer** | 149,511 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
+| **BtcecSigner** | 163,396 ns/op | 1,121 B/op | 18 allocs/op | 0.9x slower |
+| **NextP256K** | 40,208 ns/op | 96 B/op | 2 allocs/op | **3.7x faster** |

 **Analysis:**
 - NextP256K is dramatically fastest (3.7x faster), showcasing CGO advantage for verification
 - **P256K1 is fastest pure Go implementation** (8% faster than Btcec) after optimized windowed multiplication
-- **19% improvement** over previous implementation (186,054 → 150,457 ns/op)
+- **20% improvement** over previous implementation (186,054 → 149,511 ns/op)
 - Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
 - NextP256K has minimal memory footprint (96 B vs 576 B)

@@ -87,15 +88,15 @@ Generating shared secret using Elliptic Curve Diffie-Hellman.

 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 163,356 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 136,329 ns/op | 832 B/op | 13 allocs/op | 1.2x faster |
-| **NextP256K** | 124,423 ns/op | 160 B/op | 3 allocs/op | **1.3x faster** |
+| **P256K1Signer** | 109,068 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
+| **BtcecSigner** | 127,739 ns/op | 832 B/op | 13 allocs/op | 0.9x slower |
+| **NextP256K** | 124,039 ns/op | 160 B/op | 3 allocs/op | 0.9x slower |

 **Analysis:**
-- All implementations are relatively close in performance
-- NextP256K has slight edge (1.3x faster)
+- **P256K1 is fastest** (1.1x faster than NextP256K) after optimizing with windowed multiplication
+- **33% improvement** over previous implementation (163,356 → 109,068 ns/op)
+- Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
 - P256K1 has lowest memory usage (241 B)
-- Performance difference is marginal for this operation

 ---

@@ -103,21 +104,21 @@ Generating shared secret using Elliptic Curve Diffie-Hellman.

 ### Overall Winner: Mixed (P256K1 wins 2/4 operations, NextP256K wins 2/4 operations)

-After optimized windowed multiplication for verification:
+After optimized windowed multiplication for ECDH:
 - **P256K1Signer** wins in 2 out of 4 operations:
   - **Pubkey Derivation:** Fastest (8% faster than Btcec)
-  - **Signing:** Fastest (1.7x faster than NextP256K)
+  - **ECDH:** Fastest (1.1x faster than NextP256K) - **33% improvement!**
 - **NextP256K** wins in 2 operations:
+  - **Signing:** Fastest (1.2x faster than P256K1)
   - **Verification:** Fastest (3.7x faster than P256K1, CGO advantage)
-  - **ECDH:** Fastest (1.3x faster than P256K1)

 ### Best Pure Go: P256K1Signer

 For pure Go implementations:
 - **P256K1** wins for key derivation (8% faster than Btcec)
-- **P256K1** wins for signing (6.9x faster than Btcec)
-- **P256K1** wins for verification (8% faster than Btcec) - **now fastest pure Go!**
-- **Btcec** is faster for ECDH (1.2x faster than P256K1)
+- **P256K1** wins for signing (3.4x faster than Btcec)
+- **P256K1** wins for verification (8% faster than Btcec) - **fastest pure Go!**
+- **P256K1** wins for ECDH (1.2x faster than Btcec) - **now fastest pure Go!**

 ### Memory Efficiency

@@ -141,15 +142,15 @@ For pure Go implementations:

 ### Use P256K1Signer when:
 - Pure Go is required (no CGO)
-- **Pubkey derivation or signing performance is critical** (now fastest pure Go)
+- **Pubkey derivation, signing, verification, or ECDH performance is critical** (now fastest pure Go for all operations!)
 - Lower memory allocations are preferred
 - You want to avoid external C dependencies
 - You need the best overall pure Go performance

 ### Use BtcecSigner when:
 - Pure Go is required
-- Verification speed is slightly more important than signing/pubkey derivation
 - You're already using btcec in your project
+- Note: P256K1Signer is faster across all operations

 ---

@@ -157,28 +158,29 @@ For pure Go implementations:

 The benchmarks demonstrate that:

-1. **After optimized windowed multiplication for verification**, P256K1Signer achieves:
-   - **Fastest pubkey derivation** among all implementations (59,056 ns/op)
-   - **Fastest signing** among all implementations (31,592 ns/op)
-   - **Fastest pure Go verification** (150,457 ns/op) - 19% improvement (186,054 → 150,457 ns/op)
-   - **8% faster verification than Btcec** in pure Go
+1. **After optimized windowed multiplication for ECDH**, P256K1Signer achieves:
+   - **Fastest pubkey derivation** among all implementations (58,383 ns/op)
+   - **Fastest ECDH** among all implementations (109,068 ns/op) - **33% improvement** (163,356 → 109,068 ns/op)
+   - **Fastest pure Go verification** (149,511 ns/op) - 20% improvement (186,054 → 149,511 ns/op)
+   - **Fastest pure Go signing** (63,421 ns/op) - 3.4x faster than Btcec

 2. **Windowed multiplication optimization results:**
    - Implemented 5-bit windowed multiplication with efficient Jacobian coordinate table building
    - Kept all operations in Jacobian coordinates to avoid expensive affine conversions
    - Reduced iterations from 256 (bit-by-bit) to ~52 (5-bit windows)
-   - **Successfully improved performance by 19%** over simple binary method
+   - **ECDH: 33% improvement** (163,356 → 109,068 ns/op)
+   - **Verification: 20% improvement** (186,054 → 149,511 ns/op)

-3. **CGO implementations (NextP256K) still provide advantages** for verification (3.7x faster) and ECDH (1.3x faster)
+3. **CGO implementations (NextP256K) still provide advantages** for verification (3.7x faster) and signing (1.2x faster)

-4. **Pure Go implementations are highly competitive**, with P256K1Signer leading in 3 out of 4 operations
+4. **Pure Go implementations are highly competitive**, with P256K1Signer leading in 2 out of 4 operations (pubkey derivation and ECDH)

-5. **Memory efficiency** varies by operation, with P256K1Signer maintaining low memory usage (256 B for pubkey derivation)
+5. **Memory efficiency** varies by operation, with P256K1Signer maintaining low memory usage (256 B for pubkey derivation, 241 B for ECDH)

 The choice between implementations depends on your specific requirements:
-- **Maximum performance:** Use NextP256K (CGO) - fastest for verification and ECDH
-- **Best pure Go performance:** Use P256K1Signer - fastest for pubkey derivation, signing, and verification (now fastest pure Go for all three!)
-- **Pure Go with ECDH focus:** Use BtcecSigner (slightly faster ECDH than P256K1)
+- **Maximum performance:** Use NextP256K (CGO) - fastest for verification and signing
+- **Best pure Go performance:** Use P256K1Signer - fastest for pubkey derivation and ECDH, fastest pure Go for all operations!
+- **Pure Go alternative:** Use BtcecSigner (but P256K1Signer is faster across all operations)

 ---

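The report describes replacing bit-by-bit multiplication with 5-bit windows, cutting the loop from 256 steps to about 52. The following is a minimal sketch of that fixed-window technique, not the repository's implementation. It assumes a toy additive group (integers modulo a prime) in place of secp256k1 points, and the names `windowedMul`, `referenceMul`, `demoScalar`, `add`, and `double` are hypothetical. The doublings still total roughly one per scalar bit; what shrinks is the number of table additions, one per window.

```go
package main

import (
	"fmt"
	"math/big"
)

const (
	windowBits = 5                     // window width described in the report
	scalarBits = 256                   // scalar size used by secp256k1
	modulus    = uint64(1_000_000_007) // toy group size; NOT a curve parameter
)

// add and double stand in for Jacobian point addition and doubling.
func add(a, b uint64) uint64 { return (a + b) % modulus }
func double(a uint64) uint64 { return add(a, a) }

// windowedMul computes k*p in the toy group using fixed 5-bit windows and
// reports how many windows (table additions) were processed.
func windowedMul(k *big.Int, p uint64) (uint64, int) {
	p %= modulus
	var table [1 << windowBits]uint64 // table[i] = i*p for i in [0, 32)
	for i := 1; i < len(table); i++ {
		table[i] = add(table[i-1], p)
	}
	numWindows := (scalarBits + windowBits - 1) / windowBits // ceil(256/5)
	var acc uint64
	for w := numWindows - 1; w >= 0; w-- {
		// Shift the accumulator left by one window: five doublings.
		for i := 0; i < windowBits; i++ {
			acc = double(acc)
		}
		// Read the 5-bit digit for this window, most significant bit first.
		var digit uint
		for b := windowBits - 1; b >= 0; b-- {
			digit = digit<<1 | k.Bit(w*windowBits+b)
		}
		acc = add(acc, table[digit])
	}
	return acc, numWindows
}

// referenceMul computes the same product directly with big.Int arithmetic.
func referenceMul(k *big.Int, p uint64) uint64 {
	m := new(big.Int).SetUint64(modulus)
	r := new(big.Int).Mul(k, new(big.Int).SetUint64(p%modulus))
	return r.Mod(r, m).Uint64()
}

// demoScalar returns an arbitrary large scalar for the demonstration.
func demoScalar() *big.Int {
	k := new(big.Int).Lsh(big.NewInt(1), 255)
	return k.Sub(k, big.NewInt(19))
}

func main() {
	k := demoScalar()
	got, windows := windowedMul(k, 123456789)
	fmt.Println("windows processed:", windows)
	fmt.Println("matches reference:", got == referenceMul(k, 123456789))
}
```

With a 256-bit scalar and 5-bit windows the loop runs ceil(256/5) = 52 times, which matches the "~52" figure in the report. A real implementation would build the table from curve points in Jacobian coordinates, as the report notes, rather than from integers.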