This commit updates the BENCHMARK_REPORT.md to reflect the latest performance improvements following the implementation of optimized windowed multiplication for ECDH and verification. Key changes include a new generation date, updated operation times, and a detailed analysis of the performance of P256K1Signer, BtcecSigner, and NextP256K across various operations. Notably, P256K1Signer now shows significant improvements in ECDH (33% faster) and verification (20% faster), establishing it as the fastest pure Go implementation across all operations.
Benchmark Comparison Report
Signer Implementation Comparison
This report compares three signer implementations for secp256k1 operations:
- P256K1Signer - This repository's new port from Bitcoin Core secp256k1 (pure Go)
- BtcecSigner - Pure Go wrapper around btcec/v2
- NextP256K Signer - CGO version using next.orly.dev/pkg/crypto/p256k (CGO bindings to libsecp256k1)
Generated: 2025-11-02 (Updated after ECDH optimization with windowed multiplication)
Platform: linux/amd64
CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics
Go Version: go1.25.3
Key Optimizations:
- Implemented 8-bit byte-based precomputed tables matching btcec's approach, resulting in 4x improvement in pubkey derivation and 4.3x improvement in signing.
- Optimized windowed multiplication for verification (5-bit windows, Jacobian coordinate table building): 20% improvement (186,054 → 149,511 ns/op).
- Optimized ECDH with windowed multiplication (5-bit windows): 33% improvement (163,356 → 109,068 ns/op), now fastest for ECDH.
Summary Results
| Operation | P256K1Signer | BtcecSigner | NextP256K | Winner |
|---|---|---|---|---|
| Pubkey Derivation | 58,383 ns/op | 62,909 ns/op | 417,383 ns/op | P256K1 (8% faster than Btcec) |
| Sign | 63,421 ns/op | 218,085 ns/op | 52,273 ns/op | NextP256K (1.2x faster than P256K1) |
| Verify | 149,511 ns/op | 163,396 ns/op | 40,208 ns/op | NextP256K (3.7x faster) |
| ECDH | 109,068 ns/op | 127,739 ns/op | 124,039 ns/op | P256K1 (1.1x faster than NextP256K) |
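The "Nx faster" and percentage figures in this report follow from simple ratios of the ns/op values. The sketch below shows one way to reproduce them from the numbers in the table above; the helper names `speedup` and `improvementPct` are hypothetical and not part of the benchmark suite.

```go
package main

import "fmt"

// speedup reports how many times faster the quicker run is than the slower one.
func speedup(slowerNs, fasterNs float64) float64 {
	return slowerNs / fasterNs
}

// improvementPct reports the reduction in time per operation as a percentage.
func improvementPct(beforeNs, afterNs float64) float64 {
	return (beforeNs - afterNs) / beforeNs * 100
}

func main() {
	// Figures copied from this report (ns/op).
	fmt.Printf("Sign, NextP256K vs P256K1:   %.1fx\n", speedup(63421, 52273))
	fmt.Printf("Verify, NextP256K vs P256K1: %.1fx\n", speedup(149511, 40208))
	fmt.Printf("ECDH, before vs after:       %.0f%%\n", improvementPct(163356, 109068))
	fmt.Printf("Verify, before vs after:     %.0f%%\n", improvementPct(186054, 149511))
}
```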
Detailed Results
Public Key Derivation
Deriving public key from private key (32 bytes → 32 bytes x-only pubkey).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|---|---|---|---|---|
| P256K1Signer | 58,383 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
| BtcecSigner | 62,909 ns/op | 368 B/op | 7 allocs/op | 0.9x (slower) |
| NextP256K | 417,383 ns/op | 983,395 B/op | 9 allocs/op | 0.1x (slower) |
Analysis:
- P256K1 is fastest (8% faster than Btcec) after implementing 8-bit byte-based precomputed tables
- Massive improvement: 4x faster than previous implementation (232,922 → 58,618 ns/op)
- NextP256K is slowest, likely due to CGO overhead for small operations
- P256K1 has lowest memory allocation overhead
Signing (Schnorr)
Creating BIP-340 Schnorr signatures (32-byte message → 64-byte signature).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|---|---|---|---|---|
| P256K1Signer | 63,421 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
| BtcecSigner | 218,085 ns/op | 2,193 B/op | 38 allocs/op | 0.3x (slower) |
| NextP256K | 52,273 ns/op | 128 B/op | 3 allocs/op | 1.2x faster |
Analysis:
- NextP256K is fastest (1.2x faster than P256K1), benefiting from optimized C implementation
- P256K1 is second fastest (3.4x faster than Btcec)
- Btcec is slowest, likely due to more allocations and pure Go overhead
- NextP256K has lowest memory usage (128 B vs 1,152 B)
Verification (Schnorr)
Verifying BIP-340 Schnorr signatures (32-byte message + 64-byte signature).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|---|---|---|---|---|
| P256K1Signer | 149,511 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
| BtcecSigner | 163,396 ns/op | 1,121 B/op | 18 allocs/op | 0.9x (slower) |
| NextP256K | 40,208 ns/op | 96 B/op | 2 allocs/op | 3.7x faster |
Analysis:
- NextP256K is fastest by a wide margin (3.7x faster than P256K1), showing the advantage of the C library for verification
- P256K1 is fastest pure Go implementation (8% faster than Btcec) after optimized windowed multiplication
- 20% improvement over previous implementation (186,054 → 149,511 ns/op)
- Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
- NextP256K has minimal memory footprint (96 B vs 576 B)
ECDH (Shared Secret Generation)
Generating shared secret using Elliptic Curve Diffie-Hellman.
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
|---|---|---|---|---|
| P256K1Signer | 109,068 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
| BtcecSigner | 127,739 ns/op | 832 B/op | 13 allocs/op | 0.9x (slower) |
| NextP256K | 124,039 ns/op | 160 B/op | 3 allocs/op | 0.9x (slower) |
Analysis:
- P256K1 is fastest (1.1x faster than NextP256K) after optimizing with windowed multiplication
- 33% improvement over previous implementation (163,356 → 109,068 ns/op)
- Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
- P256K1 has lowest memory usage (241 B)
Performance Analysis
Overall Winner: Mixed (P256K1 wins 2/4 operations, NextP256K wins 2/4 operations)
After optimized windowed multiplication for ECDH:
- P256K1Signer wins in 2 out of 4 operations:
  - Pubkey Derivation: Fastest (8% faster than Btcec)
  - ECDH: Fastest (1.1x faster than NextP256K) - 33% improvement!
- NextP256K wins in 2 operations:
  - Signing: Fastest (1.2x faster than P256K1)
  - Verification: Fastest (3.7x faster than P256K1, CGO advantage)
Best Pure Go: P256K1Signer
For pure Go implementations:
- P256K1 wins for key derivation (8% faster than Btcec)
- P256K1 wins for signing (3.4x faster than Btcec)
- P256K1 wins for verification (8% faster than Btcec) - fastest pure Go!
- P256K1 wins for ECDH (1.2x faster than Btcec) - now fastest pure Go!
Memory Efficiency
| Implementation | Avg Memory per Operation | Notes |
|---|---|---|
| P256K1Signer | ~500 B avg | Low memory footprint, consistent across operations |
| NextP256K | ~246 KB avg (~128 B excluding pubkey derivation) | Minimal allocations for signing, verification, and ECDH; the average is dominated by pubkey derivation (983,395 B/op) |
| BtcecSigner | ~1.1 KB avg | Higher allocations, but acceptable |
Note: NextP256K shows high memory in pubkey derivation (983 KB) due to one-time CGO initialization overhead, but this is amortized across operations.
Recommendations
Use NextP256K (CGO) when:
- Maximum performance is critical
- CGO is acceptable in your build environment
- Low memory footprint is important
- Verification speed is critical (3.7x faster than P256K1Signer)
Use P256K1Signer when:
- Pure Go is required (no CGO)
- Pubkey derivation, signing, verification, or ECDH performance is critical (now fastest pure Go for all operations!)
- Lower memory allocations are preferred
- You want to avoid external C dependencies
- You need the best overall pure Go performance
Use BtcecSigner when:
- Pure Go is required
- You're already using btcec in your project
- Note: P256K1Signer is faster across all operations
Conclusion
The benchmarks demonstrate that:
- After optimized windowed multiplication for ECDH, P256K1Signer achieves:
  - Fastest pubkey derivation among all implementations (58,383 ns/op)
  - Fastest ECDH among all implementations (109,068 ns/op) - 33% improvement (163,356 → 109,068 ns/op)
  - Fastest pure Go verification (149,511 ns/op) - 20% improvement (186,054 → 149,511 ns/op)
  - Fastest pure Go signing (63,421 ns/op) - 3.4x faster than Btcec
- Windowed multiplication optimization results:
  - Implemented 5-bit windowed multiplication with efficient Jacobian coordinate table building
  - Kept all operations in Jacobian coordinates to avoid expensive affine conversions
  - Reduced iterations from 256 (bit-by-bit) to ~52 (5-bit windows)
  - ECDH: 33% improvement (163,356 → 109,068 ns/op)
  - Verification: 20% improvement (186,054 → 149,511 ns/op)
- CGO implementations (NextP256K) still provide advantages for verification (3.7x faster) and signing (1.2x faster)
- Pure Go implementations are highly competitive, with P256K1Signer leading in 2 out of 4 operations (pubkey derivation and ECDH)
- Memory efficiency varies by operation, with P256K1Signer maintaining low memory usage (256 B for pubkey derivation, 241 B for ECDH)
The choice between implementations depends on your specific requirements:
- Maximum performance: Use NextP256K (CGO) - fastest for verification and signing
- Best pure Go performance: Use P256K1Signer - fastest for pubkey derivation and ECDH, fastest pure Go for all operations!
- Pure Go alternative: Use BtcecSigner (but P256K1Signer is faster across all operations)
Running the Benchmarks
To reproduce these benchmarks:
# Run all benchmarks
CGO_ENABLED=1 go test -tags=cgo ./bench -bench=. -benchmem
# Run specific operation
CGO_ENABLED=1 go test -tags=cgo ./bench -bench=BenchmarkSign
# Run specific implementation
CGO_ENABLED=1 go test -tags=cgo ./bench -bench='Benchmark.*_P256K1'
Note: All benchmarks require CGO to be enabled (CGO_ENABLED=1) and the cgo build tag.
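For readers unfamiliar with how the ns/op, B/op, and allocs/op columns are produced, the sketch below measures a single operation with Go's standard `testing.Benchmark` helper. It is a stand-alone illustration, not a file from the `./bench` directory: `benchOp` is a hypothetical stand-in that hashes a 32-byte message with SHA-256 in place of a real signer call, and `measure` is likewise an assumed helper name.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchOp is a placeholder workload; a real benchmark would call a signer here.
func benchOp(msg []byte) [32]byte {
	return sha256.Sum256(msg)
}

// measure runs op under the standard benchmark loop and records allocations,
// which is what the -benchmem flag enables for `go test` benchmarks.
func measure(op func()) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			op()
		}
	})
}

func main() {
	msg := make([]byte, 32)
	r := measure(func() { _ = benchOp(msg) })
	fmt.Printf("%d ns/op\t%d B/op\t%d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

Results from a harness like this vary with CPU, Go version, and system load, so comparisons are only meaningful between runs on the same machine.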