Benchmark Report: p256k1 Implementation Comparison
This report compares performance of different secp256k1 implementations:
- Pure Go - p256k1 with assembly disabled (baseline)
- AVX2/ASM - p256k1 with x86-64 assembly enabled (scalar and field operations)
- libsecp256k1 - Bitcoin Core's C library via purego (no CGO)
- Default - p256k1 with automatic feature detection
Test Environment
- Platform: Linux 6.8.0 (amd64)
- CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (12 threads)
- Go Version: go1.23+
- Date: 2025-11-28
High-Level Operation Benchmarks
| Operation | Pure Go | AVX2 | libsecp256k1 | Default |
|---|---|---|---|---|
| Pubkey Derivation | 56.09 µs | 55.72 µs | 20.84 µs | 54.03 µs |
| Sign | 56.18 µs | 56.00 µs | 39.92 µs | 28.92 µs |
| Verify | 144.01 µs | 139.55 µs | 42.10 µs | 139.22 µs |
| ECDH | 107.80 µs | 106.30 µs | N/A | 104.53 µs |
Relative Performance (vs Pure Go)
| Operation | AVX2 | libsecp256k1 |
|---|---|---|
| Pubkey Derivation | 1.01x faster | 2.69x faster |
| Sign | 1.00x | 1.41x faster |
| Verify | 1.03x faster | 3.42x faster |
| ECDH | 1.01x faster | N/A |
Scalar Operation Benchmarks (Isolated)
These benchmarks measure the individual scalar arithmetic operations in isolation:
| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Scalar Multiply | 46.52 ns | 30.49 ns | 1.53x faster |
| Scalar Add | 5.29 ns | 4.69 ns | 1.13x faster |
The x86-64 scalar multiplication shows a 53% improvement over pure Go, demonstrating the effectiveness of the optimized 512-bit reduction algorithm.
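For context, the isolated numbers come from ordinary Go micro-benchmarks. The sketch below shows the general pattern on a stand-in routine (a plain 4×64-limb schoolbook multiply producing the 512-bit product that the reduction described later consumes); it is illustrative only, not the repository's actual benchmark code.

```go
package bench

import (
	"math/bits"
	"testing"
)

// mul256 is a stand-in 4x64-limb schoolbook multiply: it produces the raw
// 512-bit product that a scalar reduction (described later) would consume.
func mul256(a, b *[4]uint64) (r [8]uint64) {
	for i := 0; i < 4; i++ {
		var carry uint64
		for j := 0; j < 4; j++ {
			hi, lo := bits.Mul64(a[i], b[j])
			var c1, c2 uint64
			lo, c1 = bits.Add64(lo, carry, 0)
			r[i+j], c2 = bits.Add64(r[i+j], lo, 0)
			carry = hi + c1 + c2 // cannot overflow: the full 128-bit row sum fits
		}
		r[i+4] = carry
	}
	return
}

// BenchmarkMul256 times one multiplication per iteration with fixed inputs,
// so only the arithmetic itself is measured (0 B/op, 0 allocs/op).
func BenchmarkMul256(b *testing.B) {
	x := [4]uint64{0xBFD25E8CD0364141, 0xBAAEDCE6AF48A03B, ^uint64(1), ^uint64(0)}
	y := [4]uint64{5, 6, 7, 8}
	var r [8]uint64
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		r = mul256(&x, &y)
	}
	_ = r // keep the result live so the call is not optimized away
}
```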
Field Operation Benchmarks (Isolated)
Field operations (modular arithmetic over the secp256k1 prime field) dominate elliptic curve computations. These benchmarks measure the assembly-optimized field multiplication and squaring:
| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Field Multiply | 27.5 ns | 26.0 ns | 1.06x faster |
| Field Square | 27.5 ns | 21.7 ns | 1.27x faster |
The field squaring assembly shows a roughly 27% improvement because it exploits the symmetry of squaring (computing 2·a[i]·a[j] once instead of a[i]·a[j] + a[j]·a[i]).
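The saving is easiest to see on a toy two-limb example: a general 128×128-bit multiply needs four 64×64 products, while a square needs only three, because the cross product is computed once and doubled. A minimal sketch of the idea (not the library's 5×52 code, which applies the same trick across five limbs):

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

// sqr2 squares a 2x64-limb value using 3 base multiplications instead of 4:
// (a1*2^64 + a0)^2 = a1^2*2^128 + 2*a0*a1*2^64 + a0^2.
func sqr2(a0, a1 uint64) (r [4]uint64) {
	h00, l00 := bits.Mul64(a0, a0) // a0^2
	h01, l01 := bits.Mul64(a0, a1) // cross term a0*a1, computed once
	h11, l11 := bits.Mul64(a1, a1) // a1^2

	// Double the cross term: 2*a0*a1 spans up to 129 bits.
	dl := l01 << 1
	dh := h01<<1 | l01>>63
	top := h01 >> 63

	var c uint64
	r[0] = l00
	r[1], c = bits.Add64(h00, dl, 0)
	r[2], c = bits.Add64(l11, dh, c)
	r[3], _ = bits.Add64(h11, top, c) // final carry is provably zero
	return
}

func main() {
	a0, a1 := uint64(0xDEADBEEFCAFEF00D), uint64(0x0123456789ABCDEF)
	r := sqr2(a0, a1)

	// Cross-check the 3-multiplication square against math/big.
	a := new(big.Int).SetUint64(a1)
	a.Lsh(a, 64).Add(a, new(big.Int).SetUint64(a0))
	want := fmt.Sprintf("%064x", new(big.Int).Mul(a, a))
	got := fmt.Sprintf("%016x%016x%016x%016x", r[3], r[2], r[1], r[0])
	fmt.Println(got == want) // true
}
```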
Why Field Assembly Speedup is More Modest
The field multiplication assembly provides a smaller speedup than scalar multiplication because:
- Go's uint128 emulation is efficient: The pure Go implementation uses `bits.Mul64` and `bits.Add64`, which compile to efficient machine code (see the accumulator sketch below)
- No SIMD opportunity: Field multiplication requires sequential 128-bit accumulator operations that don't parallelize well
- Memory access patterns: Both implementations have similar memory access patterns for the 5×52-bit limb representation
The squaring optimization is more effective because it reduces the number of multiplications by exploiting a[i]·a[j] = a[j]·a[i].
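For context on the first point, the 128-bit accumulator that the pure Go field code emulates looks roughly like this. This is a sketch of the general pattern; the `acc128` type and method names are illustrative, not the library's API.

```go
package field

import "math/bits"

// acc128 emulates the unsigned __int128 accumulator used by the C field
// code: one 128-bit value held as two 64-bit halves.
type acc128 struct{ hi, lo uint64 }

// muladd folds the 128-bit product a*b into the accumulator.
// bits.Mul64 and bits.Add64 are compiler intrinsics on amd64 (a MULQ plus
// an add-with-carry chain), which is why the pure Go version is already fast.
func (c *acc128) muladd(a, b uint64) {
	hi, lo := bits.Mul64(a, b)
	var carry uint64
	c.lo, carry = bits.Add64(c.lo, lo, 0)
	c.hi, _ = bits.Add64(c.hi, hi, carry)
}

// shr52 returns the low 52 bits and shifts the accumulator right by 52,
// the repeated "take a limb, carry the rest" step of a 5x52 reduction.
func (c *acc128) shr52() uint64 {
	const M = 0xFFFFFFFFFFFFF // 52-bit mask
	limb := c.lo & M
	c.lo = c.lo>>52 | c.hi<<12
	c.hi >>= 52
	return limb
}
```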
Memory Allocations
| Operation | Pure Go | x86-64 ASM | libsecp256k1 |
|---|---|---|---|
| Pubkey Derivation | 256 B / 4 allocs | 256 B / 4 allocs | 504 B / 13 allocs |
| Sign | 576 B / 10 allocs | 576 B / 10 allocs | 400 B / 8 allocs |
| Verify | 128 B / 4 allocs | 128 B / 4 allocs | 312 B / 8 allocs |
| ECDH | 209 B / 5 allocs | 209 B / 5 allocs | N/A |
The Pure Go and assembly implementations have identical memory profiles, since assembly only affects computation, not allocation patterns. libsecp256k1 via purego generally incurs more allocations because of FFI marshalling overhead, though its signing path actually allocates slightly less.
Analysis
Why Assembly Improvement is Limited at High Level
The scalar multiplication speedup (53%) and field squaring speedup (27%) don't fully translate to proportional high-level operation improvements because:
- Field operations dominate: Point multiplication on the elliptic curve spends most of its time in field arithmetic (modular multiplication and squaring over the prime field p), not in scalar arithmetic over the group order n.
- Operation breakdown: In a typical signature verification, roughly:
  - ~90% of time: field multiplications and squarings for point operations
  - ~5% of time: scalar arithmetic
  - ~5% of time: other operations (hashing, memory, etc.)
- Amdahl's Law: The 1.27x field squaring speedup applies only to the roughly half of field-arithmetic time spent squaring (squaring is called frequently in inversion and exponentiation), yielding on the order of a 10% improvement in field-heavy code paths; see the worked estimate below.
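As a quick back-of-the-envelope check of that last point (the 0.5 squaring share is the rough assumption from the list above, not a measurement):

```go
package main

import "fmt"

// Amdahl's-law estimate: if squaring accounts for roughly half of
// field-arithmetic time and only that half gets the measured 1.27x
// speedup, the field-heavy path improves by only ~12%.
func main() {
	const (
		f = 0.5  // assumed fraction of field time spent squaring
		s = 1.27 // measured isolated squaring speedup
	)
	speedup := 1 / ((1 - f) + f/s)
	fmt.Printf("estimated field-path speedup: %.2fx\n", speedup) // ~1.12x
}
```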
libsecp256k1 Performance
The Bitcoin Core C library via purego shows excellent performance:
- 2.7x faster for public key derivation and 3.4x faster for verification (signing is 1.4x faster)
- Uses highly optimized field arithmetic with platform-specific assembly
- Employs advanced techniques like GLV endomorphism
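For readers unfamiliar with the no-CGO approach mentioned above: binding libsecp256k1 through purego looks roughly like the sketch below. This is a generic illustration using github.com/ebitengine/purego and two functions from secp256k1.h, not the project's actual binding code; the library path and error handling are simplified.

```go
//go:build linux || darwin

package main

import (
	"fmt"

	"github.com/ebitengine/purego"
)

func main() {
	// Load the shared library at runtime; no CGO is involved. The exact file
	// name (e.g. libsecp256k1.so.0, libsecp256k1.dylib) is system-dependent.
	lib, err := purego.Dlopen("libsecp256k1.so", purego.RTLD_NOW|purego.RTLD_GLOBAL)
	if err != nil {
		panic(err)
	}

	// Bind C functions by symbol name; signatures follow secp256k1.h.
	var contextCreate func(flags uint32) uintptr
	var seckeyVerify func(ctx uintptr, seckey *byte) int32
	purego.RegisterLibFunc(&contextCreate, lib, "secp256k1_context_create")
	purego.RegisterLibFunc(&seckeyVerify, lib, "secp256k1_ec_seckey_verify")

	ctx := contextCreate(1)   // SECP256K1_CONTEXT_NONE
	seckey := [32]byte{31: 1} // big-endian scalar 1, a trivially valid key
	fmt.Println("valid:", seckeyVerify(ctx, &seckey[0]) == 1)
}
```

Each crossing of this FFI boundary involves some argument marshalling, which is consistent with the extra allocations noted in the memory table above.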
x86-64 Assembly Implementation Details
Scalar Multiplication (scalar_amd64.s)
Implements the same 3-phase reduction algorithm as bitcoin-core/secp256k1:
3-Phase Reduction Algorithm:
- Phase 1: 512 bits → 385 bits: `m[0..6] = l[0..3] + l[4..7] * NC`
- Phase 2: 385 bits → 258 bits: `p[0..4] = m[0..3] + m[4..6] * NC`
- Phase 3: 258 bits → 256 bits: `r[0..3] = p[0..3] + p[4] * NC`, plus a final conditional reduction if the result is ≥ n
Constants (NC = 2^256 - n):
- `NC0 = 0x402DA1732FC9BEBF`
- `NC1 = 0x4551231950B75FC4`
- `NC2 = 1`
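These constants are easy to sanity-check with math/big. The sketch below (illustrative only) verifies NC = 2^256 - n and the congruence each phase relies on: since 2^256 ≡ NC (mod n), a wide value hi·2^256 + lo can be folded to hi·NC + lo without changing its residue.

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// secp256k1 group order n and the folding constant NC = 2^256 - n.
	n, _ := new(big.Int).SetString(
		"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)
	two256 := new(big.Int).Lsh(big.NewInt(1), 256)
	nc := new(big.Int).Sub(two256, n)

	// NC2<<128 | NC1<<64 | NC0 from the constants listed above.
	fmt.Printf("NC = %X\n", nc) // 14551231950B75FC4402DA1732FC9BEBF

	// Folding step used by each phase: hi*2^256 + lo ≡ hi*NC + lo (mod n),
	// because the difference is hi*(2^256 - NC) = hi*n.
	hi := big.NewInt(0xDEADBEEF)
	lo := big.NewInt(0x12345678)
	wide := new(big.Int).Add(new(big.Int).Mul(hi, two256), lo)
	folded := new(big.Int).Add(new(big.Int).Mul(hi, nc), lo)
	diff := new(big.Int).Sub(wide, folded)
	fmt.Println(new(big.Int).Mod(diff, n).Sign() == 0) // true
}
```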
Field Multiplication and Squaring (field_amd64.s)
Ported from bitcoin-core/secp256k1's field_5x52_int128_impl.h:
5×52-bit Limb Representation:
- Field element value = Σ(n[i] × 2^(52×i)) for i = 0..4
- Each limb n[i] fits in 52 bits (with some headroom for accumulation)
- Total: 260 bits capacity for 256-bit field elements
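A minimal sketch of the unpacking step for this representation, assuming a 32-byte big-endian encoding as in the C code (illustrative; the library's actual field type may differ):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fe is a field element as five 52-bit limbs: value = Σ n[i] * 2^(52*i).
type fe [5]uint64

// setBytes unpacks a 32-byte big-endian value into 5x52-bit limbs.
func (f *fe) setBytes(b *[32]byte) {
	w0 := binary.BigEndian.Uint64(b[24:32])
	w1 := binary.BigEndian.Uint64(b[16:24])
	w2 := binary.BigEndian.Uint64(b[8:16])
	w3 := binary.BigEndian.Uint64(b[0:8])

	const M = 0xFFFFFFFFFFFFF // 52-bit mask
	f[0] = w0 & M
	f[1] = (w0>>52 | w1<<12) & M
	f[2] = (w1>>40 | w2<<24) & M
	f[3] = (w2>>28 | w3<<36) & M
	f[4] = w3 >> 16 // only 48 bits occupied, so the top limb has headroom
}

func main() {
	var b [32]byte
	b[31] = 0x01 // the value 1
	var f fe
	f.setBytes(&b)
	fmt.Println(f) // [1 0 0 0 0]
}
```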
Reduction Constants:
- Field prime p = 2^256 - 2^32 - 977
- R = (2^256 mod p) << 4 = 0x1000003D10, i.e. 2^260 mod p (shifted 4 bits to align with the 260-bit 5×52 limb span)
- M = 0xFFFFFFFFFFFFF (52-bit mask)
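These can be checked mechanically as well; the sketch below (illustrative only) confirms that the constant 0x1000003D10 is (2^256 mod p) shifted left by 4 bits, i.e. 2^260 mod p, matching the 260-bit span of the five 52-bit limbs.

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	one := big.NewInt(1)

	// p = 2^256 - 2^32 - 977
	p := new(big.Int).Lsh(one, 256)
	p.Sub(p, new(big.Int).Lsh(one, 32))
	p.Sub(p, big.NewInt(977))

	// 2^256 mod p = 2^32 + 977 = 0x1000003d1; shifted by 4 it becomes
	// 2^260 mod p, the weight of a carry out of the five 52-bit limbs.
	r256 := new(big.Int).Mod(new(big.Int).Lsh(one, 256), p)
	r260 := new(big.Int).Mod(new(big.Int).Lsh(one, 260), p)
	fmt.Printf("2^256 mod p = %#x\n", r256) // 0x1000003d1
	fmt.Printf("2^260 mod p = %#x\n", r260) // 0x1000003d10
	fmt.Println(r260.Cmp(new(big.Int).Lsh(r256, 4)) == 0) // true
}
```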
Algorithm Highlights:
- Uses 128-bit accumulators (via MULQ instruction producing DX:AX)
- Interleaves computation of partial products with reduction
- Squaring exploits symmetry: 2·a[i]·a[j] computed once instead of twice
Raw Benchmark Data
goos: linux
goarch: amd64
pkg: p256k1.mleku.dev/bench
cpu: AMD Ryzen 5 PRO 4650G with Radeon Graphics
# High-level operations (benchtime=2s)
BenchmarkPureGo_PubkeyDerivation-12 44107 56085 ns/op 256 B/op 4 allocs/op
BenchmarkPureGo_Sign-12 41503 56182 ns/op 576 B/op 10 allocs/op
BenchmarkPureGo_Verify-12 17293 144012 ns/op 128 B/op 4 allocs/op
BenchmarkPureGo_ECDH-12 22831 107799 ns/op 209 B/op 5 allocs/op
BenchmarkAVX2_PubkeyDerivation-12 43000 55724 ns/op 256 B/op 4 allocs/op
BenchmarkAVX2_Sign-12 41588 55999 ns/op 576 B/op 10 allocs/op
BenchmarkAVX2_Verify-12 17684 139552 ns/op 128 B/op 4 allocs/op
BenchmarkAVX2_ECDH-12 22786 106296 ns/op 209 B/op 5 allocs/op
BenchmarkLibSecp_Sign-12 59470 39916 ns/op 400 B/op 8 allocs/op
BenchmarkLibSecp_PubkeyDerivation-12 119511 20844 ns/op 504 B/op 13 allocs/op
BenchmarkLibSecp_Verify-12 57483 42102 ns/op 312 B/op 8 allocs/op
BenchmarkPubkeyDerivation-12 42465 54030 ns/op 256 B/op 4 allocs/op
BenchmarkSign-12 85609 28920 ns/op 576 B/op 10 allocs/op
BenchmarkVerify-12 17397 139216 ns/op 128 B/op 4 allocs/op
BenchmarkECDH-12 22885 104530 ns/op 209 B/op 5 allocs/op
# Isolated scalar operations (benchtime=2s)
BenchmarkScalarMulPureGo-12 50429706 46.52 ns/op
BenchmarkScalarMulAVX2-12 79820377 30.49 ns/op
BenchmarkScalarAddPureGo-12 464323708 5.288 ns/op
BenchmarkScalarAddAVX2-12 549494175 4.694 ns/op
# Isolated field operations (benchtime=1s, count=5)
BenchmarkFieldMulAsm-12 46677114 25.82 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulAsm-12 45379737 26.63 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulAsm-12 47394996 25.99 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulAsm-12 48337986 27.05 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulAsm-12 47056432 27.52 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulPureGo-12 42025989 27.86 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulPureGo-12 39620865 27.44 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulPureGo-12 39708454 27.25 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulPureGo-12 43870612 27.77 ns/op 0 B/op 0 allocs/op
BenchmarkFieldMulPureGo-12 44919584 27.41 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrAsm-12 59990847 21.63 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrAsm-12 57070836 21.85 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrAsm-12 55419507 21.81 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrAsm-12 57015470 21.93 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrAsm-12 54106294 21.12 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrPureGo-12 40245084 27.62 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrPureGo-12 43287774 27.04 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrPureGo-12 44501200 28.47 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrPureGo-12 46260654 27.04 ns/op 0 B/op 0 allocs/op
BenchmarkFieldSqrPureGo-12 45252552 27.75 ns/op 0 B/op 0 allocs/op
Conclusions
- Scalar multiplication is 53% faster with x86-64 assembly (46.52 ns → 30.49 ns)
- Scalar addition is 13% faster with x86-64 assembly (5.29 ns → 4.69 ns)
- Field squaring is 27% faster with x86-64 assembly (27.5 ns → 21.7 ns)
- Field multiplication is 6% faster with x86-64 assembly (27.5 ns → 26.0 ns)
- High-level operation improvements are modest (~1-3%) due to the complexity of the full cryptographic pipeline
- libsecp256k1 is 1.4-3.4x faster depending on the operation (it uses additional optimizations like the GLV endomorphism)
- Pure Go is competitive: within roughly 3.5x of highly optimized C for every measured operation
- Memory efficiency is identical between Pure Go and assembly implementations
Future Optimization Opportunities
To achieve larger speedups, focus on:
- BMI2/ADX instructions: Use MULX (BMI2) together with ADCX/ADOX (ADX) for better carry handling in field multiplication (potential 10-20% gain)
- AVX-512 IFMA: If available, use 52-bit multiply-add instructions for massive field operation speedup
- GLV endomorphism: Implement the secp256k1-specific optimization that splits scalar multiplication
- Vectorized point operations: Batch multiple independent point operations using SIMD
- ARM64 NEON: Add optimizations for Apple Silicon and ARM servers
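Most of these depend on runtime CPU feature detection before dispatching to a specialised code path. A sketch of how such gating typically looks, assuming golang.org/x/sys/cpu (the variable and function names here are hypothetical; the library's own detection mechanism may differ):

```go
package p256k1

import "golang.org/x/sys/cpu"

// useBMI2ADX reports whether MULX (BMI2) and ADCX/ADOX (ADX) are both
// available; together they allow a flag-friendly carry chain in field
// multiplication.
var useBMI2ADX = cpu.X86.HasBMI2 && cpu.X86.HasADX

// fieldMul dispatches to the best available implementation at runtime.
func fieldMul(r, a, b *[5]uint64) {
	if useBMI2ADX {
		fieldMulBMI2ADX(r, a, b)
		return
	}
	fieldMulGeneric(r, a, b)
}

// Stubs so the sketch compiles; the real routines are omitted.
func fieldMulBMI2ADX(r, a, b *[5]uint64) { fieldMulGeneric(r, a, b) }
func fieldMulGeneric(r, a, b *[5]uint64) { /* portable 5x52 multiply */ }
```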
References
- bitcoin-core/secp256k1 - Reference C implementation
- scalar_4x64_impl.h - Scalar reduction algorithm
- field_5x52_int128_impl.h - Field arithmetic implementation
- Efficient Modular Multiplication - Research on modular arithmetic optimization