Benchmark Report: p256k1 Implementation Comparison

This report compares performance of different secp256k1 implementations:

  1. Pure Go - p256k1 with assembly disabled (baseline)
  2. AVX2/ASM - p256k1 with x86-64 assembly enabled (scalar and field operations)
  3. libsecp256k1 - Bitcoin Core's C library via purego (no CGO)
  4. Default - p256k1 with automatic feature detection

Test Environment

  • Platform: Linux 6.8.0 (amd64)
  • CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (12 threads)
  • Go Version: go1.23+
  • Date: 2025-11-28

High-Level Operation Benchmarks

| Operation | Pure Go | AVX2 | libsecp256k1 | Default |
|---|---|---|---|---|
| Pubkey Derivation | 56.09 µs | 55.72 µs | 20.84 µs | 54.03 µs |
| Sign | 56.18 µs | 56.00 µs | 39.92 µs | 28.92 µs |
| Verify | 144.01 µs | 139.55 µs | 42.10 µs | 139.22 µs |
| ECDH | 107.80 µs | 106.30 µs | N/A | 104.53 µs |

Relative Performance (vs Pure Go)

| Operation | AVX2 | libsecp256k1 |
|---|---|---|
| Pubkey Derivation | 1.01x faster | 2.69x faster |
| Sign | 1.00x | 1.41x faster |
| Verify | 1.03x faster | 3.42x faster |
| ECDH | 1.01x faster | N/A |

Scalar Operation Benchmarks (Isolated)

These benchmarks measure the individual scalar arithmetic operations in isolation:

| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Scalar Multiply | 46.52 ns | 30.49 ns | 1.53x faster |
| Scalar Add | 5.29 ns | 4.69 ns | 1.13x faster |

The x86-64 scalar multiplication shows a 53% improvement over pure Go, demonstrating the effectiveness of the optimized 512-bit reduction algorithm.

Field Operation Benchmarks (Isolated)

Field operations (modular arithmetic over the secp256k1 prime field) dominate elliptic curve computations. These benchmarks measure the assembly-optimized field multiplication and squaring:

| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Field Multiply | 27.5 ns | 26.0 ns | 1.06x faster |
| Field Square | 27.5 ns | 21.7 ns | 1.27x faster |

The field squaring assembly is 1.27x faster (about 21% less time per operation) because it exploits the symmetry of squaring: each cross term is computed once as 2·a[i]·a[j] instead of a[i]·a[j] + a[j]·a[i].

Why Field Assembly Speedup is More Modest

The field multiplication assembly provides a smaller speedup than scalar multiplication because:

  1. Go's uint128 emulation is efficient: The pure Go implementation uses bits.Mul64 and bits.Add64 which compile to efficient machine code
  2. No SIMD opportunity: Field multiplication requires sequential 128-bit accumulator operations that don't parallelize well
  3. Memory access patterns: Both implementations have similar memory access patterns for the 5×52-bit limb representation

The squaring optimization is more effective because it reduces the number of multiplications by exploiting a[i]·a[j] = a[j]·a[i].
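As a minimal illustration of the uint128 emulation mentioned in point 1 (this is the standard math/bits pattern, not the p256k1 field code itself), a 64×64-bit multiply-accumulate into a 128-bit accumulator looks like this; on amd64 both intrinsics compile to single instructions:

```go
package main

import (
	"fmt"
	"math/bits"
)

// muladd128 adds a*b into the 128-bit accumulator accHi:accLo and returns
// the updated accumulator. bits.Mul64 and bits.Add64 are compiler
// intrinsics on amd64, so this lowers to MULQ/ADDQ/ADCQ with no calls.
func muladd128(accHi, accLo, a, b uint64) (hi, lo uint64) {
	pHi, pLo := bits.Mul64(a, b)           // 64x64 -> 128-bit product
	lo, carry := bits.Add64(accLo, pLo, 0) // add low halves, capture carry
	hi, _ = bits.Add64(accHi, pHi, carry)  // fold the carry into the high half
	return hi, lo
}

func main() {
	hi, lo := muladd128(0, 0, 1<<52-1, 1<<52-1) // e.g. square of a full 52-bit limb
	fmt.Printf("acc = %#x_%016x\n", hi, lo)
}
```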

Memory Allocations

| Operation | Pure Go | x86-64 ASM | libsecp256k1 |
|---|---|---|---|
| Pubkey Derivation | 256 B / 4 allocs | 256 B / 4 allocs | 504 B / 13 allocs |
| Sign | 576 B / 10 allocs | 576 B / 10 allocs | 400 B / 8 allocs |
| Verify | 128 B / 4 allocs | 128 B / 4 allocs | 312 B / 8 allocs |
| ECDH | 209 B / 5 allocs | 209 B / 5 allocs | N/A |

The Pure Go and assembly implementations have identical memory profiles, since the assembly only replaces computation and does not change allocation patterns. libsecp256k1 via purego allocates more for pubkey derivation and verification because of the FFI call overhead, although its Sign path allocates slightly less.

Analysis

Why Assembly Improvement is Limited at High Level

The scalar multiplication speedup (1.53x) and field squaring speedup (1.27x) don't fully translate into proportional high-level operation improvements because:

  1. Field operations dominate: Point multiplication on the elliptic curve spends most time in field arithmetic (modular multiplication/squaring over the prime field p), not scalar arithmetic over the group order n.

  2. Operation breakdown: In a typical signature verification:

    • ~90% of time: Field multiplications and squarings for point operations
    • ~5% of time: Scalar arithmetic
    • ~5% of time: Other operations (hashing, memory, etc.)
  3. Amdahl's Law: The 1.27x field squaring speedup affects roughly half of the field operations (squaring is called heavily in inversion and exponentiation), yielding at most ~10% improvement in field-heavy code paths.
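Plugging rough numbers into Amdahl's Law makes that estimate concrete: with a fraction f ≈ 0.45 of total time spent in squaring (half of the ~90% spent in field arithmetic) and a speedup s ≈ 1.27 on that fraction, the overall ceiling is 1 / ((1 − f) + f/s) = 1 / (0.55 + 0.45/1.27) ≈ 1.11, i.e. about 10%. These fractions come from the rough breakdown above, not from profiling.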

libsecp256k1 Performance

The Bitcoin Core C library via purego shows excellent performance:

  • 2.7-3.4x faster for pubkey derivation and verification, and 1.4x faster for signing
  • Uses highly optimized field arithmetic with platform-specific assembly
  • Employs advanced techniques like GLV endomorphism
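For context, the GLV endomorphism exploits the fact that secp256k1 has an efficiently computable map φ(x, y) = (β·x, y) with φ(P) = λ·P, where β is a cube root of unity mod p and λ a cube root of unity mod n. A scalar k can be split as k = k1 + k2·λ (mod n) with k1 and k2 roughly half the bit length of n, so k·P = k1·P + k2·φ(P) is evaluated as a half-length multi-scalar multiplication, roughly halving the number of point doublings.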

x86-64 Assembly Implementation Details

Scalar Multiplication (scalar_amd64.s)

Implements the same 3-phase reduction algorithm as bitcoin-core/secp256k1:

3-Phase Reduction Algorithm:

  1. Phase 1: 512 bits → 385 bits

    m[0..6] = l[0..3] + l[4..7] * NC
    
  2. Phase 2: 385 bits → 258 bits

    p[0..4] = m[0..3] + m[4..6] * NC
    
  3. Phase 3: 258 bits → 256 bits

    r[0..3] = p[0..3] + p[4] * NC
    

    Plus final conditional reduction if result ≥ n

Constants (NC = 2^256 - n):

  • NC0 = 0x402DA1732FC9BEBF
  • NC1 = 0x4551231950B75FC4
  • NC2 = 1
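The sketch below shows how phase 3 of that reduction can be expressed in Go with math/bits, using the NC limbs listed above. It is an illustration of the idea only (the function name and layout are mine, and the final subtract-n step is left to the caller), not the p256k1 assembly:

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

// NC = 2^256 - n, split into 64-bit limbs as listed above (NC2 = 1, so the
// p4*NC2 contribution is just p4 added at word 2).
const (
	nc0 = 0x402DA1732FC9BEBF
	nc1 = 0x4551231950B75FC4
)

// reducePhase3 folds the 2^256 part of an at-most-258-bit value back below
// 2^256, using 2^256 ≡ NC (mod n). The returned carry and the final
// "subtract n if r ≥ n" step are left to the caller; the real assembly
// interleaves all three phases and handles those cases in place.
func reducePhase3(p [4]uint64, p4 uint64) (r [4]uint64, carry uint64) {
	hi0, lo0 := bits.Mul64(p4, nc0)
	hi1, lo1 := bits.Mul64(p4, nc1)
	var c uint64
	r[0], c = bits.Add64(p[0], lo0, 0)
	r[1], c = bits.Add64(p[1], hi0, c)
	r[2], c = bits.Add64(p[2], p4, c) // p4 * NC2 with NC2 = 1
	r[3], carry = bits.Add64(p[3], 0, c)
	r[1], c = bits.Add64(r[1], lo1, 0)
	r[2], c = bits.Add64(r[2], hi1, c)
	r[3], c = bits.Add64(r[3], 0, c)
	carry += c
	return r, carry
}

func main() {
	// Cross-check one value against math/big.
	p := [4]uint64{^uint64(0), 1, 2, 3}
	var p4 uint64 = 3
	r, carry := reducePhase3(p, p4)

	n, _ := new(big.Int).SetString(
		"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)
	in := new(big.Int).SetUint64(p4)     // p4*2^256 + p[3..0]
	out := new(big.Int).SetUint64(carry) // carry*2^256 + r[3..0]
	for i := 3; i >= 0; i-- {
		in.Add(in.Lsh(in, 64), new(big.Int).SetUint64(p[i]))
		out.Add(out.Lsh(out, 64), new(big.Int).SetUint64(r[i]))
	}
	same := new(big.Int).Mod(in, n).Cmp(new(big.Int).Mod(out, n)) == 0
	fmt.Println("congruent mod n:", same) // expect true
}
```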

Field Multiplication and Squaring (field_amd64.s)

Ported from bitcoin-core/secp256k1's field_5x52_int128_impl.h:

5×52-bit Limb Representation:

  • Field element value = Σ(n[i] × 2^(52×i)) for i = 0..4
  • Each limb n[i] fits in 52 bits (with some headroom for accumulation)
  • Total: 260 bits capacity for 256-bit field elements
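As an illustration of this representation (the helper name is hypothetical, not from p256k1), converting four little-endian 64-bit words into 5×52-bit limbs only needs shifts and masks:

```go
package main

import "fmt"

const m52 = 0xFFFFFFFFFFFFF // the 52-bit mask M listed under Reduction Constants below

// toLimbs52 splits a 256-bit value, given as four little-endian 64-bit
// words, into the 5×52-bit form: value = Σ n[i]·2^(52·i). Limbs 0-3 carry
// 52 bits each and limb 4 carries the top 48 bits, leaving headroom for
// carries during multiplication. Illustrative helper, not p256k1 code.
func toLimbs52(b [4]uint64) (n [5]uint64) {
	n[0] = b[0] & m52
	n[1] = (b[0]>>52 | b[1]<<12) & m52
	n[2] = (b[1]>>40 | b[2]<<24) & m52
	n[3] = (b[2]>>28 | b[3]<<36) & m52
	n[4] = b[3] >> 16
	return n
}

func main() {
	n := toLimbs52([4]uint64{^uint64(0), 0, 0, 1 << 16}) // arbitrary example
	fmt.Printf("%x %x %x %x %x\n", n[4], n[3], n[2], n[1], n[0])
}
```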

Reduction Constants:

  • Field prime p = 2^256 - 2^32 - 977
  • R: 2^256 mod p = 0x1000003D1 (= 2^32 + 977); the code uses it shifted left by 4 bits as 0x1000003D10, because reduction folds at the 2^260 limb boundary (5 × 52 = 260 = 256 + 4)
  • M = 0xFFFFFFFFFFFFF (52-bit mask)
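Those constants are easy to sanity-check with math/big; a small standalone sketch (not part of the library):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	one := big.NewInt(1)

	// p = 2^256 - 2^32 - 977
	p := new(big.Int).Lsh(one, 256)
	p.Sub(p, new(big.Int).Lsh(one, 32))
	p.Sub(p, big.NewInt(977))

	// 2^256 mod p = 2^32 + 977 = 0x1000003d1; the 5×52 code works with the
	// value shifted left by 4 bits (0x1000003d10), matching the 2^260 limb
	// boundary used during reduction.
	r := new(big.Int).Mod(new(big.Int).Lsh(one, 256), p)
	fmt.Printf("2^256 mod p = %#x\n", r)
	fmt.Printf("shifted <<4 = %#x\n", new(big.Int).Lsh(r, 4))
}
```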

Algorithm Highlights:

  • Uses 128-bit accumulators (the MULQ instruction produces a 128-bit product in RDX:RAX)
  • Interleaves computation of partial products with reduction
  • Squaring exploits symmetry: 2·a[i]·a[j] computed once instead of twice
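The symmetry trick is easiest to see on a two-limb toy example (illustrative only; the real code works on the 5×52 representation): the single cross product is computed once and doubled, and the same counting argument gives 15 instead of 25 partial products for five limbs.

```go
package main

import (
	"fmt"
	"math/bits"
)

// square128 squares x = x1·2^64 + x0 into a 256-bit result (little-endian
// words). The single cross product x0·x1 is computed once and doubled; the
// 5×52 squaring assembly applies the same idea, needing n(n+1)/2 = 15
// partial products for n = 5 limbs instead of n² = 25.
func square128(x0, x1 uint64) (r [4]uint64) {
	cHi, cLo := bits.Mul64(x0, x1) // cross product, computed once
	dTop := cHi >> 63              // double it by shifting left one bit
	dHi := cHi<<1 | cLo>>63
	dLo := cLo << 1

	sq0Hi, sq0Lo := bits.Mul64(x0, x0)
	sq1Hi, sq1Lo := bits.Mul64(x1, x1)

	var c uint64
	r[0] = sq0Lo
	r[1], c = bits.Add64(sq0Hi, dLo, 0)
	r[2], c = bits.Add64(sq1Lo, dHi, c)
	r[3], _ = bits.Add64(sq1Hi, dTop, c)
	return r
}

func main() {
	r := square128(^uint64(0), ^uint64(0)) // (2^128 - 1)^2
	fmt.Printf("%016x %016x %016x %016x\n", r[3], r[2], r[1], r[0])
}
```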

Raw Benchmark Data

goos: linux
goarch: amd64
pkg: p256k1.mleku.dev/bench
cpu: AMD Ryzen 5 PRO 4650G with Radeon Graphics

# High-level operations (benchtime=2s)
BenchmarkPureGo_PubkeyDerivation-12     	   44107	     56085 ns/op	     256 B/op	       4 allocs/op
BenchmarkPureGo_Sign-12                 	   41503	     56182 ns/op	     576 B/op	      10 allocs/op
BenchmarkPureGo_Verify-12               	   17293	    144012 ns/op	     128 B/op	       4 allocs/op
BenchmarkPureGo_ECDH-12                 	   22831	    107799 ns/op	     209 B/op	       5 allocs/op
BenchmarkAVX2_PubkeyDerivation-12       	   43000	     55724 ns/op	     256 B/op	       4 allocs/op
BenchmarkAVX2_Sign-12                   	   41588	     55999 ns/op	     576 B/op	      10 allocs/op
BenchmarkAVX2_Verify-12                 	   17684	    139552 ns/op	     128 B/op	       4 allocs/op
BenchmarkAVX2_ECDH-12                   	   22786	    106296 ns/op	     209 B/op	       5 allocs/op
BenchmarkLibSecp_Sign-12                	   59470	     39916 ns/op	     400 B/op	       8 allocs/op
BenchmarkLibSecp_PubkeyDerivation-12    	  119511	     20844 ns/op	     504 B/op	      13 allocs/op
BenchmarkLibSecp_Verify-12              	   57483	     42102 ns/op	     312 B/op	       8 allocs/op
BenchmarkPubkeyDerivation-12            	   42465	     54030 ns/op	     256 B/op	       4 allocs/op
BenchmarkSign-12                        	   85609	     28920 ns/op	     576 B/op	      10 allocs/op
BenchmarkVerify-12                      	   17397	    139216 ns/op	     128 B/op	       4 allocs/op
BenchmarkECDH-12                        	   22885	    104530 ns/op	     209 B/op	       5 allocs/op

# Isolated scalar operations (benchtime=2s)
BenchmarkScalarMulPureGo-12    	50429706	        46.52 ns/op
BenchmarkScalarMulAVX2-12      	79820377	        30.49 ns/op
BenchmarkScalarAddPureGo-12    	464323708	         5.288 ns/op
BenchmarkScalarAddAVX2-12      	549494175	         4.694 ns/op

# Isolated field operations (benchtime=1s, count=5)
BenchmarkFieldMulAsm-12       	46677114	        25.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	45379737	        26.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	47394996	        25.99 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	48337986	        27.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	47056432	        27.52 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	42025989	        27.86 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	39620865	        27.44 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	39708454	        27.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	43870612	        27.77 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	44919584	        27.41 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	59990847	        21.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	57070836	        21.85 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	55419507	        21.81 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	57015470	        21.93 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	54106294	        21.12 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	40245084	        27.62 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	43287774	        27.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	44501200	        28.47 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	46260654	        27.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	45252552	        27.75 ns/op	       0 B/op	       0 allocs/op

Conclusions

  1. Scalar multiplication is 53% faster with x86-64 assembly (46.52 ns → 30.49 ns)
  2. Scalar addition is 13% faster with x86-64 assembly (5.29 ns → 4.69 ns)
  3. Field squaring is 27% faster with x86-64 assembly (27.5 ns → 21.7 ns)
  4. Field multiplication is 6% faster with x86-64 assembly (27.5 ns → 26.0 ns)
  5. High-level operation improvements are modest (~1-3%) due to the complexity of the full cryptographic pipeline
  6. libsecp256k1 is 1.4-3.4x faster depending on the operation (it uses additional optimizations such as the GLV endomorphism)
  7. Pure Go is competitive, staying within about 3.5x of highly optimized C across the measured operations
  8. Memory efficiency is identical between Pure Go and assembly implementations

Future Optimization Opportunities

To achieve larger speedups, focus on:

  1. BMI2/ADX instructions: Use MULX (BMI2) together with ADCX/ADOX (ADX) for better carry handling in field multiplication (potential 10-20% gain)
  2. AVX-512 IFMA: If available, use 52-bit multiply-add instructions for massive field operation speedup
  3. GLV endomorphism: Implement the secp256k1-specific optimization that splits scalar multiplication
  4. Vectorized point operations: Batch multiple independent point operations using SIMD
  5. ARM64 NEON: Add optimizations for Apple Silicon and ARM servers
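If the x86-side items are pursued, the natural gating is runtime feature detection. A hypothetical dispatch sketch using golang.org/x/sys/cpu (the backend names are placeholders, not existing p256k1 code):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// pickFieldBackend chooses a field-multiplication backend from what the CPU
// reports. The backend names are placeholders for this sketch, not functions
// that exist in p256k1 today.
func pickFieldBackend() string {
	switch {
	case cpu.X86.HasAVX512IFMA: // 52-bit multiply-add (item 2 above)
		return "avx512-ifma"
	case cpu.X86.HasBMI2 && cpu.X86.HasADX: // MULX + ADCX/ADOX (item 1 above)
		return "bmi2-adx"
	default:
		return "generic-amd64"
	}
}

func main() {
	fmt.Println("field backend:", pickFieldBackend())
}
```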

References