Benchmark Report: p256k1 Implementation Comparison

This report compares performance of different secp256k1 implementations:

  1. Pure Go - p256k1 with assembly disabled (baseline)
  2. AVX2/ASM - p256k1 with x86-64 assembly enabled (scalar and field operations)
  3. libsecp256k1 - Bitcoin Core's C library via purego (no CGO)
  4. Default - p256k1 with automatic feature detection

Test Environment

  • Platform: Linux 6.8.0 (amd64)
  • CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (12 threads)
  • Go Version: go1.23+
  • Date: 2025-11-28

High-Level Operation Benchmarks

| Operation | Pure Go | AVX2 | libsecp256k1 | Default |
|---|---|---|---|---|
| Pubkey Derivation | 56.09 µs | 55.72 µs | 20.84 µs | 54.03 µs |
| Sign | 56.18 µs | 56.00 µs | 39.92 µs | 28.92 µs |
| Verify | 144.01 µs | 139.55 µs | 42.10 µs | 139.22 µs |
| ECDH | 107.80 µs | 106.30 µs | N/A | 104.53 µs |

Relative Performance (vs Pure Go)

| Operation | AVX2 | libsecp256k1 |
|---|---|---|
| Pubkey Derivation | 1.01x faster | 2.69x faster |
| Sign | 1.00x | 1.41x faster |
| Verify | 1.03x faster | 3.42x faster |
| ECDH | 1.01x faster | N/A |

Scalar Operation Benchmarks (Isolated)

These benchmarks measure the individual scalar arithmetic operations in isolation:

| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Scalar Multiply | 46.52 ns | 30.49 ns | 1.53x faster |
| Scalar Add | 5.29 ns | 4.69 ns | 1.13x faster |

The x86-64 scalar multiplication shows a 53% improvement over pure Go, demonstrating the effectiveness of the optimized 512-bit reduction algorithm.

Field Operation Benchmarks (Isolated)

Field operations (modular arithmetic over the secp256k1 prime field) dominate elliptic curve computations. These benchmarks measure the assembly-optimized field multiplication and squaring:

| Operation | Pure Go | x86-64 Assembly | Speedup |
|---|---|---|---|
| Field Multiply | 27.5 ns | 26.0 ns | 1.06x faster |
| Field Square | 27.5 ns | 21.7 ns | 1.27x faster |

The field squaring assembly is 1.27x faster (about 21% less time per operation) because it exploits the symmetry of squaring: each cross term is computed once as 2·a[i]·a[j] instead of a[i]·a[j] + a[j]·a[i].

Why Field Assembly Speedup is More Modest

The field multiplication assembly provides a smaller speedup than scalar multiplication because:

  1. Go's uint128 emulation is efficient: The pure Go implementation uses bits.Mul64 and bits.Add64 which compile to efficient machine code
  2. No SIMD opportunity: Field multiplication requires sequential 128-bit accumulator operations that don't parallelize well
  3. Memory access patterns: Both implementations have similar memory access patterns for the 5×52-bit limb representation

The squaring optimization is more effective because it reduces the number of multiplications by exploiting a[i]·a[j] = a[j]·a[i].
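As a minimal illustration of the uint128 emulation mentioned in point 1 (this is the standard math/bits pattern, not the p256k1 field code itself), a 64×64-bit multiply-accumulate into a 128-bit accumulator looks like this; on amd64 both intrinsics compile to single instructions:

```go
package main

import (
	"fmt"
	"math/bits"
)

// muladd128 adds a*b into the 128-bit accumulator accHi:accLo and returns
// the updated accumulator. bits.Mul64 and bits.Add64 are compiler
// intrinsics on amd64, so this lowers to MULQ/ADDQ/ADCQ with no calls.
func muladd128(accHi, accLo, a, b uint64) (hi, lo uint64) {
	pHi, pLo := bits.Mul64(a, b)           // 64x64 -> 128-bit product
	lo, carry := bits.Add64(accLo, pLo, 0) // add low halves, capture carry
	hi, _ = bits.Add64(accHi, pHi, carry)  // fold the carry into the high half
	return hi, lo
}

func main() {
	hi, lo := muladd128(0, 0, 1<<52-1, 1<<52-1) // e.g. square of a full 52-bit limb
	fmt.Printf("acc = %#x_%016x\n", hi, lo)
}
```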

Memory Allocations

| Operation | Pure Go | x86-64 ASM | libsecp256k1 |
|---|---|---|---|
| Pubkey Derivation | 256 B / 4 allocs | 256 B / 4 allocs | 504 B / 13 allocs |
| Sign | 576 B / 10 allocs | 576 B / 10 allocs | 400 B / 8 allocs |
| Verify | 128 B / 4 allocs | 128 B / 4 allocs | 312 B / 8 allocs |
| ECDH | 209 B / 5 allocs | 209 B / 5 allocs | N/A |

The Pure Go and assembly implementations have identical memory profiles, since the assembly only replaces computation and does not change allocation patterns. libsecp256k1 via purego allocates more for pubkey derivation and verification because of the FFI call overhead, although its Sign path allocates slightly less.

Analysis

Why Assembly Improvement is Limited at High Level

The scalar multiplication speedup (1.53x) and field squaring speedup (1.27x) don't fully translate into proportional high-level operation improvements because:

  1. Field operations dominate: Point multiplication on the elliptic curve spends most time in field arithmetic (modular multiplication/squaring over the prime field p), not scalar arithmetic over the group order n.

  2. Operation breakdown: In a typical signature verification:

    • ~90% of time: Field multiplications and squarings for point operations
    • ~5% of time: Scalar arithmetic
    • ~5% of time: Other operations (hashing, memory, etc.)
  3. Amdahl's Law: The 1.27x field squaring speedup affects roughly half of the field operations (squaring is called heavily in inversion and exponentiation), yielding at most ~10% improvement in field-heavy code paths.
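Plugging rough numbers into Amdahl's Law makes that estimate concrete: with a fraction f ≈ 0.45 of total time spent in squaring (half of the ~90% spent in field arithmetic) and a speedup s ≈ 1.27 on that fraction, the overall ceiling is 1 / ((1 − f) + f/s) = 1 / (0.55 + 0.45/1.27) ≈ 1.11, i.e. about 10%. These fractions come from the rough breakdown above, not from profiling.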

libsecp256k1 Performance

The Bitcoin Core C library via purego shows excellent performance:

  • 2.7-3.4x faster for pubkey derivation and verification, and 1.4x faster for signing
  • Uses highly optimized field arithmetic with platform-specific assembly
  • Employs advanced techniques like GLV endomorphism
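For context, the GLV endomorphism exploits the fact that secp256k1 has an efficiently computable map φ(x, y) = (β·x, y) with φ(P) = λ·P, where β is a cube root of unity mod p and λ a cube root of unity mod n. A scalar k can be split as k = k1 + k2·λ (mod n) with k1 and k2 roughly half the bit length of n, so k·P = k1·P + k2·φ(P) is evaluated as a half-length multi-scalar multiplication, roughly halving the number of point doublings.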

x86-64 Assembly Implementation Details

Scalar Multiplication (scalar_amd64.s)

Implements the same 3-phase reduction algorithm as bitcoin-core/secp256k1:

3-Phase Reduction Algorithm:

  1. Phase 1: 512 bits → 385 bits

    m[0..6] = l[0..3] + l[4..7] * NC
    
  2. Phase 2: 385 bits → 258 bits

    p[0..4] = m[0..3] + m[4..6] * NC
    
  3. Phase 3: 258 bits → 256 bits

    r[0..3] = p[0..3] + p[4] * NC
    

    Plus final conditional reduction if result ≥ n

Constants (NC = 2^256 - n):

  • NC0 = 0x402DA1732FC9BEBF
  • NC1 = 0x4551231950B75FC4
  • NC2 = 1
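The sketch below shows how phase 3 of that reduction can be expressed in Go with math/bits, using the NC limbs listed above. It is an illustration of the idea only (the function name and layout are mine, and the final subtract-n step is left to the caller), not the p256k1 assembly:

```go
package main

import (
	"fmt"
	"math/big"
	"math/bits"
)

// NC = 2^256 - n, split into 64-bit limbs as listed above (NC2 = 1, so the
// p4*NC2 contribution is just p4 added at word 2).
const (
	nc0 = 0x402DA1732FC9BEBF
	nc1 = 0x4551231950B75FC4
)

// reducePhase3 folds the 2^256 part of an at-most-258-bit value back below
// 2^256, using 2^256 ≡ NC (mod n). The returned carry and the final
// "subtract n if r ≥ n" step are left to the caller; the real assembly
// interleaves all three phases and handles those cases in place.
func reducePhase3(p [4]uint64, p4 uint64) (r [4]uint64, carry uint64) {
	hi0, lo0 := bits.Mul64(p4, nc0)
	hi1, lo1 := bits.Mul64(p4, nc1)
	var c uint64
	r[0], c = bits.Add64(p[0], lo0, 0)
	r[1], c = bits.Add64(p[1], hi0, c)
	r[2], c = bits.Add64(p[2], p4, c) // p4 * NC2 with NC2 = 1
	r[3], carry = bits.Add64(p[3], 0, c)
	r[1], c = bits.Add64(r[1], lo1, 0)
	r[2], c = bits.Add64(r[2], hi1, c)
	r[3], c = bits.Add64(r[3], 0, c)
	carry += c
	return r, carry
}

func main() {
	// Cross-check one value against math/big.
	p := [4]uint64{^uint64(0), 1, 2, 3}
	var p4 uint64 = 3
	r, carry := reducePhase3(p, p4)

	n, _ := new(big.Int).SetString(
		"FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141", 16)
	in := new(big.Int).SetUint64(p4)     // p4*2^256 + p[3..0]
	out := new(big.Int).SetUint64(carry) // carry*2^256 + r[3..0]
	for i := 3; i >= 0; i-- {
		in.Add(in.Lsh(in, 64), new(big.Int).SetUint64(p[i]))
		out.Add(out.Lsh(out, 64), new(big.Int).SetUint64(r[i]))
	}
	same := new(big.Int).Mod(in, n).Cmp(new(big.Int).Mod(out, n)) == 0
	fmt.Println("congruent mod n:", same) // expect true
}
```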

Field Multiplication and Squaring (field_amd64.s)

Ported from bitcoin-core/secp256k1's field_5x52_int128_impl.h:

5×52-bit Limb Representation:

  • Field element value = Σ(n[i] × 2^(52×i)) for i = 0..4
  • Each limb n[i] fits in 52 bits (with some headroom for accumulation)
  • Total: 260 bits capacity for 256-bit field elements
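As an illustration of this representation (the helper name is hypothetical, not from p256k1), converting four little-endian 64-bit words into 5×52-bit limbs only needs shifts and masks:

```go
package main

import "fmt"

const m52 = 0xFFFFFFFFFFFFF // the 52-bit mask M listed under Reduction Constants below

// toLimbs52 splits a 256-bit value, given as four little-endian 64-bit
// words, into the 5×52-bit form: value = Σ n[i]·2^(52·i). Limbs 0-3 carry
// 52 bits each and limb 4 carries the top 48 bits, leaving headroom for
// carries during multiplication. Illustrative helper, not p256k1 code.
func toLimbs52(b [4]uint64) (n [5]uint64) {
	n[0] = b[0] & m52
	n[1] = (b[0]>>52 | b[1]<<12) & m52
	n[2] = (b[1]>>40 | b[2]<<24) & m52
	n[3] = (b[2]>>28 | b[3]<<36) & m52
	n[4] = b[3] >> 16
	return n
}

func main() {
	n := toLimbs52([4]uint64{^uint64(0), 0, 0, 1 << 16}) // arbitrary example
	fmt.Printf("%x %x %x %x %x\n", n[4], n[3], n[2], n[1], n[0])
}
```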

Reduction Constants:

  • Field prime p = 2^256 - 2^32 - 977
  • R: 2^256 mod p = 0x1000003D1 (= 2^32 + 977); the code uses it shifted left by 4 bits as 0x1000003D10, because reduction folds at the 2^260 limb boundary (5 × 52 = 260 = 256 + 4)
  • M = 0xFFFFFFFFFFFFF (52-bit mask)
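Those constants are easy to sanity-check with math/big; a small standalone sketch (not part of the library):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	one := big.NewInt(1)

	// p = 2^256 - 2^32 - 977
	p := new(big.Int).Lsh(one, 256)
	p.Sub(p, new(big.Int).Lsh(one, 32))
	p.Sub(p, big.NewInt(977))

	// 2^256 mod p = 2^32 + 977 = 0x1000003d1; the 5×52 code works with the
	// value shifted left by 4 bits (0x1000003d10), matching the 2^260 limb
	// boundary used during reduction.
	r := new(big.Int).Mod(new(big.Int).Lsh(one, 256), p)
	fmt.Printf("2^256 mod p = %#x\n", r)
	fmt.Printf("shifted <<4 = %#x\n", new(big.Int).Lsh(r, 4))
}
```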

Algorithm Highlights:

  • Uses 128-bit accumulators (the MULQ instruction produces a 128-bit product in RDX:RAX)
  • Interleaves computation of partial products with reduction
  • Squaring exploits symmetry: 2·a[i]·a[j] computed once instead of twice
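The symmetry trick is easiest to see on a two-limb toy example (illustrative only; the real code works on the 5×52 representation): the single cross product is computed once and doubled, and the same counting argument gives 15 instead of 25 partial products for five limbs.

```go
package main

import (
	"fmt"
	"math/bits"
)

// square128 squares x = x1·2^64 + x0 into a 256-bit result (little-endian
// words). The single cross product x0·x1 is computed once and doubled; the
// 5×52 squaring assembly applies the same idea, needing n(n+1)/2 = 15
// partial products for n = 5 limbs instead of n² = 25.
func square128(x0, x1 uint64) (r [4]uint64) {
	cHi, cLo := bits.Mul64(x0, x1) // cross product, computed once
	dTop := cHi >> 63              // double it by shifting left one bit
	dHi := cHi<<1 | cLo>>63
	dLo := cLo << 1

	sq0Hi, sq0Lo := bits.Mul64(x0, x0)
	sq1Hi, sq1Lo := bits.Mul64(x1, x1)

	var c uint64
	r[0] = sq0Lo
	r[1], c = bits.Add64(sq0Hi, dLo, 0)
	r[2], c = bits.Add64(sq1Lo, dHi, c)
	r[3], _ = bits.Add64(sq1Hi, dTop, c)
	return r
}

func main() {
	r := square128(^uint64(0), ^uint64(0)) // (2^128 - 1)^2
	fmt.Printf("%016x %016x %016x %016x\n", r[3], r[2], r[1], r[0])
}
```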

Raw Benchmark Data

goos: linux
goarch: amd64
pkg: p256k1.mleku.dev/bench
cpu: AMD Ryzen 5 PRO 4650G with Radeon Graphics

# High-level operations (benchtime=2s)
BenchmarkPureGo_PubkeyDerivation-12     	   44107	     56085 ns/op	     256 B/op	       4 allocs/op
BenchmarkPureGo_Sign-12                 	   41503	     56182 ns/op	     576 B/op	      10 allocs/op
BenchmarkPureGo_Verify-12               	   17293	    144012 ns/op	     128 B/op	       4 allocs/op
BenchmarkPureGo_ECDH-12                 	   22831	    107799 ns/op	     209 B/op	       5 allocs/op
BenchmarkAVX2_PubkeyDerivation-12       	   43000	     55724 ns/op	     256 B/op	       4 allocs/op
BenchmarkAVX2_Sign-12                   	   41588	     55999 ns/op	     576 B/op	      10 allocs/op
BenchmarkAVX2_Verify-12                 	   17684	    139552 ns/op	     128 B/op	       4 allocs/op
BenchmarkAVX2_ECDH-12                   	   22786	    106296 ns/op	     209 B/op	       5 allocs/op
BenchmarkLibSecp_Sign-12                	   59470	     39916 ns/op	     400 B/op	       8 allocs/op
BenchmarkLibSecp_PubkeyDerivation-12    	  119511	     20844 ns/op	     504 B/op	      13 allocs/op
BenchmarkLibSecp_Verify-12              	   57483	     42102 ns/op	     312 B/op	       8 allocs/op
BenchmarkPubkeyDerivation-12            	   42465	     54030 ns/op	     256 B/op	       4 allocs/op
BenchmarkSign-12                        	   85609	     28920 ns/op	     576 B/op	      10 allocs/op
BenchmarkVerify-12                      	   17397	    139216 ns/op	     128 B/op	       4 allocs/op
BenchmarkECDH-12                        	   22885	    104530 ns/op	     209 B/op	       5 allocs/op

# Isolated scalar operations (benchtime=2s)
BenchmarkScalarMulPureGo-12    	50429706	        46.52 ns/op
BenchmarkScalarMulAVX2-12      	79820377	        30.49 ns/op
BenchmarkScalarAddPureGo-12    	464323708	         5.288 ns/op
BenchmarkScalarAddAVX2-12      	549494175	         4.694 ns/op

# Isolated field operations (benchtime=1s, count=5)
BenchmarkFieldMulAsm-12       	46677114	        25.82 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	45379737	        26.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	47394996	        25.99 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	48337986	        27.05 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulAsm-12       	47056432	        27.52 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	42025989	        27.86 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	39620865	        27.44 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	39708454	        27.25 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	43870612	        27.77 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldMulPureGo-12    	44919584	        27.41 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	59990847	        21.63 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	57070836	        21.85 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	55419507	        21.81 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	57015470	        21.93 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrAsm-12       	54106294	        21.12 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	40245084	        27.62 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	43287774	        27.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	44501200	        28.47 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	46260654	        27.04 ns/op	       0 B/op	       0 allocs/op
BenchmarkFieldSqrPureGo-12    	45252552	        27.75 ns/op	       0 B/op	       0 allocs/op

Conclusions

  1. Scalar multiplication is 53% faster with x86-64 assembly (46.52 ns → 30.49 ns)
  2. Scalar addition is 13% faster with x86-64 assembly (5.29 ns → 4.69 ns)
  3. Field squaring is 27% faster with x86-64 assembly (27.5 ns → 21.7 ns)
  4. Field multiplication is 6% faster with x86-64 assembly (27.5 ns → 26.0 ns)
  5. High-level operation improvements are modest (~1-3%) due to the complexity of the full cryptographic pipeline
  6. libsecp256k1 is 1.4-3.4x faster depending on the operation (it uses additional optimizations such as the GLV endomorphism)
  7. Pure Go is competitive, staying within about 3.5x of highly optimized C across the measured operations
  8. Memory efficiency is identical between Pure Go and assembly implementations

Future Optimization Opportunities

To achieve larger speedups, focus on:

  1. BMI2/ADX instructions: Use MULX (BMI2) together with ADCX/ADOX (ADX) for better carry handling in field multiplication (potential 10-20% gain)
  2. AVX-512 IFMA: If available, use 52-bit multiply-add instructions for massive field operation speedup
  3. GLV endomorphism: Implement the secp256k1-specific optimization that splits scalar multiplication
  4. Vectorized point operations: Batch multiple independent point operations using SIMD
  5. ARM64 NEON: Add optimizations for Apple Silicon and ARM servers
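If the x86-side items are pursued, the natural gating is runtime feature detection. A hypothetical dispatch sketch using golang.org/x/sys/cpu (the backend names are placeholders, not existing p256k1 code):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// pickFieldBackend chooses a field-multiplication backend from what the CPU
// reports. The backend names are placeholders for this sketch, not functions
// that exist in p256k1 today.
func pickFieldBackend() string {
	switch {
	case cpu.X86.HasAVX512IFMA: // 52-bit multiply-add (item 2 above)
		return "avx512-ifma"
	case cpu.X86.HasBMI2 && cpu.X86.HasADX: // MULX + ADCX/ADOX (item 1 above)
		return "bmi2-adx"
	default:
		return "generic-amd64"
	}
}

func main() {
	fmt.Println("field backend:", pickFieldBackend())
}
```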

References