# AVX2 secp256k1 Implementation Plan

## Overview

This implementation uses 128-bit limbs held in AVX2 256-bit registers for secp256k1 cryptographic operations. The key insight is that a single YMM register holds two 128-bit limbs, i.e. one full 256-bit scalar or field element, enabling efficient parallel processing.

## Data Layout

### Register Mapping
| Type | Size | AVX2 Representation | Registers |
|---|---|---|---|
| Uint128 | 128-bit | 1×128-bit in XMM or half YMM | 0.5 YMM |
| Scalar | 256-bit | 2×128-bit limbs | 1 YMM |
| FieldElement | 256-bit | 2×128-bit limbs | 1 YMM |
| AffinePoint | 512-bit | 2×FieldElement (x, y) | 2 YMM |
| JacobianPoint | 768-bit | 3×FieldElement (x, y, z) | 3 YMM |
### Memory Layout

Uint128:

```
[Lo:64][Hi:64] = 128 bits
```

Scalar/FieldElement (in a YMM register):

```
YMM = [D[0].Lo:64][D[0].Hi:64][D[1].Lo:64][D[1].Hi:64]
      ├──── 128-bit limb 0 ───┤├──── 128-bit limb 1 ───┤
```

AffinePoint (2 YMM registers):

```
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
```

JacobianPoint (3 YMM registers):

```
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
YMM2 = Z coordinate (256 bits)
```
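
To mirror this layout on the Go side, here is a minimal sketch of the type definitions (the field names are assumptions; the final versions live in `types.go`):

```go
package avx

// Uint128 is one 128-bit limb, stored least-significant half first.
type Uint128 struct {
	Lo, Hi uint64
}

// Scalar and FieldElement are 256-bit values held as two 128-bit limbs,
// with D[0] the least-significant limb. Each fits in one YMM register.
type Scalar struct {
	D [2]Uint128
}

type FieldElement struct {
	D [2]Uint128
}

// AffinePoint is (x, y); JacobianPoint is (X, Y, Z) with x = X/Z², y = Y/Z³.
type AffinePoint struct {
	X, Y FieldElement
}

type JacobianPoint struct {
	X, Y, Z FieldElement
}
```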
## Implementation Phases

### Phase 1: Core 128-bit Operations

File: `uint128_amd64.s`

- `uint128Add` - Add two 128-bit values with carry out
  - Instructions: `ADDQ`, `ADCQ`
  - Input: XMM0 (a), XMM1 (b)
  - Output: XMM0 (result), carry flag
- `uint128Sub` - Subtract with borrow
  - Instructions: `SUBQ`, `SBBQ`
- `uint128Mul` - Multiply two 64-bit values to get a 128-bit result
  - Instructions: `MULQ` (scalar) or `VPMULUDQ` (SIMD)
- `uint128Mul128` - Full 128×128→256 multiplication
  - This is the critical operation for field/scalar multiplication
  - Uses Karatsuba or schoolbook with `VPMULUDQ`
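
As a reference for these primitives, a minimal pure-Go sketch of the `uint128.go` fallback using `math/bits` (function names follow the list above, the exact signatures are assumptions; the full 128×128→256 multiply is sketched under Multiplication Strategy below):

```go
package avx

import "math/bits"

// uint128Add returns a+b and the carry out (0 or 1).
func uint128Add(a, b Uint128) (Uint128, uint64) {
	lo, c := bits.Add64(a.Lo, b.Lo, 0)
	hi, carryOut := bits.Add64(a.Hi, b.Hi, c)
	return Uint128{Lo: lo, Hi: hi}, carryOut
}

// uint128Sub returns a-b and the borrow out (0 or 1).
func uint128Sub(a, b Uint128) (Uint128, uint64) {
	lo, borrow := bits.Sub64(a.Lo, b.Lo, 0)
	hi, borrowOut := bits.Sub64(a.Hi, b.Hi, borrow)
	return Uint128{Lo: lo, Hi: hi}, borrowOut
}

// uint128Mul multiplies two 64-bit values into a 128-bit result.
func uint128Mul(a, b uint64) Uint128 {
	hi, lo := bits.Mul64(a, b)
	return Uint128{Lo: lo, Hi: hi}
}
```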
### Phase 2: Scalar Operations (mod n)

Files: `scalar_amd64.go` (stubs), `scalar_amd64.s` (assembly)

- `ScalarAdd` - Add two scalars mod n
  - Load a into YMM0 and b into YMM1
  - `VPADDQ YMM0, YMM0, YMM1` - parallel add of the 64-bit lanes
  - Handle carries between 64-bit lanes
  - Conditionally subtract n if the result is >= n
- `ScalarSub` - Subtract scalars mod n
  - Similar to add but with `VPSUBQ` and a conditional add of n
- `ScalarMul` - Multiply scalars mod n
  - Compute the 512-bit product using 128×128 multiplications
  - Reduce mod n using Barrett or Montgomery reduction
  - The 512-bit intermediate fits in 2 YMM registers
- `ScalarNegate` - Compute -a mod n
  - Computes `n - a` using subtraction
- `ScalarInverse` - Compute a^(-1) mod n
  - Use Fermat's little theorem: a^(n-2) mod n
  - Requires efficient square-and-multiply
- `ScalarIsZero`, `ScalarIsHigh`, `ScalarEqual` - Comparisons
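
A reference-only sketch of the add-then-conditionally-subtract pattern behind `ScalarAdd`, written over a flat little-endian `[4]uint64` layout rather than the YMM/limb types (an assumption made for readability); the constant is the secp256k1 group order n:

```go
package avx

import "math/bits"

// scalarN is the secp256k1 group order n as four little-endian 64-bit words.
var scalarN = [4]uint64{
	0xBFD25E8CD0364141, 0xBAAEDCE6AF48A03B,
	0xFFFFFFFFFFFFFFFE, 0xFFFFFFFFFFFFFFFF,
}

// scalarAdd computes (a + b) mod n, assuming a, b < n.
func scalarAdd(a, b [4]uint64) [4]uint64 {
	// 256-bit add with carry chain.
	var sum [4]uint64
	var carry uint64
	for i := 0; i < 4; i++ {
		sum[i], carry = bits.Add64(a[i], b[i], carry)
	}

	// Trial subtraction of n.
	var diff [4]uint64
	var borrow uint64
	for i := 0; i < 4; i++ {
		diff[i], borrow = bits.Sub64(sum[i], scalarN[i], borrow)
	}

	// Use diff when a+b overflowed 2^256 or the trial subtraction did not
	// underflow (i.e. the sum was >= n), selected without branching.
	useDiff := carry | (1 - borrow)
	mask := -useDiff // all-ones if useDiff == 1, else zero
	var r [4]uint64
	for i := 0; i < 4; i++ {
		r[i] = (diff[i] & mask) | (sum[i] &^ mask)
	}
	return r
}
```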
### Phase 3: Field Operations (mod p)

Files: `field_amd64.go` (stubs), `field_amd64.s` (assembly)

- `FieldAdd` - Add two field elements mod p
  - Load a into YMM0 and b into YMM1
  - `VPADDQ YMM0, YMM0, YMM1`
  - Handle carries
  - Conditionally subtract p if the result is >= p
- `FieldSub` - Subtract field elements mod p
- `FieldMul` - Multiply field elements mod p
  - The most performance-critical operation
  - 256×256→512-bit product, then reduce mod p
  - secp256k1 has a special structure: p = 2^256 - 2^32 - 977
  - Reduction: since 2^256 ≡ 2^32 + 977 (mod p), multiply the high 256 bits of the product by (2^32 + 977), add them to the low 256 bits, and fold the small remaining overflow once more
- `FieldSqr` - Square a field element (optimized mul(a, a))
  - Saves roughly 25% of the multiplications vs a general multiply
- `FieldInv` - Compute a^(-1) mod p
  - Fermat: a^(p-2) mod p
  - Use an addition chain for efficiency
- `FieldSqrt` - Compute a square root mod p
  - p ≡ 3 (mod 4), so sqrt(a) = a^((p+1)/4) mod p
- `FieldNegate`, `FieldIsZero`, `FieldEqual` - Basic operations
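
A reference-only sketch of the reduction step behind `FieldMul`, again over flat little-endian words rather than YMM limbs (an assumption for readability):

```go
package avx

import "math/bits"

// fieldP is the secp256k1 field prime p = 2^256 - 2^32 - 977,
// as four little-endian 64-bit words.
var fieldP = [4]uint64{
	0xFFFFFFFEFFFFFC2F, 0xFFFFFFFFFFFFFFFF,
	0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF,
}

// fieldReduce512 reduces a 512-bit product t (little-endian 8×64-bit words)
// modulo p, using the identity 2^256 ≡ 2^32 + 977 (mod p).
func fieldReduce512(t [8]uint64) [4]uint64 {
	const c = 0x1000003D1 // 2^32 + 977

	// First fold: r = t[0..3] + t[4..7]*c. Since c < 2^33, r fits in 5 words.
	var r [5]uint64
	copy(r[:4], t[:4])
	for i := 0; i < 4; i++ {
		hi, lo := bits.Mul64(t[4+i], c)
		var carry uint64
		r[i], carry = bits.Add64(r[i], lo, 0)
		r[i+1], carry = bits.Add64(r[i+1], hi, carry)
		for j := i + 2; carry != 0 && j < 5; j++ {
			r[j], carry = bits.Add64(r[j], 0, carry)
		}
	}

	// Second fold: absorb the fifth word the same way.
	hi, lo := bits.Mul64(r[4], c)
	var s [4]uint64
	var carry uint64
	s[0], carry = bits.Add64(r[0], lo, 0)
	s[1], carry = bits.Add64(r[1], hi, carry)
	s[2], carry = bits.Add64(r[2], 0, carry)
	s[3], carry = bits.Add64(r[3], 0, carry)
	if carry != 0 { // overflowed 2^256 once more: add c one final time
		s[0], carry = bits.Add64(s[0], c, 0)
		s[1], carry = bits.Add64(s[1], 0, carry)
		s[2], carry = bits.Add64(s[2], 0, carry)
		s[3], _ = bits.Add64(s[3], 0, carry)
	}

	// Final conditional subtraction brings the result into [0, p).
	var d [4]uint64
	var borrow uint64
	for i := 0; i < 4; i++ {
		d[i], borrow = bits.Sub64(s[i], fieldP[i], borrow)
	}
	if borrow == 0 {
		return d
	}
	return s
}
```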
### Phase 4: Point Operations

Files: `point_amd64.go` (stubs), `point_amd64.s` (assembly)

- `AffineToJacobian` - Convert (x, y) to (x, y, 1)
- `JacobianToAffine` - Convert (X, Y, Z) to (X/Z², Y/Z³)
  - Requires a field inversion
- `JacobianDouble` - Point doubling
  - ~4 field multiplications, ~4 field squarings, ~6 field additions
  - All field ops can use the AVX2 versions
- `JacobianAdd` - Add two Jacobian points
  - ~12 field multiplications, ~4 field squarings
- `JacobianAddAffine` - Add a Jacobian point and an affine point (optimized)
  - ~8 field multiplications, ~3 field squarings
  - The common case in scalar multiplication
- `ScalarMult` - Compute k*P for a scalar k and point P
  - Use windowed NAF or GLV decomposition
  - Core loop: double + conditional add (see the sketch after this list)
- `ScalarBaseMult` - Compute k*G using a precomputed table
  - Precompute multiples of the generator G
  - Faster than general scalar multiplication
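
For orientation, a sketch of the core double-and-add loop behind `ScalarMult`, written against the point helpers planned above (their signatures are assumptions); this is the plain left-to-right variant, not the windowed-NAF/GLV or constant-time version the plan calls for:

```go
package avx

// scalarMultDoubleAdd computes k*P with a simple left-to-right double-and-add
// over the 256-bit scalar k, given as little-endian 64-bit words. The zero
// value of JacobianPoint is assumed to represent the point at infinity (Z = 0).
// A production version would use windowed NAF or GLV and replace the data-
// dependent branch below with a constant-time conditional add.
func scalarMultDoubleAdd(k [4]uint64, p AffinePoint) JacobianPoint {
	var acc JacobianPoint
	for word := 3; word >= 0; word-- {
		for bit := 63; bit >= 0; bit-- {
			acc = JacobianDouble(acc)
			if (k[word]>>uint(bit))&1 == 1 {
				acc = JacobianAddAffine(acc, p)
			}
		}
	}
	return acc
}
```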
### Phase 5: High-Level Operations

Files: `ecdsa.go`, `schnorr.go`

- ECDSA Sign/Verify
- Schnorr Sign/Verify (BIP-340)
- ECDH - Shared secret computation
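
For reference, the signing equation the ECDSA code implements is s = k⁻¹·(z + r·d) mod n, shown here with `math/big` (illustrative only; a real signer uses the constant-time scalar operations from Phase 2):

```go
package avx

import "math/big"

// ecdsaS computes s = k^(-1) * (z + r*d) mod n for ECDSA signing, where z is
// the message hash, d the private key, k the per-signature nonce, r the x
// coordinate of k*G reduced mod n, and n the group order. Reference-only and
// not constant-time.
func ecdsaS(z, r, d, k, n *big.Int) *big.Int {
	kInv := new(big.Int).ModInverse(k, n)
	s := new(big.Int).Mul(r, d)
	s.Add(s, z)
	s.Mul(s, kInv)
	return s.Mod(s, n)
}
```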
## Assembly Conventions

### Register Usage

- YMM0-YMM7: scratch registers (caller-saved)
- YMM8-YMM15: may be used but should be preserved

For our operations:

- YMM0: primary operand/result
- YMM1: secondary operand
- YMM2-YMM5: intermediate calculations
- YMM6-YMM7: constants (field prime, masks, etc.)
### Key AVX2 Instructions

```
; Data movement
VMOVDQU        YMM0, [mem]            ; Load 256 bits, unaligned
VMOVDQA        YMM0, [mem]            ; Load 256 bits, aligned
VBROADCASTI128 YMM0, [mem]            ; Broadcast 128 bits to both lanes

; Arithmetic
VPADDQ      YMM0, YMM1, YMM2          ; Add packed 64-bit integers
VPSUBQ      YMM0, YMM1, YMM2          ; Subtract packed 64-bit integers
VPMULUDQ    YMM0, YMM1, YMM2          ; Multiply the low 32 bits of each 64-bit lane

; Logical
VPAND       YMM0, YMM1, YMM2          ; Bitwise AND
VPOR        YMM0, YMM1, YMM2          ; Bitwise OR
VPXOR       YMM0, YMM1, YMM2          ; Bitwise XOR

; Shifts
VPSLLQ      YMM0, YMM1, imm           ; Shift left logical, 64-bit lanes
VPSRLQ      YMM0, YMM1, imm           ; Shift right logical, 64-bit lanes

; Shuffles and permutes
VPERMQ      YMM0, YMM1, imm           ; Permute 64-bit elements
VPERM2I128  YMM0, YMM1, YMM2, imm     ; Permute 128-bit lanes
VPALIGNR    YMM0, YMM1, YMM2, imm     ; Byte-wise align

; Comparisons
VPCMPEQQ    YMM0, YMM1, YMM2          ; Compare equal, 64-bit lanes
VPCMPGTQ    YMM0, YMM1, YMM2          ; Compare greater-than, 64-bit lanes

; Blending
VPBLENDVB   YMM0, YMM1, YMM2, YMM3    ; Conditional byte blend
```
## Carry Propagation Strategy

The tricky part of 128-bit limb arithmetic is carry propagation, both between the 64-bit halves of a limb and between the two 128-bit limbs.

### Addition Carry Chain

```
Given: A = [A0.Lo, A0.Hi, A1.Lo, A1.Hi]   (256 bits as 4×64)
       B = [B0.Lo, B0.Hi, B1.Lo, B1.Hi]

Step 1: Add with VPADDQ (no carries)
    R = A + B   (per-lane, ignoring overflow)

Step 2: Detect carries (a lane overflowed iff its result is below its input)
    carry_0_to_1 = (R0.Lo < A0.Lo) ? 1 : 0   ; carry from Lo to Hi in limb 0
    carry_1_to_2 = (R0.Hi < A0.Hi) ? 1 : 0   ; carry from limb 0 to limb 1
    carry_2_to_3 = (R1.Lo < A1.Lo) ? 1 : 0   ; carry within limb 1
    carry_out    = (R1.Hi < A1.Hi) ? 1 : 0   ; overflow out of 256 bits

Step 3: Propagate carries
    R0.Hi += carry_0_to_1
    R1.Lo += carry_1_to_2 + (R0.Hi < carry_0_to_1 ? 1 : 0)
    R1.Hi += carry_2_to_3 + ...
```

This is awkward in SIMD. An alternative is to use the ADCX/ADOX instructions (ADX extension) for scalar carry chains, which may be faster for purely sequential operations.
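
For the scalar (non-SIMD) path, the same 256-bit carry chain can be expressed with `math/bits`, which the Go compiler typically lowers to ADD/ADC sequences; a minimal sketch over the four 64-bit lanes named above:

```go
package avx

import "math/bits"

// add256 adds two 256-bit values laid out as [A0.Lo, A0.Hi, A1.Lo, A1.Hi]
// (little-endian 64-bit words) and returns the sum plus the final carry-out.
// Each bits.Add64 consumes the previous lane's carry, which is exactly the
// chain that Steps 2 and 3 above reconstruct by hand in SIMD.
func add256(a, b [4]uint64) (r [4]uint64, carryOut uint64) {
	var c uint64
	r[0], c = bits.Add64(a[0], b[0], 0) // A0.Lo + B0.Lo
	r[1], c = bits.Add64(a[1], b[1], c) // A0.Hi + B0.Hi + carry_0_to_1
	r[2], c = bits.Add64(a[2], b[2], c) // A1.Lo + B1.Lo + carry_1_to_2
	r[3], c = bits.Add64(a[3], b[3], c) // A1.Hi + B1.Hi + carry_2_to_3
	return r, c
}
```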
## Multiplication Strategy

For 128×128→256 multiplication:

```
A = A.Hi * 2^64 + A.Lo
B = B.Hi * 2^64 + B.Lo

A * B = A.Hi*B.Hi * 2^128
      + (A.Hi*B.Lo + A.Lo*B.Hi) * 2^64
      + A.Lo*B.Lo
```

Using MULX (BMI2) for an efficient 64×64→128 multiply (MULX takes one multiplicand implicitly from RDX):

```
MULX r1, r0, A.Lo   ; r1:r0 = A.Lo * B.Lo   (B.Lo in RDX)
MULX r3, r2, A.Hi   ; r3:r2 = A.Hi * B.Lo   (B.Lo in RDX)
... (4 multiplications total, then accumulate)
```
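
The same schoolbook expansion in pure Go, using `bits.Mul64` for the four 64×64→128 partial products (the compiler lowers this to a single full-width hardware multiply); limb names follow the layout above:

```go
package avx

import "math/bits"

// uint128Mul128 computes the full 256-bit product a*b of two 128-bit values,
// returned as (hi, lo) 128-bit halves: four partial products plus the
// cross-term accumulation, exactly as in the expansion above.
func uint128Mul128(a, b Uint128) (hi, lo Uint128) {
	// Partial products.
	p0Hi, p0Lo := bits.Mul64(a.Lo, b.Lo) // A.Lo*B.Lo
	p1Hi, p1Lo := bits.Mul64(a.Lo, b.Hi) // A.Lo*B.Hi
	p2Hi, p2Lo := bits.Mul64(a.Hi, b.Lo) // A.Hi*B.Lo
	p3Hi, p3Lo := bits.Mul64(a.Hi, b.Hi) // A.Hi*B.Hi

	lo.Lo = p0Lo

	// Second column: p0Hi + p1Lo + p2Lo; the carries feed the third column.
	var c1, c2 uint64
	lo.Hi, c1 = bits.Add64(p0Hi, p1Lo, 0)
	lo.Hi, c2 = bits.Add64(lo.Hi, p2Lo, 0)

	// Third column: p1Hi + p2Hi + p3Lo plus the carries from the second column.
	var c3, c4 uint64
	hi.Lo, c3 = bits.Add64(p1Hi, p2Hi, c1)
	hi.Lo, c4 = bits.Add64(hi.Lo, p3Lo, c2)

	// Top column: p3Hi plus whatever carried out of the third column.
	hi.Hi = p3Hi + c3 + c4
	return hi, lo
}
```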
## Testing Strategy

- Unit tests for each operation, comparing against the reference implementation (main package)
- Edge cases: zero, one, maximum values, values near the modulus
- Random tests: generate random inputs and compare results
- Benchmark comparisons: AVX2 vs the pure Go implementation
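
A sketch of the compare-against-reference approach using `math/big` as the oracle; the `fieldAdd` under test and its flat `[4]uint64` interface are hypothetical stand-ins for illustration:

```go
package avx

import (
	"crypto/rand"
	"math/big"
	"testing"
)

// TestFieldAddAgainstBig cross-checks a limb-based fieldAdd (hypothetical
// signature: func fieldAdd(a, b [4]uint64) [4]uint64) against math/big.
func TestFieldAddAgainstBig(t *testing.T) {
	// p = 2^256 - 2^32 - 977
	p := new(big.Int).Lsh(big.NewInt(1), 256)
	p.Sub(p, new(big.Int).Lsh(big.NewInt(1), 32))
	p.Sub(p, big.NewInt(977))

	for i := 0; i < 1000; i++ {
		aBig, _ := rand.Int(rand.Reader, p)
		bBig, _ := rand.Int(rand.Reader, p)

		got := fieldAdd(toWords(aBig), toWords(bBig))

		want := new(big.Int).Add(aBig, bBig)
		want.Mod(want, p)
		if toBig(got).Cmp(want) != 0 {
			t.Fatalf("fieldAdd mismatch for a=%v b=%v", aBig, bBig)
		}
	}
}

// toWords and toBig convert between little-endian [4]uint64 and math/big.
func toWords(x *big.Int) (w [4]uint64) {
	var buf [32]byte
	x.FillBytes(buf[:]) // big-endian, zero-padded
	for i := 0; i < 4; i++ {
		for j := 0; j < 8; j++ {
			w[i] |= uint64(buf[31-(i*8+j)]) << (8 * j)
		}
	}
	return w
}

func toBig(w [4]uint64) *big.Int {
	x := new(big.Int)
	for i := 3; i >= 0; i-- {
		x.Lsh(x, 64)
		x.Or(x, new(big.Int).SetUint64(w[i]))
	}
	return x
}
```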
## File Structure

```
avx/
├── IMPLEMENTATION_PLAN.md   (this file)
├── types.go                 (type definitions)
├── uint128.go               (pure Go fallback)
├── uint128_amd64.go         (Go stubs for assembly)
├── uint128_amd64.s          (AVX2 assembly)
├── scalar.go                (pure Go fallback)
├── scalar_amd64.go          (Go stubs)
├── scalar_amd64.s           (AVX2 assembly)
├── field.go                 (pure Go fallback)
├── field_amd64.go           (Go stubs)
├── field_amd64.s            (AVX2 assembly)
├── point.go                 (pure Go fallback)
├── point_amd64.go           (Go stubs)
├── point_amd64.s            (AVX2 assembly)
├── avx_test.go              (tests)
└── bench_test.go            (benchmarks)
```
## Performance Targets

Compared to the current pure Go implementation:

- Scalar multiplication: 2-3x faster
- Field multiplication: 2-4x faster
- Point operations: 2-3x faster (dominated by field ops)
- ECDSA sign/verify: 2-3x faster overall
## Dependencies

- Go 1.21+ (for assembly support)
- CPU with AVX2 support (Intel Haswell+, AMD Excavator+)
- Optional: BMI2 for MULX instruction (faster 64×64→128 multiply)
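
One way the package could select the AVX2 path at startup, using feature detection from `golang.org/x/sys/cpu` (the flag names are assumptions):

```go
//go:build amd64

package avx

import "golang.org/x/sys/cpu"

// Feature flags consulted by the Go stubs to decide between the AVX2
// assembly and the pure Go fallback. BMI2 additionally enables the
// MULX-based 64×64→128 multiply paths.
var (
	hasAVX2 = cpu.X86.HasAVX2
	hasBMI2 = cpu.X86.HasBMI2
)
```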