AVX2 secp256k1 Implementation Plan

Overview

This implementation uses 128-bit limbs with AVX2 256-bit registers for secp256k1 cryptographic operations. The key insight is that AVX2's YMM registers can hold two 128-bit values, enabling efficient parallel processing.

Data Layout

Register Mapping

Type           Size     AVX2 Representation             Registers
Uint128        128-bit  1×128-bit in XMM or half YMM    0.5 YMM
Scalar         256-bit  2×128-bit limbs                 1 YMM
FieldElement   256-bit  2×128-bit limbs                 1 YMM
AffinePoint    512-bit  2×FieldElement (x, y)           2 YMM
JacobianPoint  768-bit  3×FieldElement (x, y, z)        3 YMM

Memory Layout

Uint128:
  [Lo:64][Hi:64] = 128 bits

Scalar/FieldElement (in YMM register):
  YMM = [D[0].Lo:64][D[0].Hi:64][D[1].Lo:64][D[1].Hi:64]
        ├─── 128-bit limb 0 ────┤├─── 128-bit limb 1 ────┤

AffinePoint (2 YMM registers):
  YMM0 = X coordinate (256 bits)
  YMM1 = Y coordinate (256 bits)

JacobianPoint (3 YMM registers):
  YMM0 = X coordinate (256 bits)
  YMM1 = Y coordinate (256 bits)
  YMM2 = Z coordinate (256 bits)
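
These layouts map directly onto Go struct definitions. A sketch of the planned types.go; the exact field names are assumptions, chosen to match the D[i].Lo/Hi notation used above:

package avx

// Uint128 is a 128-bit unsigned integer held as two 64-bit halves.
type Uint128 struct {
	Lo, Hi uint64
}

// Scalar and FieldElement are 256-bit values stored as two 128-bit limbs,
// so each fits in exactly one YMM register.
type Scalar struct{ D [2]Uint128 }
type FieldElement struct{ D [2]Uint128 }

// AffinePoint and JacobianPoint compose field elements and occupy
// two and three YMM registers respectively.
type AffinePoint struct{ X, Y FieldElement }
type JacobianPoint struct{ X, Y, Z FieldElement }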

Implementation Phases

Phase 1: Core 128-bit Operations

File: uint128_amd64.s

  1. uint128Add - Add two 128-bit values with carry out (a pure-Go fallback for items 1-3 is sketched after this list)

    • Instructions: ADDQ, ADCQ (these operate on general-purpose registers, so the 64-bit halves live in GPRs here, not XMM)
    • Input: a, b as Lo/Hi GPR pairs
    • Output: result in a GPR pair, carry flag
  2. uint128Sub - Subtract with borrow

    • Instructions: SUBQ, SBBQ
  3. uint128Mul - Multiply two 64-bit values to get 128-bit result

    • Instructions: MULQ or MULX (scalar); a SIMD route needs several VPMULUDQ steps on 32-bit halves, since VPMULUDQ multiplies only the low 32 bits of each 64-bit lane
  4. uint128Mul128 - Full 128×128→256 multiplication

    • This is the critical operation for field/scalar multiplication
    • Uses Karatsuba or schoolbook with VPMULUDQ
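
A minimal pure-Go fallback sketch for these primitives (the planned uint128.go), built on math/bits; the function names follow the plan, the signatures are assumptions:

package avx

import "math/bits"

// uint128Add returns a+b and the carry out of bit 127.
func uint128Add(a, b Uint128) (r Uint128, carry uint64) {
	var c uint64
	r.Lo, c = bits.Add64(a.Lo, b.Lo, 0)
	r.Hi, carry = bits.Add64(a.Hi, b.Hi, c)
	return
}

// uint128Sub returns a-b and the borrow out of bit 127.
func uint128Sub(a, b Uint128) (r Uint128, borrow uint64) {
	var bw uint64
	r.Lo, bw = bits.Sub64(a.Lo, b.Lo, 0)
	r.Hi, borrow = bits.Sub64(a.Hi, b.Hi, bw)
	return
}

// uint128Mul returns the full 128-bit product of two 64-bit values.
func uint128Mul(a, b uint64) Uint128 {
	hi, lo := bits.Mul64(a, b)
	return Uint128{Lo: lo, Hi: hi}
}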

Phase 2: Scalar Operations (mod n)

File: scalar_amd64.go (stubs), scalar_amd64.s (assembly)

  1. ScalarAdd - Add two scalars mod n (a pure-Go sketch follows this list)

    Load a into YMM0
    Load b into YMM1
    VPADDQ YMM0, YMM0, YMM1    ; parallel add of 64-bit lanes
    Handle carries between 64-bit lanes
    Conditional subtract n if >= n
    
  2. ScalarSub - Subtract scalars mod n

    • Similar to add but with VPSUBQ and conditional add of n
  3. ScalarMul - Multiply scalars mod n

    • Compute 512-bit product using 128×128 multiplications
    • Reduce mod n using Barrett or Montgomery reduction
    • 512-bit intermediate fits in 2 YMM registers
  4. ScalarNegate - Compute -a mod n

    • n - a using subtraction
  5. ScalarInverse - Compute a^(-1) mod n

    • Use Fermat's little theorem: a^(n-2) mod n
    • Requires efficient square-and-multiply
  6. ScalarIsZero, ScalarIsHigh, ScalarEqual - Comparisons
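
A pure-Go sketch of ScalarAdd for the scalar.go fallback: a four-word carry chain followed by a constant-time conditional subtraction of n. The flat [4]uint64 little-endian representation is an assumption made for brevity:

package avx

import "math/bits"

// scalarN is the secp256k1 group order n, little-endian 64-bit words.
var scalarN = [4]uint64{
	0xBFD25E8CD0364141, 0xBAAEDCE6AF48A03B,
	0xFFFFFFFFFFFFFFFE, 0xFFFFFFFFFFFFFFFF,
}

// scalarAdd returns (a + b) mod n without branching on secret data.
func scalarAdd(a, b [4]uint64) (r [4]uint64) {
	var c uint64
	for i := 0; i < 4; i++ {
		r[i], c = bits.Add64(a[i], b[i], c)
	}
	// Trial subtraction: borrow == 0 means r >= n.
	var s [4]uint64
	var bw uint64
	for i := 0; i < 4; i++ {
		s[i], bw = bits.Sub64(r[i], scalarN[i], bw)
	}
	// Keep the subtracted value if the add carried out of 2^256 or r >= n.
	mask := -(c | (bw ^ 1)) // all-ones or all-zeros
	for i := 0; i < 4; i++ {
		r[i] = (s[i] & mask) | (r[i] &^ mask)
	}
	return
}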

Phase 3: Field Operations (mod p)

File: field_amd64.go (stubs), field_amd64.s (assembly)

  1. FieldAdd - Add two field elements mod p

    Load a into YMM0
    Load b into YMM1
    VPADDQ YMM0, YMM0, YMM1
    Handle carries
    Conditional subtract p if >= p
    
  2. FieldSub - Subtract field elements mod p

  3. FieldMul - Multiply field elements mod p

    • Most critical operation for performance
    • 256×256→512 bit product, then reduce mod p
    • secp256k1 has special structure: p = 2^256 - 2^32 - 977
    • Reduction: since 2^256 ≡ 2^32 + 977 (mod p), fold the high 256 bits back in by multiplying them by (2^32 + 977) and adding to the low 256 bits; two folds plus a final conditional subtract finish the reduction (demonstrated in the sketch after this list)
  4. FieldSqr - Square a field element (optimized mul(a,a))

    • Can save ~25% multiplications vs general multiply
  5. FieldInv - Compute a^(-1) mod p

    • Fermat: a^(p-2) mod p
    • Use addition chain for efficiency
  6. FieldSqrt - Compute square root mod p

    • p ≡ 3 (mod 4), so sqrt(a) = a^((p+1)/4) mod p
  7. FieldNegate, FieldIsZero, FieldEqual - Basic operations
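
The fold in item 3 can be checked with math/big. A runnable sketch (clarity over speed) showing that for a 512-bit product t = hi·2^256 + lo, t ≡ lo + hi·(2^32 + 977) (mod p):

package main

import (
	"fmt"
	"math/big"
)

func main() {
	one := big.NewInt(1)
	p := new(big.Int).Lsh(one, 256)
	p.Sub(p, new(big.Int).Lsh(one, 32))
	p.Sub(p, big.NewInt(977)) // p = 2^256 - 2^32 - 977

	c := new(big.Int).Lsh(one, 32)
	c.Add(c, big.NewInt(977)) // c = 2^32 + 977, i.e. 2^256 mod p

	t := new(big.Int).Exp(big.NewInt(7), big.NewInt(181), nil) // arbitrary ~509-bit value
	lo := new(big.Int).And(t, new(big.Int).Sub(new(big.Int).Lsh(one, 256), one))
	hi := new(big.Int).Rsh(t, 256)

	folded := new(big.Int).Add(lo, new(big.Int).Mul(hi, c))
	fmt.Println(new(big.Int).Mod(folded, p).Cmp(new(big.Int).Mod(t, p)) == 0) // true
}

One fold shrinks a 512-bit value to roughly 289 bits, which is why a real FieldMul applies the fold twice before the conditional subtract.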

Phase 4: Point Operations

File: point_amd64.go (stubs), point_amd64.s (assembly)

  1. AffineToJacobian - Convert (x, y) to (x, y, 1)

  2. JacobianToAffine - Convert (X, Y, Z) to (X/Z², Y/Z³)

    • Requires field inversion
  3. JacobianDouble - Point doubling

    • ~4 field multiplications, ~4 field squarings, ~6 field additions
    • All field ops can use AVX2 versions
  4. JacobianAdd - Add two Jacobian points

    • ~12 field multiplications, ~4 field squarings
  5. JacobianAddAffine - Add Jacobian + Affine (optimized)

    • ~8 field multiplications, ~3 field squarings
    • Common case in scalar multiplication
  6. ScalarMult - Compute k*P for scalar k and point P

    • Use windowed NAF or GLV decomposition
    • Core loop: double + conditional add (a minimal form is sketched after this list)
  7. ScalarBaseMult - Compute k*G using precomputed table

    • Precompute multiples of generator G
    • Faster than general scalar mult
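
For reference, the core loop of item 6 in its simplest (non-windowed) form. A sketch only: the Double, Add, and SetInfinity method names are assumptions, and the secret-dependent branch would be replaced by constant-time selection in production:

// scalarMult computes k*P by plain double-and-add, MSB first.
// k is four little-endian 64-bit words.
func scalarMult(k [4]uint64, p JacobianPoint) JacobianPoint {
	var r JacobianPoint
	r.SetInfinity()
	for i := 255; i >= 0; i-- {
		r = r.Double()
		if (k[i/64]>>(i%64))&1 == 1 { // NOTE: not constant time
			r = r.Add(p)
		}
	}
	return r
}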

Phase 5: High-Level Operations

File: ecdsa.go, schnorr.go

  1. ECDSA Sign/Verify
  2. Schnorr Sign/Verify (BIP-340)
  3. ECDH - Shared secret computation

Assembly Conventions

Register Usage

YMM0-YMM15: Caller-saved scratch; neither the System V ABI nor Go's assembly
            conventions preserve any vector registers across calls

For our operations:
YMM0: Primary operand/result
YMM1: Secondary operand
YMM2-YMM5: Intermediate calculations
YMM6-YMM7: Constants (field prime, masks, etc.)

Key AVX2 Instructions

; Data movement
VMOVDQU     YMM0, [mem]       ; Load 256 bits unaligned
VMOVDQA     YMM0, [mem]       ; Load 256 bits aligned
VBROADCASTI128 YMM0, [mem]    ; Broadcast 128-bit to both lanes

; Arithmetic
VPADDQ      YMM0, YMM1, YMM2  ; Add packed 64-bit integers
VPSUBQ      YMM0, YMM1, YMM2  ; Subtract packed 64-bit integers
VPMULUDQ    YMM0, YMM1, YMM2  ; Multiply low 32-bits of each 64-bit lane

; Logical
VPAND       YMM0, YMM1, YMM2  ; Bitwise AND
VPOR        YMM0, YMM1, YMM2  ; Bitwise OR
VPXOR       YMM0, YMM1, YMM2  ; Bitwise XOR

; Shifts
VPSLLQ      YMM0, YMM1, imm   ; Shift left logical 64-bit
VPSRLQ      YMM0, YMM1, imm   ; Shift right logical 64-bit

; Shuffles and permutes
VPERMQ      YMM0, YMM1, imm   ; Permute 64-bit elements
VPERM2I128  YMM0, YMM1, YMM2, imm ; Permute 128-bit lanes
VPALIGNR    YMM0, YMM1, YMM2, imm ; Byte align

; Comparisons
VPCMPEQQ    YMM0, YMM1, YMM2  ; Compare equal 64-bit
VPCMPGTQ    YMM0, YMM1, YMM2  ; Compare greater than 64-bit

; Blending
VPBLENDVB   YMM0, YMM1, YMM2, YMM3 ; Conditional blend

Carry Propagation Strategy

The tricky part of 128-bit limb arithmetic is carry propagation between the 64-bit halves and between the two 128-bit limbs.

Addition Carry Chain

Given: A = [A0.Lo, A0.Hi, A1.Lo, A1.Hi]  (256 bits as 4×64)
       B = [B0.Lo, B0.Hi, B1.Lo, B1.Hi]

Step 1: Add with VPADDQ (no carries)
  R = A + B (per-lane, ignoring overflow)

Step 2: Detect carries
  carry_0_to_1 = (R0.Lo < A0.Lo) ? 1 : 0  ; carry from Lo to Hi in limb 0
  carry_1_to_2 = (R0.Hi < A0.Hi) ? 1 : 0  ; carry from limb 0 to limb 1
  carry_2_to_3 = (R1.Lo < A1.Lo) ? 1 : 0  ; carry within limb 1
  carry_out    = (R1.Hi < A1.Hi) ? 1 : 0  ; overflow

Step 3: Propagate carries
  R0.Hi += carry_0_to_1
  R1.Lo += carry_1_to_2 + (R0.Hi < carry_0_to_1 ? 1 : 0)
  R1.Hi += carry_2_to_3 + ...

This is complex in SIMD. Alternative: use ADCX/ADOX instructions (ADX extension) for scalar carry chains, which may be faster for sequential operations.
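
The compare-based detection from Step 2, expressed in Go: after a wrapping add, r < a (unsigned) exactly when the lane overflowed. This is the predicate a SIMD version must emulate; since VPCMPGTQ compares signed 64-bit values, both operands are first biased by 2^63 (XOR with 0x8000000000000000) to turn the unsigned comparison into a signed one:

// carryOut adds two 64-bit lanes and reports the carry.
func carryOut(a, b uint64) (r, carry uint64) {
	r = a + b // wraps mod 2^64
	if r < a { // true iff the addition overflowed
		carry = 1
	}
	return
}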

Multiplication Strategy

For 128×128→256 multiplication:

A = A.Hi * 2^64 + A.Lo
B = B.Hi * 2^64 + B.Lo

A * B = A.Hi*B.Hi * 2^128
      + (A.Hi*B.Lo + A.Lo*B.Hi) * 2^64
      + A.Lo*B.Lo

Using MULX (BMI2) for efficient 64×64→128 (the second source operand is implicit in RDX, which holds B.Lo here):
  MULX r1, r0, A.Lo    ; r1:r0 = A.Lo * B.Lo
  MULX r3, r2, A.Hi    ; r3:r2 = A.Hi * B.Lo
  ... (4 multiplications total, then accumulate)
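
The same four-product shape in pure Go with math/bits, as a sketch of the planned uint128Mul128 fallback:

package avx

import "math/bits"

// uint128Mul128 returns the 256-bit product a*b as (hi, lo).
func uint128Mul128(a, b Uint128) (hi, lo Uint128) {
	p0h, p0l := bits.Mul64(a.Lo, b.Lo) // bits   0..127
	p1h, p1l := bits.Mul64(a.Hi, b.Lo) // bits  64..191
	p2h, p2l := bits.Mul64(a.Lo, b.Hi) // bits  64..191
	p3h, p3l := bits.Mul64(a.Hi, b.Hi) // bits 128..255

	lo.Lo = p0l
	var c1, c2, c3, c4 uint64
	lo.Hi, c1 = bits.Add64(p0h, p1l, 0)
	lo.Hi, c2 = bits.Add64(lo.Hi, p2l, 0)
	hi.Lo, c3 = bits.Add64(p1h, p2h, c1)
	hi.Lo, c4 = bits.Add64(hi.Lo, p3l, c2)
	hi.Hi = p3h + c3 + c4 // cannot overflow: a*b < 2^256
	return
}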

Testing Strategy

  1. Unit tests for each operation comparing against reference (main package)
  2. Edge cases: zero, one, max values, values near modulus
  3. Random tests: generate random inputs, compare results (sketched below)
  4. Benchmark comparisons: AVX2 vs pure Go implementation
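
A sketch of the random differential test from item 3, checking field multiplication against math/big (fieldMul, fromBig, and toBig are hypothetical stand-ins for the package's real API):

package avx

import (
	"crypto/rand"
	"math/big"
	"testing"
)

func TestFieldMulRandom(t *testing.T) {
	p, _ := new(big.Int).SetString(
		"fffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f", 16)
	for i := 0; i < 1000; i++ {
		a, _ := rand.Int(rand.Reader, p)
		b, _ := rand.Int(rand.Reader, p)
		want := new(big.Int).Mul(a, b)
		want.Mod(want, p)
		got := fieldMul(fromBig(a), fromBig(b)) // hypothetical helpers
		if toBig(got).Cmp(want) != 0 {
			t.Fatalf("fieldMul(%v, %v) = %v, want %v", a, b, toBig(got), want)
		}
	}
}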

File Structure

avx/
├── IMPLEMENTATION_PLAN.md  (this file)
├── types.go                (type definitions)
├── uint128.go              (pure Go fallback)
├── uint128_amd64.go        (Go stubs for assembly)
├── uint128_amd64.s         (AVX2 assembly)
├── scalar.go               (pure Go fallback)
├── scalar_amd64.go         (Go stubs)
├── scalar_amd64.s          (AVX2 assembly)
├── field.go                (pure Go fallback)
├── field_amd64.go          (Go stubs)
├── field_amd64.s           (AVX2 assembly)
├── point.go                (pure Go fallback)
├── point_amd64.go          (Go stubs)
├── point_amd64.s           (AVX2 assembly)
├── avx_test.go             (tests)
└── bench_test.go           (benchmarks)

Performance Targets

Compared to the current pure Go implementation:

  • Scalar multiplication: 2-3x faster
  • Field multiplication: 2-4x faster
  • Point operations: 2-3x faster (dominated by field ops)
  • ECDSA sign/verify: 2-3x faster overall

Dependencies

  • Go 1.21+ toolchain
  • CPU with AVX2 support (Intel Haswell+, AMD Excavator+)
  • Optional: BMI2 for MULX instruction (faster 64×64→128 multiply)
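
Runtime dispatch can lean on golang.org/x/sys/cpu. A sketch (the variable names are assumptions); each public operation would test hasAVX2 and otherwise call the pure Go fallback:

package avx

import "golang.org/x/sys/cpu"

var (
	hasAVX2 = cpu.X86.HasAVX2 // gate for the *_amd64.s fast paths
	hasBMI2 = cpu.X86.HasBMI2 // gate for MULX-based multiplication
)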