Verification Performance Analysis: NextP256K vs P256K1

Summary

NextP256K's verification is 4.7x faster than p256k1 (40,017 ns/op vs 186,054 ns/op) because it uses libsecp256k1's highly optimized C implementation, while p256k1 uses a simple binary multiplication algorithm.

Root Cause

The performance bottleneck is in EcmultConst, which is used to compute e*P during Schnorr verification.

Schnorr Verification Algorithm

// SchnorrVerify verifies a Schnorr signature following BIP-340
func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
	// ... validation ...
	
	// Compute R = s*G - e*P
	// First compute s*G
	var sG GroupElementJacobian
	EcmultGen(&sG, &s)  // Fast: uses optimized precomputed tables

	// Compute e*P where P is the x-only pubkey
	var eP GroupElementJacobian
	EcmultConst(&eP, &pk, &e)  // Slow: uses simple binary method
	
	// ... rest of verification ...
}

Performance Breakdown

  1. s*G computation (EcmultGen):

    • Uses 8-bit byte-based precomputed tables (see the sketch after this list)
    • Highly optimized: ~58,618 ns/op for pubkey derivation
    • Fast because the generator point G is fixed and precomputed
  2. e*P computation (EcmultConst):

    • Uses simple binary method with 256 iterations
    • Each iteration: double, check bit, potentially add
    • This is the bottleneck
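
To make item 1 concrete, the fixed-base structure behind a fast EcmultGen can be sketched as follows. This is a hedged illustration rather than p256k1's actual code: the table name genTable and the helpers setInfinity and setFromAffine are assumed for the sketch, while GroupElementJacobian, GroupElementAffine, Scalar, getBits, addVar, and isInfinity are the names used elsewhere in this document. The key property is that all doublings are baked into the table, so a 256-bit multiplication by G costs only table lookups and additions.

// Sketch of a byte-indexed fixed-base table for G (hypothetical layout).
// genTable[i][b] would hold b * 2^(8*i) * G, built once at init time.
var genTable [32][256]GroupElementAffine

func ecmultGenSketch(r *GroupElementJacobian, k *Scalar) {
	r.setInfinity() // assumed helper: start from the point at infinity
	// Walk the scalar one byte at a time: 32 lookups and at most 32
	// additions, with no doublings in the loop.
	for i := 0; i < 32; i++ {
		b := k.getBits(uint(8*i), 8)
		if b == 0 {
			continue
		}
		var t GroupElementJacobian
		t.setFromAffine(&genTable[i][b]) // assumed affine-to-Jacobian conversion
		if r.isInfinity() {
			*r = t
		} else {
			r.addVar(r, &t)
		}
	}
}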

Current EcmultConst Implementation

// EcmultConst computes r = q * a using constant-time multiplication
// This is a simplified implementation for Phase 3 - can be optimized later
func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
	// ... edge cases ...
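	// NOTE: base is the input point a in Jacobian form, set up in the
	// edge-case handling elided above.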
	
	// Process bits from MSB to LSB
	for i := 0; i < 256; i++ {
		if i > 0 {
			r.double(r)
		}
		
		// Get bit i (from MSB)
		bit := q.getBits(uint(255-i), 1)
		if bit != 0 {
			if r.isInfinity() {
				*r = base
			} else {
				r.addVar(r, &base)
			}
		}
	}
}

Problem: This performs 256 iterations, each requiring:

  • One point doubling operation
  • One bit extraction
  • Potentially one point addition

For each verification, the e*P computation alone costs roughly 256 point doublings plus up to 256 point additions, which is extremely inefficient.

Why NextP256K is Faster

NextP256K uses libsecp256k1's optimized C implementation (secp256k1_ecmult_const) which:

  1. Uses GLV Endomorphism:

    • Splits the scalar into two smaller components using the curve's endomorphism (see the note after this list)
    • Computes two smaller multiplications instead of one large one
    • Reduces the effective bit length from 256 to ~128 bits per component
  2. Windowed Precomputation:

    • Precomputes a table of multiples of the base point
    • Uses windowed lookups instead of processing bits one at a time
    • Processes multiple bits per iteration (typically 4-6 bits at a time)
  3. Signed-Digit Representation:

    • Uses a more efficient representation that reduces the number of additions
    • Minimizes the number of point operations required
  4. Assembly Optimizations:

    • Field arithmetic operations are optimized in assembly
    • Hand-tuned for specific CPU architectures
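
For reference, the GLV decomposition in item 1 works because secp256k1 has an efficiently computable endomorphism: there is a field constant beta such that phi(x, y) = (beta*x, y) maps curve points to curve points and satisfies phi(P) = lambda*P for a fixed constant lambda modulo the group order n. A 256-bit scalar q is split as q = q1 + q2*lambda (mod n) with q1 and q2 each roughly 128 bits, so q*P = q1*P + q2*phi(P); both halves are then evaluated in a single shared double-and-add pass (Shamir's trick), roughly halving the number of doublings compared with processing q directly.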

Reference Implementation

The C reference shows the complexity:

static void secp256k1_ecmult_const(secp256k1_gej *r, const secp256k1_ge *a, const secp256k1_scalar *q) {
    /* The approach below combines the signed-digit logic from Mike Hamburg's
     * "Fast and compact elliptic-curve cryptography" (https://eprint.iacr.org/2012/309)
     * Section 3.3, with the GLV endomorphism.
     * ... */
    
    /* Precompute table for base point and lambda * base point */
    
    /* Process bits in groups using windowed lookups */
    for (group = ECMULT_CONST_GROUPS - 1; group >= 0; --group) {
        /* Lookup precomputed points */
        ECMULT_CONST_TABLE_GET_GE(&t, pre_a, bits1);
        /* ... */
    }
}

Performance Impact

Benchmark Results

Operation            P256K1           NextP256K        Speedup
Verification         186,054 ns/op    40,017 ns/op     4.7x
Signing              31,937 ns/op     52,060 ns/op     0.6x (slower)
Pubkey Derivation    58,618 ns/op     280,835 ns/op    0.2x (slower)

Note: NextP256K is slower for signing and pubkey derivation because the fixed CGO call overhead dominates those smaller operations, but it is much faster for verification, where the larger e*P computation amortizes that overhead and the algorithmic advantages dominate.
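
The ns/op figures above are in the format produced by Go's testing benchmarks. A minimal harness along the following lines reproduces the shape of such a measurement; it is a hypothetical sketch (the actual benchmark code is not shown in this document), and loadTestVector is a placeholder for whatever fixture provides a valid BIP-340 signature, message, and x-only public key.

// Hypothetical benchmark sketch; would live in a _test.go file and import "testing".
func BenchmarkSchnorrVerify(b *testing.B) {
	sig64, msg32, pub := loadTestVector() // placeholder: returns a known-good BIP-340 vector
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if !SchnorrVerify(sig64, msg32, pub) {
			b.Fatal("verification unexpectedly failed")
		}
	}
}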

Optimization Opportunities

To improve p256k1's verification performance, EcmultConst should be optimized to:

  1. Implement GLV Endomorphism:

    • Split scalar using secp256k1's endomorphism
    • Compute two smaller multiplications
    • Combine results
  2. Add Windowed Precomputation:

    • Precompute a table of multiples of the base point
    • Process bits in groups (windows) instead of individually
    • Use lookup tables instead of repeated additions (see the sketch after this list)
  3. Consider Variable-Time Optimization:

    • For verification (public operation), variable-time algorithms are acceptable
    • Could use Ecmult instead of EcmultConst if constant-time isn't required
  4. Implement Signed-Digit Representation:

    • Use a signed-digit window representation so every nonzero window costs exactly one addition
    • Reduce the number of additions required
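
A minimal sketch of the windowed, variable-time approach from items 2 and 3 is shown below. It reuses the names that appear elsewhere in this document (GroupElementJacobian, Scalar, getBits, double, addVar, isInfinity); the Jacobian input, the setInfinity helper, and the exact getBits semantics are assumptions, so treat this as an illustration of the structure rather than a drop-in implementation: one small precomputed table, four doublings per window, and at most one addition per window instead of one per bit.

// ecmultWindowedSketch computes r = q * a with a fixed 4-bit window.
// Variable-time: branches depend on the scalar, so it is only appropriate
// for public inputs such as signature verification.
func ecmultWindowedSketch(r *GroupElementJacobian, a *GroupElementJacobian, q *Scalar) {
	const w = 4
	// table[i] = i * a for i = 1..15; index 0 is never used because a
	// zero window contributes nothing.
	var table [1 << w]GroupElementJacobian
	table[1] = *a
	for i := 2; i < len(table); i++ {
		table[i].addVar(&table[i-1], a)
	}

	r.setInfinity() // assumed helper: start from the point at infinity
	// 64 windows of 4 bits each, most significant window first.
	for win := (256 / w) - 1; win >= 0; win-- {
		// Shift the accumulator by one window: w doublings.
		if !r.isInfinity() {
			for j := 0; j < w; j++ {
				r.double(r)
			}
		}
		bits := q.getBits(uint(win*w), w)
		if bits != 0 {
			if r.isInfinity() {
				*r = table[bits]
			} else {
				r.addVar(r, &table[bits])
			}
		}
	}
}

Per 256-bit scalar this costs about 14 additions to build the table, at most ~252 doublings, and at most 64 additions in the main loop; combined with a GLV split of the scalar, the doubling count drops to roughly 128, which is where the counts in the next section come from.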

Complexity Comparison

Current (Simple Binary Method)

  • Operations: ~256 point doublings + up to 256 point additions (~128 on average)
  • Complexity: roughly 380-510 point operations per e*P

Optimized (Windowed + GLV)

  • Operations: ~128 doublings + ~64 additions (GLV halves the effective bit length; a 4-bit window costs at most one addition per window)
  • Complexity: roughly 190-210 point operations including table setup, a 2-2.5x reduction

With Assembly Optimizations

  • Additional: 2-3x speedup from optimized field arithmetic
  • Total: roughly 4-7x faster than the simple binary method, consistent with the observed 4.7x verification gap

Conclusion

The 4.7x performance difference is primarily due to:

  1. Algorithmic efficiency: Windowed multiplication vs. simple binary method
  2. GLV endomorphism: Splitting scalar into smaller components
  3. Assembly optimizations: Hand-tuned field arithmetic in C
  4. Better memory access patterns: Precomputed tables vs. repeated computations

The optimization is non-trivial and would require implementing:

  • GLV endomorphism support
  • Windowed precomputation tables
  • A signed-digit window representation
  • Potentially assembly optimizations for field arithmetic

For now, NextP256K's advantage in verification is expected given its use of the mature, highly optimized libsecp256k1 C library.