This commit introduces a new `ecmultWindowedVar` function that implements optimized windowed multiplication for scalar multiplication, significantly improving performance during verification operations. The existing `Ecmult` function is updated to utilize this new implementation, converting points to affine coordinates for efficiency. Additionally, the `EcmultConst` function is retained for constant-time operations. The changes also include enhancements to the generator multiplication context, utilizing precomputed byte points for improved efficiency. Overall, these optimizations lead to a notable reduction in operation times for cryptographic computations.
Verification Performance Analysis: NextP256K vs P256K1
Summary
NextP256K's verification is 4.7x faster than p256k1 (40,017 ns/op vs 186,054 ns/op) because it uses libsecp256k1's highly optimized C implementation, while p256k1 uses a simple binary multiplication algorithm.
Root Cause
The performance bottleneck is in EcmultConst, which is used to compute e*P during Schnorr verification.
Schnorr Verification Algorithm
// SchnorrVerify verifies a Schnorr signature following BIP-340
func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
    // ... validation ...

    // Compute R = s*G - e*P
    // First compute s*G
    var sG GroupElementJacobian
    EcmultGen(&sG, &s) // Fast: uses optimized precomputed tables

    // Compute e*P where P is the x-only pubkey
    var eP GroupElementJacobian
    EcmultConst(&eP, &pk, &e) // Slow: uses simple binary method

    // ... rest of verification ...
}
Performance Breakdown
- s*G computation (EcmultGen):
  - Uses 8-bit byte-based precomputed tables
  - Highly optimized: ~58,618 ns/op for pubkey derivation
  - Fast because the generator point G is fixed and precomputed
- e*P computation (EcmultConst):
  - Uses simple binary method with 256 iterations
  - Each iteration: double, check bit, potentially add
  - This is the bottleneck
Current EcmultConst Implementation
// EcmultConst computes r = q * a using constant-time multiplication
// This is a simplified implementation for Phase 3 - can be optimized later
func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
    // ... edge cases ...

    // Process bits from MSB to LSB
    for i := 0; i < 256; i++ {
        if i > 0 {
            r.double(r)
        }
        // Get bit i (from MSB)
        bit := q.getBits(uint(255-i), 1)
        if bit != 0 {
            if r.isInfinity() {
                *r = base
            } else {
                r.addVar(r, &base)
            }
        }
    }
}
Problem: This performs 256 iterations, each requiring:
- One point doubling (skipped on the first iteration)
- One bit extraction
- Potentially one point addition
For verification, this means 255 doublings plus up to 255 additions (about 128 on average for a random scalar) for every e*P computation, with no precomputation or windowing to reduce the count.
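The operation count can be checked with a small sketch that mirrors the loop structure of the listing above without doing any curve arithmetic. The names `countBinaryOps` and `countHex` are hypothetical helpers introduced only for this illustration; they are not part of p256k1.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// countBinaryOps mirrors the loop of the EcmultConst listing: one doubling
// per iteration after the first, and one addition per set bit except the
// first set bit, which is a plain assignment because r is still infinity.
func countBinaryOps(q *big.Int) (doublings, additions int) {
	seenSetBit := false
	for i := 0; i < 256; i++ {
		if i > 0 {
			doublings++
		}
		if q.Bit(255-i) != 0 {
			if seenSetBit {
				additions++
			}
			seenSetBit = true
		}
	}
	return doublings, additions
}

// countHex is a convenience wrapper taking the scalar as a hex string.
func countHex(hexScalar string) (int, int) {
	q, ok := new(big.Int).SetString(hexScalar, 16)
	if !ok {
		panic("invalid hex scalar")
	}
	return countBinaryOps(q)
}

func main() {
	// An all-ones 256-bit value is not a valid scalar (it exceeds the group
	// order) but gives the worst case for the count.
	d, a := countHex(strings.Repeat("f", 64))
	fmt.Println("all bits set:", d, "doublings,", a, "additions")
	d, a = countHex("1")
	fmt.Println("scalar 1:", d, "doublings,", a, "additions")
}
```

For the all-ones value this reports 255 doublings and 255 additions; for the scalar 1 it still reports 255 doublings, because the loop doubles the point at infinity until it reaches the only set bit.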
Why NextP256K is Faster
NextP256K uses libsecp256k1's optimized C implementation (secp256k1_ecmult_const) which:
- Uses GLV Endomorphism:
  - Splits the scalar into two smaller components using the curve's endomorphism
  - Computes two smaller multiplications instead of one large one
  - Reduces the effective bit length from 256 to ~128 bits per component
- Windowed Precomputation:
  - Precomputes a table of multiples of the base point
  - Uses windowed lookups instead of processing bits one at a time
  - Processes multiple bits per iteration (typically 4-6 bits at a time)
- Signed-Digit Multi-Comb Algorithm:
  - Uses a more efficient representation that reduces the number of additions
  - Minimizes the number of point operations required
- Assembly Optimizations:
  - Field arithmetic operations are optimized in assembly
  - Hand-tuned for specific CPU architectures
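To make the windowed-precomputation idea concrete, here is a minimal sketch of fixed-window scalar multiplication compared against plain double-and-add. It is not the libsecp256k1 or p256k1 code: it assumes affine coordinates with `math/big`, is variable-time, uses a fixed window of 4 bits, and leaves out GLV and signed digits. All names (`point`, `add`, `mulBinary`, `mulWindowed`, `groupOps`) are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"math/big"
)

// point is an affine secp256k1 point; a nil x encodes the point at infinity.
type point struct{ x, y *big.Int }

func hexInt(s string) *big.Int {
	v, _ := new(big.Int).SetString(s, 16) // literals below are known-good hex
	return v
}

var (
	// Field prime p = 2^256 - 2^32 - 977 and the standard generator G.
	p = new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), 256), big.NewInt(1<<32+977))
	G = point{
		hexInt("79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798"),
		hexInt("483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8"),
	}
	groupOps int // number of non-trivial point additions and doublings
)

// add returns a+b on y^2 = x^3 + 7 using variable-time affine formulas.
func add(a, b point) point {
	if a.x == nil {
		return b
	}
	if b.x == nil {
		return a
	}
	num, den := new(big.Int), new(big.Int)
	if a.x.Cmp(b.x) != 0 { // chord slope (by - ay) / (bx - ax)
		num.Sub(b.y, a.y)
		den.Sub(b.x, a.x)
	} else if a.y.Cmp(b.y) == 0 && a.y.Sign() != 0 { // tangent slope 3x^2 / 2y
		num.Mul(a.x, a.x).Mul(num, big.NewInt(3))
		den.Lsh(a.y, 1)
	} else {
		return point{} // a == -b
	}
	groupOps++
	lam := num.Mul(num, den.ModInverse(den.Mod(den, p), p))
	lam.Mod(lam, p)
	x := new(big.Int).Mul(lam, lam)
	x.Sub(x, a.x).Sub(x, b.x).Mod(x, p)
	y := new(big.Int).Sub(a.x, x)
	y.Mul(y, lam).Sub(y, a.y).Mod(y, p)
	return point{x, y}
}

// mulBinary is the plain double-and-add method, one scalar bit per step.
func mulBinary(k *big.Int, P point) point {
	r := point{}
	for i := k.BitLen() - 1; i >= 0; i-- {
		r = add(r, r)
		if k.Bit(i) == 1 {
			r = add(r, P)
		}
	}
	return r
}

const w = 4 // window width in bits
// mulWindowed precomputes 0*P..15*P, then consumes w scalar bits per step:
// w doublings followed by a single table addition.
func mulWindowed(k *big.Int, P point) point {
	var table [1 << w]point // table[d] = d*P; table[0] is infinity
	for d := 1; d < len(table); d++ {
		table[d] = add(table[d-1], P)
	}
	r := point{}
	for i := (k.BitLen()+w-1)/w - 1; i >= 0; i-- {
		digit := 0
		for j := w - 1; j >= 0; j-- {
			r = add(r, r)
			digit = digit<<1 | int(k.Bit(i*w+j))
		}
		r = add(r, table[digit])
	}
	return r
}

func equal(a, b point) bool {
	if a.x == nil || b.x == nil {
		return a.x == nil && b.x == nil
	}
	return a.x.Cmp(b.x) == 0 && a.y.Cmp(b.y) == 0
}

func main() {
	k := hexInt("deadbeefcafebabe0123456789abcdef0fedcba9876543210123456789abcdef")
	a := mulBinary(k, G)
	binOps := groupOps
	b := mulWindowed(k, G)
	fmt.Println("results match:", equal(a, b))
	fmt.Println("group operations: binary", binOps, "windowed", groupOps-binOps)
}
```

Note that the window does not reduce the number of doublings, which stays close to the scalar's bit length; the saving comes from replacing one addition per set bit with one addition per window.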
Reference Implementation
The C reference shows the complexity:
static void secp256k1_ecmult_const(secp256k1_gej *r, const secp256k1_ge *a, const secp256k1_scalar *q) {
    /* The approach below combines the signed-digit logic from Mike Hamburg's
     * "Fast and compact elliptic-curve cryptography" (https://eprint.iacr.org/2012/309)
     * Section 3.3, with the GLV endomorphism.
     * ... */

    /* Precompute table for base point and lambda * base point */

    /* Process bits in groups using windowed lookups */
    for (group = ECMULT_CONST_GROUPS - 1; group >= 0; --group) {
        /* Lookup precomputed points */
        ECMULT_CONST_TABLE_GET_GE(&t, pre_a, bits1);
        /* ... */
    }
}
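The "lambda * base point" table in the reference relies on the secp256k1 endomorphism: multiplying a point by a particular scalar lambda has the same effect as multiplying its x coordinate by a field constant beta. The Go sketch below checks that property numerically. It is an illustration under stated assumptions, not the library's code: it derives the cube roots of unity at run time instead of using libsecp256k1's hardcoded constants, uses slow affine `math/big` arithmetic, and every function name in it is hypothetical.

```go
package main

import (
	"fmt"
	"math/big"
)

// point is an affine secp256k1 point; a nil x encodes the point at infinity.
type point struct{ x, y *big.Int }

func hexInt(s string) *big.Int {
	v, _ := new(big.Int).SetString(s, 16) // literals below are known-good hex
	return v
}

var (
	// Field prime p = 2^256 - 2^32 - 977, group order n, and generator G.
	p = new(big.Int).Sub(new(big.Int).Lsh(big.NewInt(1), 256), big.NewInt(1<<32+977))
	n = hexInt("ffffffff" + "ffffffff" + "ffffffff" + "fffffffe" +
		"baaedce6" + "af48a03b" + "bfd25e8c" + "d0364141")
	G = point{
		hexInt("79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798"),
		hexInt("483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8"),
	}
)

// add returns a+b on y^2 = x^3 + 7 using variable-time affine formulas.
func add(a, b point) point {
	if a.x == nil {
		return b
	}
	if b.x == nil {
		return a
	}
	num, den := new(big.Int), new(big.Int)
	if a.x.Cmp(b.x) != 0 { // chord slope (by - ay) / (bx - ax)
		num.Sub(b.y, a.y)
		den.Sub(b.x, a.x)
	} else if a.y.Cmp(b.y) == 0 && a.y.Sign() != 0 { // tangent slope 3x^2 / 2y
		num.Mul(a.x, a.x).Mul(num, big.NewInt(3))
		den.Lsh(a.y, 1)
	} else {
		return point{} // a == -b
	}
	lam := num.Mul(num, den.ModInverse(den.Mod(den, p), p))
	lam.Mod(lam, p)
	x := new(big.Int).Mul(lam, lam)
	x.Sub(x, a.x).Sub(x, b.x).Mod(x, p)
	y := new(big.Int).Sub(a.x, x)
	y.Mul(y, lam).Sub(y, a.y).Mod(y, p)
	return point{x, y}
}

// mul is plain double-and-add, used here only as the slow reference.
func mul(k *big.Int, P point) point {
	r := point{}
	for i := k.BitLen() - 1; i >= 0; i-- {
		r = add(r, r)
		if k.Bit(i) == 1 {
			r = add(r, P)
		}
	}
	return r
}

// cubeRootOfUnity returns a non-trivial cube root of 1 modulo the prime m.
// One exists whenever m ≡ 1 (mod 3), which holds for both p and n.
func cubeRootOfUnity(m *big.Int) *big.Int {
	one := big.NewInt(1)
	e := new(big.Int).Sub(m, one)
	e.Div(e, big.NewInt(3))
	for g := int64(2); ; g++ {
		if r := new(big.Int).Exp(big.NewInt(g), e, m); r.Cmp(one) != 0 {
			return r
		}
	}
}

// endomorphismHolds checks that lambda*G equals (beta*x, y) for matching roots.
// The two roots are found independently, so lambda may pair with beta or beta^2.
func endomorphismHolds() bool {
	beta, lambda := cubeRootOfUnity(p), cubeRootOfUnity(n)
	q := mul(lambda, G)               // expensive: full scalar multiplication
	x1 := new(big.Int).Mul(beta, G.x) // cheap: one field multiplication
	x1.Mod(x1, p)
	x2 := new(big.Int).Mul(beta, x1)
	x2.Mod(x2, p)
	return q.x != nil && q.y.Cmp(G.y) == 0 && (q.x.Cmp(x1) == 0 || q.x.Cmp(x2) == 0)
}

func main() {
	fmt.Println("lambda*G == (beta*x, y):", endomorphismHolds())
}
```

This is what makes the second precomputed table nearly free: once the multiples of the base point are known, the matching multiples of lambda times the base point follow from one field multiplication each.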
Performance Impact
Benchmark Results
| Operation | P256K1 | NextP256K | Speedup |
|---|---|---|---|
| Verification | 186,054 ns/op | 40,017 ns/op | 4.7x |
| Signing | 31,937 ns/op | 52,060 ns/op | 0.6x (slower) |
| Pubkey Derivation | 58,618 ns/op | 280,835 ns/op | 0.2x (slower) |
Note: NextP256K is slower for signing and pubkey derivation, likely because CGO call overhead weighs more heavily on these smaller operations, but much faster for verification, where the variable-point multiplication e*P dominates the cost.
Optimization Opportunities
To improve p256k1's verification performance, EcmultConst should be optimized to:
- Implement GLV Endomorphism:
  - Split scalar using secp256k1's endomorphism
  - Compute two smaller multiplications
  - Combine results
- Add Windowed Precomputation:
  - Precompute a table of multiples of the base point
  - Process bits in groups (windows) instead of individually
  - Use lookup tables instead of repeated additions
- Consider Variable-Time Optimization:
  - For verification (public operation), variable-time algorithms are acceptable
  - Could use Ecmult instead of EcmultConst if constant-time isn't required
- Implement Signed-Digit Representation:
  - Use signed-digit multi-comb algorithm
  - Reduce the number of additions required
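As a small illustration of why signed digits reduce additions, the sketch below recodes a scalar into non-adjacent form (NAF), where each digit is -1, 0, or +1. NAF is a simpler signed-digit scheme swapped in here for clarity; it is not the signed-digit multi-comb algorithm named above, and `naf`, `nafHex`, and `nonZero` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/big"
)

// naf returns the non-adjacent form of k >= 0, least significant digit first.
// Every digit is -1, 0 or +1, and no two neighbouring digits are both non-zero.
func naf(k *big.Int) []int {
	k = new(big.Int).Set(k) // work on a copy
	var digits []int
	for k.Sign() > 0 {
		d := 0
		if k.Bit(0) == 1 {
			d = 2 - int(k.Bit(0)|k.Bit(1)<<1) // k mod 4 == 1 -> +1, == 3 -> -1
			k.Sub(k, big.NewInt(int64(d)))
		}
		digits = append(digits, d)
		k.Rsh(k, 1)
	}
	return digits
}

// nafHex is a convenience wrapper that parses a hex scalar first.
func nafHex(s string) []int {
	k, ok := new(big.Int).SetString(s, 16)
	if !ok {
		panic("invalid hex scalar")
	}
	return naf(k)
}

// nonZero counts the digits that would each cost one point addition.
func nonZero(digits []int) int {
	c := 0
	for _, d := range digits {
		if d != 0 {
			c++
		}
	}
	return c
}

func main() {
	fmt.Println(nafHex("7")) // prints [-1 0 0 1], i.e. 7 = 8 - 1
	fmt.Println("0xff: binary ones = 8, NAF non-zero digits =", nonZero(nafHex("ff")))
}
```

Each non-zero digit costs one point addition or subtraction, and negating a point is essentially free, so 0xff needs 2 additions in this form instead of the 8 its binary expansion would suggest.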
Complexity Comparison
Current (Simple Binary Method)
- Operations: 255 doublings + up to 255 additions (about 128 on average)
- Complexity: ~383 point operations on average, ~510 in the worst case
Optimized (Windowed + GLV)
- Operations: ~128 doublings + ~64 additions (two ~128-bit half-scalars processed together with window size 4), plus a small table precomputation
- Complexity: ~200 point operations (roughly 2x fewer than the average case above)
With Assembly Optimizations
- Additional: 2-3x speedup from optimized field arithmetic
- Total: ~4-6x faster than simple binary method
Conclusion
The 4.7x performance difference is primarily due to:
- Algorithmic efficiency: Windowed multiplication vs. simple binary method
- GLV endomorphism: Splitting scalar into smaller components
- Assembly optimizations: Hand-tuned field arithmetic in the C library
- Better memory access patterns: Precomputed tables vs. repeated computations
The optimization is non-trivial and would require implementing:
- GLV endomorphism support
- Windowed precomputation tables
- Signed-digit multi-comb algorithm
- Potentially assembly optimizations for field arithmetic
For now, NextP256K's advantage in verification is expected given its use of the mature, highly optimized libsecp256k1 C library.