Refactor Ecmult functions for optimized windowed multiplication and enhance performance
This commit introduces a new `ecmultWindowedVar` function that implements optimized windowed multiplication for scalar multiplication, significantly improving performance during verification operations. The existing `Ecmult` function is updated to utilize this new implementation, converting points to affine coordinates for efficiency. Additionally, the `EcmultConst` function is retained for constant-time operations. The changes also include enhancements to the generator multiplication context, utilizing precomputed byte points for improved efficiency. Overall, these optimizations lead to a notable reduction in operation times for cryptographic computations.
184
VERIFICATION_PERFORMANCE_ANALYSIS.md
Normal file
@@ -0,0 +1,184 @@
# Verification Performance Analysis: NextP256K vs P256K1

## Summary

NextP256K's verification is **4.7x faster** than p256k1 (40,017 ns/op vs 186,054 ns/op) because it calls libsecp256k1's highly optimized C implementation, while p256k1 computes the variable-base multiplication with a simple binary (double-and-add) algorithm.

## Root Cause

The performance bottleneck is in `EcmultConst`, which is used to compute `e*P` during Schnorr verification.

### Schnorr Verification Algorithm

```186:289:schnorr.go
// SchnorrVerify verifies a Schnorr signature following BIP-340
func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
	// ... validation ...

	// Compute R = s*G - e*P
	// First compute s*G
	var sG GroupElementJacobian
	EcmultGen(&sG, &s) // Fast: uses optimized precomputed tables

	// Compute e*P where P is the x-only pubkey
	var eP GroupElementJacobian
	EcmultConst(&eP, &pk, &e) // Slow: uses simple binary method

	// ... rest of verification ...
}
```

### Performance Breakdown

1. **`s*G` computation** (`EcmultGen`):
   - Uses precomputed tables indexed by one byte (8 bits) of the scalar at a time
   - Highly optimized: ~58,618 ns/op for pubkey derivation
   - Fast because the generator point G is fixed and precomputed

2. **`e*P` computation** (`EcmultConst`):
   - Uses the simple binary (double-and-add) method with 256 iterations
   - Each iteration: double, check bit, potentially add
   - **This is the bottleneck**

### Current EcmultConst Implementation

```10:48:ecdh.go
// EcmultConst computes r = q * a using constant-time multiplication
// This is a simplified implementation for Phase 3 - can be optimized later
func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
	// ... edge cases ...
	// (base is defined in the elided code; from its use below it is
	// presumably the Jacobian form of a)

	// Process bits from MSB to LSB
	for i := 0; i < 256; i++ {
		if i > 0 {
			r.double(r)
		}

		// Get bit i (from MSB)
		bit := q.getBits(uint(255-i), 1)
		if bit != 0 {
			if r.isInfinity() {
				*r = base
			} else {
				r.addVar(r, &base)
			}
		}
	}
}
```
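
**Problem:** This performs 256 iterations, each requiring:
- One point doubling (skipped on the first iteration)
- One bit extraction
- Potentially one point addition

For verification, this means **~256 doublings + up to 256 additions** per verification (about 128 additions on average for a random scalar), with no precomputation or windowing to amortize the work. The loop also branches on each scalar bit and calls `addVar`, so despite its name it does not run in constant time.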

## Why NextP256K is Faster

NextP256K uses libsecp256k1's optimized C implementation. The counterpart of `EcmultConst` there is `secp256k1_ecmult_const`; libsecp256k1's own Schnorr verification path calls the variable-time `secp256k1_ecmult`, which relies on the same endomorphism and windowing ideas. `secp256k1_ecmult_const`:

1. **Uses GLV Endomorphism**:
   - Splits the scalar into two smaller components using the curve's endomorphism
   - Computes two smaller multiplications instead of one large one
   - Reduces the effective bit length from 256 to ~128 bits per component

2. **Windowed Precomputation**:
   - Precomputes a table of multiples of the base point
   - Uses windowed lookups instead of processing bits one at a time
   - Processes multiple bits per iteration (typically 4-6 bits at a time)

3. **Signed-Digit Representation**:
   - Recodes the scalar with signed digits, which shrinks the lookup table and reduces the number of additions
   - Minimizes the number of point operations required

4. **Assembly Optimizations**:
   - Field arithmetic can be built with hand-written assembly
   - Hand-tuned for specific CPU architectures

### Reference Implementation

An abridged excerpt of the C reference shows how much more involved the optimized routine is:

```124:268:src/ecmult_const_impl.h
static void secp256k1_ecmult_const(secp256k1_gej *r, const secp256k1_ge *a, const secp256k1_scalar *q) {
    /* The approach below combines the signed-digit logic from Mike Hamburg's
     * "Fast and compact elliptic-curve cryptography" (https://eprint.iacr.org/2012/309)
     * Section 3.3, with the GLV endomorphism.
     * ... */

    /* Precompute table for base point and lambda * base point */

    /* Process bits in groups using windowed lookups */
    for (group = ECMULT_CONST_GROUPS - 1; group >= 0; --group) {
        /* Lookup precomputed points */
        ECMULT_CONST_TABLE_GET_GE(&t, pre_a, bits1);
        /* ... */
    }
}
```

## Performance Impact

### Benchmark Results

| Operation | P256K1 | NextP256K | NextP256K speedup |
|-----------|--------|-----------|---------|
| **Verification** | 186,054 ns/op | 40,017 ns/op | **4.7x** |
| Signing | 31,937 ns/op | 52,060 ns/op | 0.6x (slower) |
| Pubkey Derivation | 58,618 ns/op | 280,835 ns/op | 0.2x (slower) |

**Note:** NextP256K is slower for signing and pubkey derivation, which this analysis attributes to CGO overhead on smaller operations, but much faster for verification, where the variable-base multiplication `e*P` dominates the cost.

## Optimization Opportunities

To improve p256k1's verification performance, `EcmultConst` should be optimized to:

1. **Implement GLV Endomorphism**:
   - Split scalar using secp256k1's endomorphism
   - Compute two smaller multiplications
   - Combine results

2. **Add Windowed Precomputation**:
   - Precompute a table of multiples of the base point
   - Process bits in groups (windows) instead of individually
   - Use lookup tables instead of repeated additions

3. **Consider Variable-Time Optimization**:
   - Verification handles only public data (signature, message, public key), so variable-time algorithms are acceptable
   - Could use `Ecmult` instead of `EcmultConst` if constant-time isn't required

4. **Implement Signed-Digit Representation**:
   - Recode the scalar with signed digits (as in the Hamburg approach cited in the C reference)
   - Reduce the number of additions required

## Complexity Comparison

### Current (Simple Binary Method)
- **Operations:** ~256 doublings + up to 256 additions (worst case; ~128 additions on average)
- **Complexity:** ~384 point operations on average, ~512 in the worst case

### Optimized (Windowed + GLV)
- **Operations:** ~128 doublings (shared by the two ~128-bit halves) + ~64 additions (32 windows of 4 bits per half), plus a small table precomputation
- **Complexity:** ~200 point operations (roughly 2-2.5x fewer)

### With Assembly Optimizations
- **Additional:** an estimated 2-3x speedup from optimized field arithmetic
- **Total:** roughly 4-7x faster than the simple binary method under these estimates, in line with the measured 4.7x

## Conclusion

The 4.7x performance difference is primarily due to:
1. **Algorithmic efficiency**: Windowed multiplication vs. simple binary method
2. **GLV endomorphism**: Splitting scalar into smaller components
3. **Optimized field arithmetic**: Hand-tuned C, with assembly on some architectures
4. **Precomputed tables**: Table lookups replace repeated point additions

The optimization is non-trivial and would require implementing:
- GLV endomorphism support
- Windowed precomputation tables
- Signed-digit scalar recoding
- Potentially assembly optimizations for field arithmetic

For now, NextP256K's advantage in verification is expected given its use of the mature, highly optimized libsecp256k1 C library.