This commit introduces a new `ecmultWindowedVar` function that implements optimized windowed multiplication for scalar multiplication, significantly improving performance during verification operations. The existing `Ecmult` function is updated to utilize this new implementation, converting points to affine coordinates for efficiency. Additionally, the `EcmultConst` function is retained for constant-time operations. The changes also include enhancements to the generator multiplication context, utilizing precomputed byte points for improved efficiency. Overall, these optimizations lead to a notable reduction in operation times for cryptographic computations.
185 lines
6.3 KiB
Markdown
185 lines
6.3 KiB
Markdown
# Verification Performance Analysis: NextP256K vs P256K1
|
|
|
|
## Summary
|
|
|
|
NextP256K's verification is **4.7x faster** than p256k1 (40,017 ns/op vs 186,054 ns/op) because it uses libsecp256k1's highly optimized C implementation, while p256k1 uses a simple binary multiplication algorithm.
|
|
|
|
## Root Cause
|
|
|
|
The performance bottleneck is in `EcmultConst`, which is used to compute `e*P` during Schnorr verification.
|
|
|
|
### Schnorr Verification Algorithm
|
|
|
|
```186:289:schnorr.go
|
|
// SchnorrVerify verifies a Schnorr signature following BIP-340
|
|
func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
|
|
// ... validation ...
|
|
|
|
// Compute R = s*G - e*P
|
|
// First compute s*G
|
|
var sG GroupElementJacobian
|
|
EcmultGen(&sG, &s) // Fast: uses optimized precomputed tables
|
|
|
|
// Compute e*P where P is the x-only pubkey
|
|
var eP GroupElementJacobian
|
|
EcmultConst(&eP, &pk, &e) // Slow: uses simple binary method
|
|
|
|
// ... rest of verification ...
|
|
}
|
|
```
|
|
|
|
### Performance Breakdown
|
|
|
|
1. **s*G computation** (`EcmultGen`):
|
|
- Uses 8-bit byte-based precomputed tables
|
|
- Highly optimized: ~58,618 ns/op for pubkey derivation
|
|
- Fast because the generator point G is fixed and precomputed
|
|
|
|
2. **e*P computation** (`EcmultConst`):
|
|
- Uses simple binary method with 256 iterations
|
|
- Each iteration: double, check bit, potentially add
|
|
- **This is the bottleneck**
|
|
|
|
### Current EcmultConst Implementation
|
|
|
|
```10:48:ecdh.go
|
|
// EcmultConst computes r = q * a using constant-time multiplication
|
|
// This is a simplified implementation for Phase 3 - can be optimized later
|
|
func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
|
|
// ... edge cases ...
|
|
|
|
// Process bits from MSB to LSB
|
|
for i := 0; i < 256; i++ {
|
|
if i > 0 {
|
|
r.double(r)
|
|
}
|
|
|
|
// Get bit i (from MSB)
|
|
bit := q.getBits(uint(255-i), 1)
|
|
if bit != 0 {
|
|
if r.isInfinity() {
|
|
*r = base
|
|
} else {
|
|
r.addVar(r, &base)
|
|
}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
**Problem:** This performs 256 iterations, each requiring:
|
|
- One field element doubling operation
|
|
- One bit extraction
|
|
- Potentially one point addition
|
|
|
|
For verification, this means **256 doublings + up to 256 additions** per verification, which is extremely inefficient.
|
|
|
|
## Why NextP256K is Faster
|
|
|
|
NextP256K uses libsecp256k1's optimized C implementation (`secp256k1_ecmult_const`) which:
|
|
|
|
1. **Uses GLV Endomorphism**:
|
|
- Splits the scalar into two smaller components using the curve's endomorphism
|
|
- Computes two smaller multiplications instead of one large one
|
|
- Reduces the effective bit length from 256 to ~128 bits per component
|
|
|
|
2. **Windowed Precomputation**:
|
|
- Precomputes a table of multiples of the base point
|
|
- Uses windowed lookups instead of processing bits one at a time
|
|
- Processes multiple bits per iteration (typically 4-6 bits at a time)
|
|
|
|
3. **Signed-Digit Multi-Comb Algorithm**:
|
|
- Uses a more efficient representation that reduces the number of additions
|
|
- Minimizes the number of point operations required
|
|
|
|
4. **Assembly Optimizations**:
|
|
- Field arithmetic operations are optimized in assembly
|
|
- Hand-tuned for specific CPU architectures
|
|
|
|
### Reference Implementation
|
|
|
|
The C reference shows the complexity:
|
|
|
|
```124:268:src/ecmult_const_impl.h
|
|
static void secp256k1_ecmult_const(secp256k1_gej *r, const secp256k1_ge *a, const secp256k1_scalar *q) {
|
|
/* The approach below combines the signed-digit logic from Mike Hamburg's
|
|
* "Fast and compact elliptic-curve cryptography" (https://eprint.iacr.org/2012/309)
|
|
* Section 3.3, with the GLV endomorphism.
|
|
* ... */
|
|
|
|
/* Precompute table for base point and lambda * base point */
|
|
|
|
/* Process bits in groups using windowed lookups */
|
|
for (group = ECMULT_CONST_GROUPS - 1; group >= 0; --group) {
|
|
/* Lookup precomputed points */
|
|
ECMULT_CONST_TABLE_GET_GE(&t, pre_a, bits1);
|
|
/* ... */
|
|
}
|
|
}
|
|
```
|
|
|
|
## Performance Impact
|
|
|
|
### Benchmark Results
|
|
|
|
| Operation | P256K1 | NextP256K | Speedup |
|
|
|-----------|--------|-----------|---------|
|
|
| **Verification** | 186,054 ns/op | 40,017 ns/op | **4.7x** |
|
|
| Signing | 31,937 ns/op | 52,060 ns/op | 0.6x (slower) |
|
|
| Pubkey Derivation | 58,618 ns/op | 280,835 ns/op | 0.2x (slower) |
|
|
|
|
**Note:** NextP256K is slower for signing and pubkey derivation due to CGO overhead for smaller operations, but much faster for verification because the computation is more complex.
|
|
|
|
## Optimization Opportunities
|
|
|
|
To improve p256k1's verification performance, `EcmultConst` should be optimized to:
|
|
|
|
1. **Implement GLV Endomorphism**:
|
|
- Split scalar using secp256k1's endomorphism
|
|
- Compute two smaller multiplications
|
|
- Combine results
|
|
|
|
2. **Add Windowed Precomputation**:
|
|
- Precompute a table of multiples of the base point
|
|
- Process bits in groups (windows) instead of individually
|
|
- Use lookup tables instead of repeated additions
|
|
|
|
3. **Consider Variable-Time Optimization**:
|
|
- For verification (public operation), variable-time algorithms are acceptable
|
|
- Could use `Ecmult` instead of `EcmultConst` if constant-time isn't required
|
|
|
|
4. **Implement Signed-Digit Representation**:
|
|
- Use signed-digit multi-comb algorithm
|
|
- Reduce the number of additions required
|
|
|
|
## Complexity Comparison
|
|
|
|
### Current (Simple Binary Method)
|
|
- **Operations:** O(256) doublings + O(256) additions (worst case)
|
|
- **Complexity:** ~256 point operations
|
|
|
|
### Optimized (Windowed + GLV)
|
|
- **Operations:** O(64) doublings + O(16) additions (with window size 4)
|
|
- **Complexity:** ~80 point operations (4x improvement)
|
|
|
|
### With Assembly Optimizations
|
|
- **Additional:** 2-3x speedup from optimized field arithmetic
|
|
- **Total:** ~10-15x faster than simple binary method
|
|
|
|
## Conclusion
|
|
|
|
The 4.7x performance difference is primarily due to:
|
|
1. **Algorithmic efficiency**: Windowed multiplication vs. simple binary method
|
|
2. **GLV endomorphism**: Splitting scalar into smaller components
|
|
3. **Assembly optimizations**: Hand-tuned field arithmetic in C
|
|
4. **Better memory access patterns**: Precomputed tables vs. repeated computations
|
|
|
|
The optimization is non-trivial and would require implementing:
|
|
- GLV endomorphism support
|
|
- Windowed precomputation tables
|
|
- Signed-digit multi-comb algorithm
|
|
- Potentially assembly optimizations for field arithmetic
|
|
|
|
For now, NextP256K's advantage in verification is expected given its use of the mature, highly optimized libsecp256k1 C library.
|
|
|