Refactor Ecmult functions for optimized windowed multiplication and enhance performance

This commit introduces a new `ecmultWindowedVar` function that implements optimized windowed multiplication for scalar multiplication, significantly improving performance during verification operations. The existing `Ecmult` function is updated to utilize this new implementation, converting points to affine coordinates for efficiency. Additionally, the `EcmultConst` function is retained for constant-time operations. The changes also include enhancements to the generator multiplication context, utilizing precomputed byte points for improved efficiency. Overall, these optimizations lead to a notable reduction in operation times for cryptographic computations.
2025-11-01 21:39:36 +00:00
parent f259c9a2e1
commit f2ddcfacbb
6 changed files with 481 additions and 100 deletions
--- a/VERIFICATION_PERFORMANCE_ANALYSIS.md
+++ b/VERIFICATION_PERFORMANCE_ANALYSIS.md
@@ -0,0 +1,184 @@
 # Verification Performance Analysis: NextP256K vs P256K1
 ## Summary
 NextP256K's verification is **4.7x faster** than p256k1 (40,017 ns/op vs 186,054 ns/op) because it uses libsecp256k1's highly optimized C implementation, while p256k1 uses a simple binary multiplication algorithm.
 ## Root Cause
 The performance bottleneck is in `EcmultConst`, which is used to compute `e*P` during Schnorr verification.
 ### Schnorr Verification Algorithm
 ```186:289:schnorr.go
 // SchnorrVerify verifies a Schnorr signature following BIP-340
 func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
 	// ... validation ...
 	// Compute R = s*G - e*P
 	// First compute s*G
 	var sG GroupElementJacobian
 	EcmultGen(&sG, &s)  // Fast: uses optimized precomputed tables
 	// Compute e*P where P is the x-only pubkey
 	var eP GroupElementJacobian
 	EcmultConst(&eP, &pk, &e)  // Slow: uses simple binary method
 	// ... rest of verification ...
 }
 ```
 ### Performance Breakdown
 1. **s*G computation** (`EcmultGen`):
   - Uses 8-bit byte-based precomputed tables
   - Highly optimized: ~58,618 ns/op for pubkey derivation
   - Fast because the generator point G is fixed and precomputed
 2. **e*P computation** (`EcmultConst`):
   - Uses simple binary method with 256 iterations
   - Each iteration: double, check bit, potentially add
   - **This is the bottleneck**
 ### Current EcmultConst Implementation
 ```10:48:ecdh.go
 // EcmultConst computes r = q * a using constant-time multiplication
 // This is a simplified implementation for Phase 3 - can be optimized later
 func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
 	// ... edge cases ...
 	// Process bits from MSB to LSB
 	for i := 0; i < 256; i++ {
 		if i > 0 {
 			r.double(r)
 		}
 		// Get bit i (from MSB)
 		bit := q.getBits(uint(255-i), 1)
 		if bit != 0 {
 			if r.isInfinity() {
 				*r = base
 			} else {
 				r.addVar(r, &base)
 			}
 		}
 	}
 }
 ```
 **Problem:** This performs 256 iterations, each requiring:
 - One field element doubling operation
 - One bit extraction
 - Potentially one point addition
 For verification, this means **256 doublings + up to 256 additions** per verification, which is extremely inefficient.
 ## Why NextP256K is Faster
 NextP256K uses libsecp256k1's optimized C implementation (`secp256k1_ecmult_const`) which:
 1. **Uses GLV Endomorphism**:
   - Splits the scalar into two smaller components using the curve's endomorphism
   - Computes two smaller multiplications instead of one large one
   - Reduces the effective bit length from 256 to ~128 bits per component
 2. **Windowed Precomputation**:
   - Precomputes a table of multiples of the base point
   - Uses windowed lookups instead of processing bits one at a time
   - Processes multiple bits per iteration (typically 4-6 bits at a time)
 3. **Signed-Digit Multi-Comb Algorithm**:
   - Uses a more efficient representation that reduces the number of additions
   - Minimizes the number of point operations required
 4. **Assembly Optimizations**:
   - Field arithmetic operations are optimized in assembly
   - Hand-tuned for specific CPU architectures
 ### Reference Implementation
 The C reference shows the complexity:
 ```124:268:src/ecmult_const_impl.h
 static void secp256k1_ecmult_const(secp256k1_gej *r, const secp256k1_ge *a, const secp256k1_scalar *q) {
    /* The approach below combines the signed-digit logic from Mike Hamburg's
     * "Fast and compact elliptic-curve cryptography" (https://eprint.iacr.org/2012/309)
     * Section 3.3, with the GLV endomorphism.
     * ... */
    /* Precompute table for base point and lambda * base point */
    /* Process bits in groups using windowed lookups */
    for (group = ECMULT_CONST_GROUPS - 1; group >= 0; --group) {
        /* Lookup precomputed points */
        ECMULT_CONST_TABLE_GET_GE(&t, pre_a, bits1);
        /* ... */
    }
 }
 ```
 ## Performance Impact
 ### Benchmark Results
 | Operation | P256K1 | NextP256K | Speedup |
 |-----------|--------|-----------|---------|
 | **Verification** | 186,054 ns/op | 40,017 ns/op | **4.7x** |
 | Signing | 31,937 ns/op | 52,060 ns/op | 0.6x (slower) |
 | Pubkey Derivation | 58,618 ns/op | 280,835 ns/op | 0.2x (slower) |
 **Note:** NextP256K is slower for signing and pubkey derivation due to CGO overhead for smaller operations, but much faster for verification because the computation is more complex.
 ## Optimization Opportunities
 To improve p256k1's verification performance, `EcmultConst` should be optimized to:
 1. **Implement GLV Endomorphism**:
   - Split scalar using secp256k1's endomorphism
   - Compute two smaller multiplications
   - Combine results
 2. **Add Windowed Precomputation**:
   - Precompute a table of multiples of the base point
   - Process bits in groups (windows) instead of individually
   - Use lookup tables instead of repeated additions
 3. **Consider Variable-Time Optimization**:
   - For verification (public operation), variable-time algorithms are acceptable
   - Could use `Ecmult` instead of `EcmultConst` if constant-time isn't required
 4. **Implement Signed-Digit Representation**:
   - Use signed-digit multi-comb algorithm
   - Reduce the number of additions required
 ## Complexity Comparison
 ### Current (Simple Binary Method)
 - **Operations:** O(256) doublings + O(256) additions (worst case)
 - **Complexity:** ~256 point operations
 ### Optimized (Windowed + GLV)
 - **Operations:** O(64) doublings + O(16) additions (with window size 4)
 - **Complexity:** ~80 point operations (4x improvement)
 ### With Assembly Optimizations
 - **Additional:** 2-3x speedup from optimized field arithmetic
 - **Total:** ~10-15x faster than simple binary method
 ## Conclusion
 The 4.7x performance difference is primarily due to:
 1. **Algorithmic efficiency**: Windowed multiplication vs. simple binary method
 2. **GLV endomorphism**: Splitting scalar into smaller components
 3. **Assembly optimizations**: Hand-tuned field arithmetic in C
 4. **Better memory access patterns**: Precomputed tables vs. repeated computations
 The optimization is non-trivial and would require implementing:
 - GLV endomorphism support
 - Windowed precomputation tables
 - Signed-digit multi-comb algorithm
 - Potentially assembly optimizations for field arithmetic
 For now, NextP256K's advantage in verification is expected given its use of the mature, highly optimized libsecp256k1 C library.
--- a/bench/BENCHMARK_REPORT.md
+++ b/bench/BENCHMARK_REPORT.md
@@ -8,21 +8,25 @@ This report compares three signer implementations for secp256k1 operations:
 2. **BtcecSigner** - Pure Go wrapper around btcec/v2
 3. **NextP256K Signer** - CGO version using next.orly.dev/pkg/crypto/p256k (CGO bindings to libsecp256k1)
-**Generated:** 2025-11-01  
+**Generated:** 2025-11-01 (Updated after optimized windowed multiplication for verification)  
 **Platform:** linux/amd64  
 **CPU:** AMD Ryzen 5 PRO 4650G with Radeon Graphics  
 **Go Version:** go1.25.3
 **Key Optimizations:** 
 - Implemented 8-bit byte-based precomputed tables matching btcec's approach, resulting in 4x improvement in pubkey derivation and 4.3x improvement in signing.
 - Optimized windowed multiplication for verification (5-bit windows, Jacobian coordinate table building): 19% improvement (186,054 → 150,457 ns/op).
 ---
 ## Summary Results
 | Operation | P256K1Signer | BtcecSigner | NextP256K | Winner |
 |-----------|-------------|-------------|-----------|--------|
-| **Pubkey Derivation** | 232,922 ns/op | 63,317 ns/op | 295,599 ns/op | Btcec (3.7x faster) |
+| **Pubkey Derivation** | 59,056 ns/op | 63,958 ns/op | 269,444 ns/op | P256K1 (8% faster than Btcec) |
-| **Sign** | 136,560 ns/op | 216,808 ns/op | 53,454 ns/op | NextP256K (2.6x faster) |
+| **Sign** | 31,592 ns/op | 219,388 ns/op | 52,233 ns/op | P256K1 (1.7x faster than NextP256K) |
-| **Verify** | 268,771 ns/op | 160,894 ns/op | 38,423 ns/op | NextP256K (7.0x faster) |
+| **Verify** | 150,457 ns/op | 163,867 ns/op | 40,550 ns/op | NextP256K (3.7x faster) |
-| **ECDH** | 158,730 ns/op | 130,804 ns/op | 124,998 ns/op | NextP256K (1.3x faster) |
+| **ECDH** | 163,356 ns/op | 136,329 ns/op | 124,423 ns/op | NextP256K (1.3x faster) |
 ---
@@ -34,12 +38,13 @@ Deriving public key from private key (32 bytes → 32 bytes x-only pubkey).
 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 232,922 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
+| **P256K1Signer** | 59,056 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 63,317 ns/op | 368 B/op | 7 allocs/op | **3.7x faster** |
+| **BtcecSigner** | 63,958 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
-| **NextP256K** | 295,599 ns/op | 983,395 B/op | 9 allocs/op | 0.8x slower |
+| **NextP256K** | 269,444 ns/op | 983,393 B/op | 9 allocs/op | 0.2x slower |
 **Analysis:**
- Btcec is fastest for key derivation (3.7x faster than P256K1)
+- **P256K1 is fastest** (8% faster than Btcec) after implementing 8-bit byte-based precomputed tables
 - Massive improvement: 4x faster than previous implementation (232,922 → 58,618 ns/op)
 - NextP256K is slowest, likely due to CGO overhead for small operations
 - P256K1 has lowest memory allocation overhead
@@ -49,13 +54,13 @@ Creating BIP-340 Schnorr signatures (32-byte message → 64-byte signature).
 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 136,560 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
+| **P256K1Signer** | 31,592 ns/op | 1,152 B/op | 17 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 216,808 ns/op | 2,193 B/op | 38 allocs/op | 0.6x slower |
+| **BtcecSigner** | 219,388 ns/op | 2,193 B/op | 38 allocs/op | 0.1x slower |
-| **NextP256K** | 53,454 ns/op | 128 B/op | 3 allocs/op | **2.6x faster** |
+| **NextP256K** | 52,233 ns/op | 128 B/op | 3 allocs/op | 0.6x slower |
 **Analysis:**
- NextP256K is fastest (2.6x faster than P256K1), benefiting from optimized C implementation
+- **P256K1 is fastest** (1.7x faster than NextP256K), benefiting from optimized pubkey derivation
- P256K1 is second fastest, showing good performance for pure Go
+- NextP256K is second fastest, benefiting from optimized C implementation
 - Btcec is slowest, likely due to more allocations and pure Go overhead
 - NextP256K has lowest memory usage (128 B vs 1,152 B)
@@ -65,14 +70,15 @@ Verifying BIP-340 Schnorr signatures (32-byte message + 64-byte signature).
 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 268,771 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
+| **P256K1Signer** | 150,457 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 160,894 ns/op | 1,120 B/op | 18 allocs/op | 1.7x faster |
+| **BtcecSigner** | 163,867 ns/op | 1,120 B/op | 18 allocs/op | 0.9x slower |
-| **NextP256K** | 38,423 ns/op | 96 B/op | 2 allocs/op | **7.0x faster** |
+| **NextP256K** | 40,550 ns/op | 96 B/op | 2 allocs/op | **3.7x faster** |
 **Analysis:**
- NextP256K is dramatically fastest (7.0x faster), showcasing CGO advantage for verification
+- NextP256K is dramatically fastest (3.7x faster), showcasing CGO advantage for verification
- Btcec is second fastest (1.7x faster than P256K1)
+- **P256K1 is fastest pure Go implementation** (8% faster than Btcec) after optimized windowed multiplication
- P256K1 is slowest but still reasonable for pure Go
+- **19% improvement** over previous implementation (186,054 → 150,457 ns/op)
 - Optimizations: 5-bit windowed multiplication with efficient Jacobian coordinate table building
 - NextP256K has minimal memory footprint (96 B vs 576 B)
 ### ECDH (Shared Secret Generation)
@@ -81,9 +87,9 @@ Generating shared secret using Elliptic Curve Diffie-Hellman.
 | Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
 |----------------|-------------|--------|-------------|-------------------|
-| **P256K1Signer** | 158,730 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
+| **P256K1Signer** | 163,356 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
-| **BtcecSigner** | 130,804 ns/op | 832 B/op | 13 allocs/op | 1.2x faster |
+| **BtcecSigner** | 136,329 ns/op | 832 B/op | 13 allocs/op | 1.2x faster |
-| **NextP256K** | 124,998 ns/op | 160 B/op | 3 allocs/op | **1.3x faster** |
+| **NextP256K** | 124,423 ns/op | 160 B/op | 3 allocs/op | **1.3x faster** |
 **Analysis:**
 - All implementations are relatively close in performance
@@ -95,27 +101,30 @@ Generating shared secret using Elliptic Curve Diffie-Hellman.
 ## Performance Analysis
-### Overall Winner: NextP256K (CGO)
+### Overall Winner: Mixed (P256K1 wins 2/4 operations, NextP256K wins 2/4 operations)
-The CGO-based NextP256K implementation wins in 3 out of 4 operations:
+After optimized windowed multiplication for verification:
- **Signing:** 2.6x faster than P256K1
+- **P256K1Signer** wins in 2 out of 4 operations:
- **Verification:** 7.0x faster than P256K1 (largest advantage)
+  - **Pubkey Derivation:** Fastest (8% faster than Btcec)
- **ECDH:** 1.3x faster than P256K1
+  - **Signing:** Fastest (1.7x faster than NextP256K)
 - **NextP256K** wins in 2 operations:
  - **Verification:** Fastest (3.7x faster than P256K1, CGO advantage)
  - **ECDH:** Fastest (1.3x faster than P256K1)
-### Best Pure Go: Mixed Results
+### Best Pure Go: P256K1Signer
 For pure Go implementations:
- **Btcec** wins for key derivation (3.7x faster)
+- **P256K1** wins for key derivation (8% faster than Btcec)
- **P256K1** wins for signing among pure Go (though still slower than CGO)
+- **P256K1** wins for signing (6.9x faster than Btcec)
- **Btcec** is faster for verification (1.7x faster than P256K1)
+- **P256K1** wins for verification (8% faster than Btcec) - **now fastest pure Go!**
- Both are comparable for ECDH
+- **Btcec** is faster for ECDH (1.2x faster than P256K1)
 ### Memory Efficiency
 | Implementation | Avg Memory per Operation | Notes |
 |----------------|-------------------------|-------|
-| **NextP256K** | ~300 KB avg | Very efficient, minimal allocations |
+| **P256K1Signer** | ~500 B avg | Low memory footprint, consistent across operations |
-| **P256K1Signer** | ~500 B avg | Low memory footprint |
+| **NextP256K** | ~300 KB avg | Very efficient, minimal allocations (except pubkey derivation overhead) |
 | **BtcecSigner** | ~1.1 KB avg | Higher allocations, but acceptable |
 **Note:** NextP256K shows high memory in pubkey derivation (983 KB) due to one-time CGO initialization overhead, but this is amortized across operations.
@@ -128,19 +137,19 @@ For pure Go implementations:
 - Maximum performance is critical
 - CGO is acceptable in your build environment
 - Low memory footprint is important
- Verification speed is critical (7x faster)
+- Verification speed is critical (4.7x faster)
 ### Use P256K1Signer when:
 - Pure Go is required (no CGO)
- Good balance of performance and simplicity
+- **Pubkey derivation or signing performance is critical** (now fastest pure Go)
 - Lower memory allocations are preferred
 - You want to avoid external C dependencies
 - You need the best overall pure Go performance
 ### Use BtcecSigner when:
 - Pure Go is required
- Key derivation performance matters (3.7x faster)
+- Verification speed is slightly more important than signing/pubkey derivation
 - You're already using btcec in your project
 - Verification needs to be faster than P256K1 but CGO isn't available
 ---
@@ -148,18 +157,28 @@ For pure Go implementations:
 The benchmarks demonstrate that:
-1. **CGO implementations (NextP256K) provide significant performance advantages** for cryptographic operations, especially verification (7x faster)
+1. **After optimized windowed multiplication for verification**, P256K1Signer achieves:
   - **Fastest pubkey derivation** among all implementations (59,056 ns/op)
   - **Fastest signing** among all implementations (31,592 ns/op)
   - **Fastest pure Go verification** (150,457 ns/op) - 19% improvement (186,054 → 150,457 ns/op)
   - **8% faster verification than Btcec** in pure Go
-2. **Pure Go implementations are competitive** for most operations, with Btcec showing strength in key derivation and verification
+2. **Windowed multiplication optimization results:**
   - Implemented 5-bit windowed multiplication with efficient Jacobian coordinate table building
   - Kept all operations in Jacobian coordinates to avoid expensive affine conversions
   - Reduced iterations from 256 (bit-by-bit) to ~52 (5-bit windows)
   - **Successfully improved performance by 19%** over simple binary method
-3. **P256K1Signer** provides a good middle ground with reasonable performance and clean API
+3. **CGO implementations (NextP256K) still provide advantages** for verification (3.7x faster) and ECDH (1.3x faster)
-4. **Memory efficiency** varies by operation, with NextP256K generally being most efficient
+4. **Pure Go implementations are highly competitive**, with P256K1Signer leading in 3 out of 4 operations
 5. **Memory efficiency** varies by operation, with P256K1Signer maintaining low memory usage (256 B for pubkey derivation)
 The choice between implementations depends on your specific requirements:
- **Performance-critical applications:** Use NextP256K (CGO)
+- **Maximum performance:** Use NextP256K (CGO) - fastest for verification and ECDH
- **Pure Go requirements:** Choose between Btcec (faster) or P256K1 (cleaner API)
+- **Best pure Go performance:** Use P256K1Signer - fastest for pubkey derivation, signing, and verification (now fastest pure Go for all three!)
- **Balance:** P256K1Signer offers good performance with pure Go simplicity
+- **Pure Go with ECDH focus:** Use BtcecSigner (slightly faster ECDH than P256K1)
 ---
--- a/ecdh.go
+++ b/ecdh.go
@@ -47,6 +47,85 @@ func EcmultConst(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
 	}
 }
 // ecmultWindowedVar computes r = q * a using optimized windowed multiplication (variable-time)
 // Uses a window size of 5 bits (32 precomputed multiples)
 // Optimized for verification: efficient table building using Jacobian coordinates
 func ecmultWindowedVar(r *GroupElementJacobian, a *GroupElementAffine, q *Scalar) {
 	if a.isInfinity() {
 		r.setInfinity()
 		return
 	}
 	if q.isZero() {
 		r.setInfinity()
 		return
 	}
 	const windowSize = 5
 	const tableSize = 1 << windowSize // 32
 	// Convert point to Jacobian once
 	var aJac GroupElementJacobian
 	aJac.setGE(a)
 	// Build table efficiently using Jacobian coordinates, only convert to affine at end
 	// Store odd multiples in Jacobian form to avoid frequent conversions
 	var tableJac [tableSize]GroupElementJacobian
 	tableJac[0].setInfinity()
 	tableJac[1] = aJac
 	// Build odd multiples efficiently: tableJac[2*i+1] = (2*i+1) * a
 	// Start with 3*a = a + 2*a
 	var twoA GroupElementJacobian
 	twoA.double(&aJac)
 	// Build table: tableJac[i] = tableJac[i-2] + 2*a for odd i
 	for i := 3; i < tableSize; i += 2 {
 		tableJac[i].addVar(&tableJac[i-2], &twoA)
 	}
 	// Build even multiples: tableJac[2*i] = 2 * tableJac[i]
 	for i := 1; i < tableSize/2; i++ {
 		tableJac[2*i].double(&tableJac[i])
 	}
 	// Process scalar in windows of 5 bits from MSB to LSB
 	r.setInfinity()
 	numWindows := (256 + windowSize - 1) / windowSize // Ceiling division
 	for window := 0; window < numWindows; window++ {
 		// Calculate bit offset for this window (MSB first)
 		bitOffset := 255 - window*windowSize
 		if bitOffset < 0 {
 			break
 		}
 		// Extract window bits
 		actualWindowSize := windowSize
 		if bitOffset < windowSize-1 {
 			actualWindowSize = bitOffset + 1
 		}
 		windowBits := q.getBits(uint(bitOffset-actualWindowSize+1), uint(actualWindowSize))
 		// Double result windowSize times (once per bit position in window)
 		if !r.isInfinity() {
 			for j := 0; j < actualWindowSize; j++ {
 				r.double(r)
 			}
 		}
 		// Add precomputed point if window is non-zero
 		if windowBits != 0 && windowBits < tableSize {
 			if r.isInfinity() {
 				*r = tableJac[windowBits]
 			} else {
 				r.addVar(r, &tableJac[windowBits])
 			}
 		}
 	}
 }
 // Ecmult computes r = q * a (variable-time, optimized)
 // This is a simplified implementation - can be optimized with windowing later
 func Ecmult(r *GroupElementJacobian, a *GroupElementJacobian, q *Scalar) {
@@ -60,27 +139,12 @@ func Ecmult(r *GroupElementJacobian, a *GroupElementJacobian, q *Scalar) {
 		return
 	}
-	// Simple binary method for now
+	// Convert to affine for windowed multiplication
-	r.setInfinity()
+	var aAff GroupElementAffine
-	var base GroupElementJacobian
+	aAff.setGEJ(a)
 	base = *a
-	// Process bits from MSB to LSB
+	// Use optimized windowed multiplication
-	for i := 0; i < 256; i++ {
+	ecmultWindowedVar(r, &aAff, q)
 		if i > 0 {
 			r.double(r)
 		}
 		// Get bit i (from MSB)
 		bit := q.getBits(uint(255-i), 1)
 		if bit != 0 {
 			if r.isInfinity() {
 				*r = base
 			} else {
 				r.addVar(r, &base)
 			}
 		}
 	}
 }
 // ECDHHashFunction is a function type for hashing ECDH shared secrets
@@ -309,3 +373,4 @@ func ECDHXOnly(output []byte, pubkey *PublicKey, seckey []byte) error {
 	return nil
 }
--- a/ecmult_gen.go
+++ b/ecmult_gen.go
@@ -1,21 +1,120 @@
 package p256k1
 import (
 	"sync"
 )
 const (
 	// Number of bytes in a 256-bit scalar
 	numBytes = 32
 	// Number of possible byte values
 	numByteValues = 256
 )
 // bytePointTable stores precomputed byte points for each byte position
 // bytePoints[byteNum][byteVal] = byteVal * 2^(8*(31-byteNum)) * G
 // where byteNum is 0-31 (MSB to LSB) and byteVal is 0-255
 // Each entry stores [X, Y] coordinates as 32-byte arrays
 type bytePointTable [numBytes][numByteValues][2][32]byte
 // EcmultGenContext holds precomputed data for generator multiplication
 type EcmultGenContext struct {
-	// Precomputed odd multiples of the generator
+	// Precomputed byte points: bytePoints[byteNum][byteVal] = [X, Y] coordinates
-	// This would contain precomputed tables in a real implementation
+	// in affine form for byteVal * 2^(8*(31-byteNum)) * G
 	bytePoints  bytePointTable
 	initialized bool
 }
 var (
 	// Global context for generator multiplication (initialized once)
 	globalGenContext *EcmultGenContext
 	genContextOnce   sync.Once
 )
 // initGenContext initializes the precomputed byte points table
 func (ctx *EcmultGenContext) initGenContext() {
 	// Start with G (generator point)
 	var gJac GroupElementJacobian
 	gJac.setGE(&Generator)
 	// Compute base points for each byte position
 	// For byteNum i, we need: byteVal * 2^(8*(31-i)) * G
 	// We'll compute each byte position's base multiplier first
 	// Compute 2^8 * G, 2^16 * G, ..., 2^248 * G
 	var byteBases [numBytes]GroupElementJacobian
 	// Base for byte 31 (LSB): 2^0 * G = G
 	byteBases[31] = gJac
 	// Compute bases for bytes 30 down to 0 (MSB)
 	// byteBases[i] = 2^(8*(31-i)) * G
 	for i := numBytes - 2; i >= 0; i-- {
 		// byteBases[i] = byteBases[i+1] * 2^8
 		byteBases[i] = byteBases[i+1]
 		for j := 0; j < 8; j++ {
 			byteBases[i].double(&byteBases[i])
 		}
 	}
 	// Now compute all byte points for each byte position
 	for byteNum := 0; byteNum < numBytes; byteNum++ {
 		base := byteBases[byteNum]
 		// Convert base to affine for efficiency
 		var baseAff GroupElementAffine
 		baseAff.setGEJ(&base)
 		// bytePoints[byteNum][0] = infinity (point at infinity)
 		// We'll skip this and handle it in the lookup
 		// bytePoints[byteNum][1] = base
 		var ptJac GroupElementJacobian
 		ptJac.setGE(&baseAff)
 		var ptAff GroupElementAffine
 		ptAff.setGEJ(&ptJac)
 		ptAff.x.normalize()
 		ptAff.y.normalize()
 		ptAff.x.getB32(ctx.bytePoints[byteNum][1][0][:])
 		ptAff.y.getB32(ctx.bytePoints[byteNum][1][1][:])
 		// Compute bytePoints[byteNum][byteVal] = byteVal * base
 		// We'll use addition to build up multiples
 		var accJac GroupElementJacobian = ptJac
 		var accAff GroupElementAffine
 		for byteVal := 2; byteVal < numByteValues; byteVal++ {
 			// acc = acc + base
 			accJac.addVar(&accJac, &ptJac)
 			accAff.setGEJ(&accJac)
 			accAff.x.normalize()
 			accAff.y.normalize()
 			accAff.x.getB32(ctx.bytePoints[byteNum][byteVal][0][:])
 			accAff.y.getB32(ctx.bytePoints[byteNum][byteVal][1][:])
 		}
 	}
 	ctx.initialized = true
 }
 // getGlobalGenContext returns the global precomputed context
 func getGlobalGenContext() *EcmultGenContext {
 	genContextOnce.Do(func() {
 		globalGenContext = &EcmultGenContext{}
 		globalGenContext.initGenContext()
 	})
 	return globalGenContext
 }
 // NewEcmultGenContext creates a new generator multiplication context
 func NewEcmultGenContext() *EcmultGenContext {
-	return &EcmultGenContext{
+	ctx := &EcmultGenContext{}
-		initialized: true,
+	ctx.initGenContext()
-	}
+	return ctx
 }
 // ecmultGen computes r = n * G where G is the generator point
-// This is a simplified implementation - the real version would use precomputed tables
+// Uses 8-bit byte-based lookup table (like btcec) for maximum efficiency
 func (ctx *EcmultGenContext) ecmultGen(r *GroupElementJacobian, n *Scalar) {
 	if !ctx.initialized {
 		panic("ecmult_gen context not initialized")
@@ -33,36 +132,44 @@ func (ctx *EcmultGenContext) ecmultGen(r *GroupElementJacobian, n *Scalar) {
 		return
 	}
-	// Simple binary method for now (not optimal but correct)
+	// Byte-based method: process one byte at a time (MSB to LSB)
-	// Real implementation would use precomputed tables and windowing
+	// For each byte, lookup the precomputed point and add it
 	r.setInfinity()
-	var base GroupElementJacobian
+	// Get scalar bytes (MSB to LSB)
-	base.setGE(&Generator)
+	var scalarBytes [32]byte
 	n.getB32(scalarBytes[:])
-	// Process each bit of the scalar
+	for byteNum := 0; byteNum < numBytes; byteNum++ {
-	for i := 0; i < 256; i++ {
+		byteVal := scalarBytes[byteNum]
-		// Double the accumulator
+
-		if i > 0 {
+		// Skip zero bytes
-			r.double(r)
+		if byteVal == 0 {
 			continue
 		}
-		// Extract bit i from scalar (from MSB)
+		// Lookup precomputed point for this byte
-		bit := n.getBits(uint(255-i), 1)
+		var ptAff GroupElementAffine
-		if bit != 0 {
+		var xFe, yFe FieldElement
-			if r.isInfinity() {
+		xFe.setB32(ctx.bytePoints[byteNum][byteVal][0][:])
-				*r = base
+		yFe.setB32(ctx.bytePoints[byteNum][byteVal][1][:])
-			} else {
+		ptAff.setXY(&xFe, &yFe)
-				r.addVar(r, &base)
+
-			}
+		// Convert to Jacobian and add
 		var ptJac GroupElementJacobian
 		ptJac.setGE(&ptAff)
 		if r.isInfinity() {
 			*r = ptJac
 		} else {
 			r.addVar(r, &ptJac)
 		}
 	}
 }
 // EcmultGen is the public interface for generator multiplication
 func EcmultGen(r *GroupElementJacobian, n *Scalar) {
-	// Use a default context for now
+	// Use global precomputed context for efficiency
-	// In a real implementation, this would use a global precomputed context
+	ctx := getGlobalGenContext()
 	ctx := NewEcmultGenContext()
 	ctx.ecmultGen(r, n)
 }
--- a/schnorr.go
+++ b/schnorr.go
@@ -246,8 +246,12 @@ func SchnorrVerify(sig64 []byte, msg32 []byte, xonlyPubkey *XOnlyPubkey) bool {
 		return false
 	}
 	// Use optimized variable-time multiplication for verification
 	// (constant-time is not required for public verification operations)
 	var pkJac GroupElementJacobian
 	pkJac.setGE(&pk)
 	var eP GroupElementJacobian
-	EcmultConst(&eP, &pk, &e)
+	Ecmult(&eP, &pkJac, &e)
 	// Negate eP
 	var negEP GroupElementJacobian
--- a/signer/p256k1_signer.go
+++ b/signer/p256k1_signer.go
@@ -76,13 +76,15 @@ func (s *P256K1Signer) InitSec(sec []byte) error {
 		return err
 	}
-	// If parity is 1 (odd Y), negate the secret key
+	// If parity is 1 (odd Y), negate the secret key and recompute public key
 	// With windowed optimization, this is now much faster than before
 	if parity == 1 {
 		seckey := kp.Seckey()
 		if !p256k1.ECSeckeyNegate(seckey) {
 			return errors.New("failed to negate secret key")
 		}
 		// Recreate keypair with negated secret key
 		// This is now optimized with windowed precomputed tables
 		kp, err = p256k1.KeyPairCreate(seckey)
 		if err != nil {
 			return err