p256k1/README.md

# secp256k1 Go Implementation

This package provides a pure Go implementation of the secp256k1 elliptic curve cryptographic primitives, ported from the libsecp256k1 C library.

## Features Implemented

### ✅ Core Components
- **Field Arithmetic** (`field.go`, `field_mul.go`): Complete implementation of field operations modulo the secp256k1 field prime (2^256 - 2^32 - 977)
  - 5x52-bit limb representation for efficient arithmetic
  - Addition, multiplication, squaring, inversion operations
  - Constant-time normalization and magnitude management

- **Scalar Arithmetic** (`scalar.go`): Complete implementation of scalar operations modulo the group order
  - 4x64-bit limb representation
  - Addition, multiplication, inversion, negation operations
  - Proper overflow handling and reduction

- **Group Operations** (`group.go`): Elliptic curve point operations
  - Affine and Jacobian coordinate representations
  - Point addition, doubling, negation
  - Coordinate conversion between representations

- **Context Management** (`context.go`): Context objects for enhanced security
  - Context creation, cloning, destruction
  - Randomization for side-channel protection
  - Callback management for error handling

- **Main API** (`secp256k1.go`): Core secp256k1 API functions
  - Public key parsing, serialization, and comparison
  - ECDSA signature parsing and serialization
  - Key generation and verification
  - Basic ECDSA signing and verification (simplified implementation)

- **Utilities** (`util.go`): Helper functions and constants
  - Memory management utilities
  - Endianness conversion functions
  - Bit manipulation utilities
  - Error handling and callbacks

### ✅ Testing
- Comprehensive test suite (`secp256k1_test.go`) covering:
  - Basic functionality and self-tests
  - Field element operations
  - Scalar operations
  - Key generation
  - Signature operations
  - Public key operations
  - Performance benchmarks

## Usage

```go
package main

import (
    "fmt"
    "crypto/rand"
    p256k1 "p256k1.mleku.dev/pkg"
)

func main() {
    // Create context
    ctx, err := p256k1.ContextCreate(p256k1.ContextNone)
    if err != nil {
        panic(err)
    }
    defer p256k1.ContextDestroy(ctx)

    // Generate secret key
    var seckey [32]byte
    rand.Read(seckey[:])

    // Verify secret key
    if !p256k1.ECSecKeyVerify(ctx, seckey[:]) {
        panic("Invalid secret key")
    }

    // Create public key
    var pubkey p256k1.PublicKey
    if !p256k1.ECPubkeyCreate(ctx, &pubkey, seckey[:]) {
        panic("Failed to create public key")
    }

    fmt.Println("Successfully created secp256k1 key pair!")
}
```

## Architecture

The implementation follows the same architectural patterns as libsecp256k1:

1. **Layered Design**: Low-level field/scalar arithmetic → Group operations → High-level API
2. **Constant-Time Operations**: Designed to prevent timing side-channel attacks
3. **Magnitude Tracking**: Field elements track their "magnitude" to optimize operations
4. **Context Objects**: Encapsulate state and provide enhanced security features

## Performance

Benchmark results on AMD Ryzen 5 PRO 4650G:
- Field Addition: ~2.4 ns/op
- Scalar Multiplication: ~9.9 ns/op

## AVX2 Acceleration Opportunities

The Scalar and FieldElement types and their operations are designed with data layouts that are amenable to AVX2 SIMD acceleration:

### Scalar Type (`scalar.go`)
- **Representation**: 4×64-bit limbs (`[4]uint64`) representing 256-bit scalars
- **AVX2-Acceleratable Operations**:
  - `scalarAdd` / `scalarMul`: 256-bit integer arithmetic using `VPADDD/Q`, `VPMULUDQ`
  - `mul512`: Full 512-bit product computation - can use AVX2's 256-bit registers to process limb pairs in parallel
  - `reduce512`: Modular reduction with Montgomery-style operations
  - `wNAF`: Window Non-Adjacent Form conversion for scalar multiplication
  - `splitLambda`: GLV endomorphism scalar splitting

### FieldElement Type (`field.go`, `field_mul.go`)
- **Representation**: 5×52-bit limbs (`[5]uint64`) in base 2^52 for efficient multiplication
- **AVX2-Acceleratable Operations**:
  - `mul` / `sqr`: Field multiplication/squaring using 128-bit intermediate products
  - `normalize` / `normalizeWeak`: Carry propagation across limbs
  - `add` / `negate`: Parallel limb operations ideal for `VPADDQ`, `VPSUBQ`
  - `inv`: Modular inversion via Fermat's little theorem (chain of sqr/mul)
  - `sqrt`: Square root computation using addition chains

### Affine/Jacobian Group Operations (`group.go`)
- **Types**: `GroupElementAffine` (x, y coordinates), `GroupElementJacobian` (x, y, z coordinates)
- **AVX2-Acceleratable Operations**:
  - `double`: Point doubling - multiple independent field operations
  - `addVar` / `addGE`: Point addition - parallelizable field multiplications
  - `setGEJ`: Coordinate conversion with batch field inversions

### Key AVX2 Instructions for Implementation

| Operation | Relevant AVX2 Instructions |
|-----------|---------------------------|
| 128-bit limb add | `VPADDQ` (packed 64-bit add) with carry chain |
| Limb multiplication | `VPMULUDQ` (unsigned 32×32→64), `VPCLMULQDQ` (carryless multiply) |
| 128-bit arithmetic | `VPMULLD`, `VPMULUDQ` for multi-precision products |
| Carry propagation | `VPSRLQ`/`VPSLLQ` (shift), `VPAND` (mask), `VPALIGNR` |
| Conditional moves | `VPBLENDVB` (blend based on mask) |
| Data movement | `VMOVDQU` (unaligned load/store), `VBROADCASTI128` |

### 128-bit Limb Representation with AVX2

AVX2's 256-bit YMM registers can natively hold two 128-bit limbs, enabling more efficient representations:

**Scalar (256-bit) with 2×128-bit limbs:**
```
YMM0 = [scalar.d[1]:scalar.d[0]] | [scalar.d[3]:scalar.d[2]]
       ├── 128-bit limb 0 ───────┤ ├── 128-bit limb 1 ───────┤
```
- A single 256-bit scalar fits in one YMM register as two 128-bit limbs
- Addition/subtraction can use `VPADDQ` with manual carry handling between 64-bit halves
- The 4×64-bit representation naturally maps to 2×128-bit by treating pairs

**FieldElement (260-bit effective) with 128-bit limbs:**
```
YMM0 = [fe.n[0]:fe.n[1]] (lower 104 bits used per pair)
YMM1 = [fe.n[2]:fe.n[3]]
XMM2 = [fe.n[4]:0]       (upper 48 bits)
```
- 5×52-bit limbs can be reorganized into 3×128-bit containers
- Multiplication benefits from `VPMULUDQ` processing two 64×64→128 products simultaneously

**512-bit Intermediate Products:**
- Scalar multiplication produces 512-bit intermediates
- Two YMM registers hold the full product: `YMM0 = [l[1]:l[0]], YMM1 = [l[3]:l[2]], YMM2 = [l[5]:l[4]], YMM3 = [l[7]:l[6]]`
- Reduction can proceed in parallel across register pairs

### Implementation Approach

AVX2 acceleration can be added via Go assembly (`.s` files) using the patterns described in `AVX.md`:

```go
//go:build amd64

package p256k1

// FieldMulAVX2 multiplies two field elements using AVX2
// Uses 128-bit limb operations for ~2x throughput
//go:noescape
func FieldMulAVX2(r, a, b *FieldElement)

// ScalarMulAVX2 multiplies two scalars using AVX2
// Processes scalar as 2×128-bit limbs in a single YMM register
//go:noescape
func ScalarMulAVX2(r, a, b *Scalar)

// ScalarAdd256AVX2 adds two 256-bit scalars using 128-bit limb arithmetic
//go:noescape
func ScalarAdd256AVX2(r, a, b *Scalar) bool
```

The key insight is that AVX2's 256-bit registers holding 128-bit limb pairs enable:
- **2x parallelism** for addition/subtraction across limb pairs
- **Efficient carry chains** using `VPSRLQ` to extract carries and `VPADDQ` to propagate
- **Reduced loop iterations** for multi-precision arithmetic (2 iterations for 256-bit instead of 4)

## Implementation Status

### ✅ Completed
- Core field and scalar arithmetic
- Basic group operations
- Context management
- Main API structure
- Key generation and verification
- Basic signature operations
- Comprehensive test suite

### 🚧 Simplified/Placeholder
- **ECDSA Implementation**: Basic structure in place, but signing/verification uses simplified algorithms
- **Field Multiplication**: Uses simplified approach instead of optimized assembly
- **Point Validation**: Curve equation checking is simplified
- **Nonce Generation**: Uses crypto/rand instead of RFC 6979

### ❌ Not Yet Implemented
- **Hash Functions**: SHA-256 and tagged hash implementations
- **Optimized Multiplication**: Full constant-time field multiplication
- **Precomputed Tables**: Optimized scalar multiplication with precomputed points
- **Optional Modules**: Schnorr signatures, ECDH, extra keys
- **Recovery**: Public key recovery from signatures
- **Complete ECDSA**: Full constant-time ECDSA implementation

## Security Considerations

⚠️ **This implementation is for educational/development purposes and should not be used in production without further security review and completion of the cryptographic implementations.**

Key security features implemented:
- Constant-time field operations (basic level)
- Magnitude tracking to prevent overflows
- Memory clearing for sensitive data
- Context randomization support

Key security features still needed:
- Complete constant-time ECDSA implementation
- Proper nonce generation (RFC 6979)
- Side-channel resistance verification
- Comprehensive security testing

## Building and Testing

```bash
cd pkg/
go test -v          # Run all tests
go test -bench=.    # Run benchmarks
go build            # Build the package
```

## License

This implementation is derived from libsecp256k1 and maintains the same MIT license.