initial addition of essential crypto, encoders, workflows and LLM instructions

2025-08-20 05:47:06 +01:00
parent f449a9d415
commit b8db587d7b
159 changed files with 36993 additions and 10 deletions
--- a/pkg/crypto/sha256/README.md
+++ b/pkg/crypto/sha256/README.md
@@ -0,0 +1,197 @@
+# sha256-simd
+
+Accelerate SHA256 computations in pure Go using AVX512 and SHA Extensions on
+x86, and the ARM64 SHA2 instructions on ARM.
+On AVX512 it provides an up to 8x improvement (over 3 GB/s per core).
+SHA Extensions give a performance boost of close to 4x over the AVX2 code path.
+
+## Introduction
+
+This package is designed as a replacement for `crypto/sha256`.
+For ARM CPUs with the Cryptography Extensions, advantage is taken of the SHA2
+instructions resulting in a massive performance improvement.
+
+This package uses Go assembly.
+The AVX512 version is based on Intel's "multi-buffer crypto library for
+IPSec" whereas the other Intel implementations are described in "Fast SHA-256
+Implementations on Intel Architecture Processors" by J. Guilford et al.
+
+## Support for Intel SHA Extensions
+
+Support for the Intel SHA Extensions was added by Kristofer Peterson
+(@svenski123), and was originally developed for
+spacemeshos [here](https://github.com/spacemeshos/POET/issues/23). On CPUs that
+support it (known so far: Intel Celeron J3455 and AMD Ryzen) it gives a
+significant boost in performance (with thanks to @AudriusButkevicius for
+reporting the results; full
+results [here](https://github.com/minio/sha256-simd/pull/37#issuecomment-451607827)).
+
+```
+$ benchcmp avx2.txt sha-ext.txt
+benchmark           AVX2 MB/s    SHA Ext MB/s  speedup
+BenchmarkHash5M     514.40       1975.17       3.84x
+```
+
+Thanks to Kristofer Peterson, we also added further performance changes, such
+as optimized padding and endian conversions, which sped up all implementations.
+The Intel SHA path alone doubled performance for small sizes, while the other
+changes improved everything by roughly 50%.
+
+## Support for AVX512
+
+We have added support for AVX512 which results in an up to 8x performance
+improvement over AVX2 (3.0 GHz Xeon Platinum 8124M CPU):
+
+```
+$ benchcmp avx2.txt avx512.txt
+benchmark           AVX2 MB/s    AVX512 MB/s  speedup
+BenchmarkHash5M     448.62       3498.20      7.80x
+```
+
+The original code was developed by Intel as part of
+the [multi-buffer crypto library](https://github.com/intel/intel-ipsec-mb) for
+IPSec or more specifically
+this [AVX512](https://github.com/intel/intel-ipsec-mb/blob/master/avx512/sha256_x16_avx512.asm)
+implementation. The key idea behind it is to process a total of 16 checksums in
+parallel by “transposing” 16 (independent) messages of 64 bytes between a total
+of 16 ZMM registers (each 64 bytes wide).
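+
+The transposition can be pictured in plain Go. The sketch below is illustrative
+only: it is not the package's assembly, and the `transpose16` name and the array
+layout are assumptions made for this example. Each 64-byte block is viewed as 16
+big-endian 32-bit words, and each output row stands in for one ZMM register.
+
+```go
+package main
+
+import (
+	"encoding/binary"
+	"fmt"
+)
+
+// transpose16 takes one 64-byte block from each of 16 independent messages
+// and returns 16 rows of 16 words. Row w holds word w of every message, so
+// lane m of each row belongs to message m.
+func transpose16(blocks [16][64]byte) [16][16]uint32 {
+	var rows [16][16]uint32
+	for m := 0; m < 16; m++ {
+		for w := 0; w < 16; w++ {
+			// SHA-256 reads its input as big-endian 32-bit words.
+			rows[w][m] = binary.BigEndian.Uint32(blocks[m][w*4:])
+		}
+	}
+	return rows
+}
+
+func main() {
+	var blocks [16][64]byte
+	for m := range blocks {
+		blocks[m][3] = byte(m) // word 0 of message m has the value m
+	}
+	rows := transpose16(blocks)
+	fmt.Println(rows[0]) // prints "[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]"
+}
+```
+
+With this layout a single vector instruction applied to a row advances the same
+step of all 16 hashes at once.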
+
+Transposing the input messages means that, to take full advantage of the
+speedup, you need a (server) workload where multiple threads are doing SHA256
+calculations in parallel. With this algorithm, two message blocks processed in
+parallel cannot depend on one another, because the (interim) result of the
+first part of a message has to be an input into the processing of the second
+part of that message.
+
+Whereas the original Intel C implementation requires some sort of explicit
+scheduling of messages to be processed in parallel, for Golang it makes sense to
+take advantage of channels in order to group messages together and use channels
+as well for sending back the results (thereby effectively decoupling the
+calculations). We have implemented a fairly simple scheduling mechanism that
+seems to work well in practice.
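+
+The general shape of such channel-based grouping can be sketched as follows.
+This is a simplified illustration, not the package's scheduler: the `request`
+type and the `serve` function are hypothetical, and the standard library's
+`crypto/sha256` stands in for the 16-lane AVX512 kernel.
+
+```go
+package main
+
+import (
+	"crypto/sha256"
+	"fmt"
+	"sync"
+)
+
+// request carries one message and a private channel for its result.
+type request struct {
+	msg   []byte
+	reply chan [32]byte
+}
+
+// serve groups up to lanes queued requests into a batch, hashes the batch,
+// and sends each digest back on the request's own reply channel.
+func serve(in <-chan request, lanes int) {
+	batch := make([]request, 0, lanes)
+	for first := range in {
+		batch = append(batch[:0], first)
+		// Fill the remaining lanes with whatever is already waiting,
+		// without blocking when the queue is empty.
+	fill:
+		for len(batch) < lanes {
+			select {
+			case r, ok := <-in:
+				if !ok {
+					break fill
+				}
+				batch = append(batch, r)
+			default:
+				break fill
+			}
+		}
+		// A real kernel would process the whole batch in one pass.
+		for _, r := range batch {
+			r.reply <- sha256.Sum256(r.msg)
+		}
+	}
+}
+
+func main() {
+	in := make(chan request)
+	go serve(in, 16)
+
+	msgs := []string{"a", "b", "c"}
+	digests := make([][32]byte, len(msgs))
+	var wg sync.WaitGroup
+	for i, m := range msgs {
+		wg.Add(1)
+		go func(i int, m string) {
+			defer wg.Done()
+			// The buffered reply channel keeps the server from blocking.
+			r := request{msg: []byte(m), reply: make(chan [32]byte, 1)}
+			in <- r
+			digests[i] = <-r.reply
+		}(i, m)
+	}
+	wg.Wait()
+	close(in)
+	for i, d := range digests {
+		fmt.Printf("%s %x\n", msgs[i], d)
+	}
+}
+```
+
+Because every caller waits on its own reply channel, callers stay decoupled
+from one another even though their messages are hashed together.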
+
+Due to this different way of scheduling, we decided to use an explicit method to
+instantiate the AVX512 version. Essentially one or more AVX512 processing servers
+([`Avx512Server`](https://github.com/minio/sha256-simd/blob/master/sha256blockAvx512_amd64.go#L294))
+have to be created, whereby each server can hash over 3 GB/s on a single core. A
+`hash.Hash` object
+([`Avx512Digest`](https://github.com/minio/sha256-simd/blob/master/sha256blockAvx512_amd64.go#L45))
+is then instantiated using one of these servers and used in the regular fashion:
+
+```go
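+package main
+
+import (
+	"fmt"
+
+	"mleku.dev/pkg/sha256"
+)
+
+func main() {
+	fileBlock := []byte("example data") // placeholder input; use your own data
+	server := sha256.NewAvx512Server()
+	h512 := sha256.NewAvx512(server)
+	h512.Write(fileBlock)
+	digest := h512.Sum(nil)
+	fmt.Printf("%x\n", digest)
+}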
+```
+
+Note that, because of the scheduling overhead, for small messages (< 1 MB) you
+will be better off using the regular SHA256 hashing (but those are typically not
+performance critical anyway). Some other tips to get the best performance:
+
+* Have many goroutines doing SHA256 calculations in parallel.
+* Try to `Write()` messages in multiples of 64 bytes.
+* Try to keep the overall length of messages to a roughly similar size, e.g. 5
+  MB (this way all 16 ‘lanes’ in the AVX512 computations are contributing as
+  much as possible).
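+
+One way to follow the second tip is to feed the hash from a buffer whose size is
+a multiple of the 64-byte block size. The sketch below is a minimal example of
+that idea; `hashAligned` is a hypothetical helper, and it uses the standard
+library's `crypto/sha256` so that it runs on any machine.
+
+```go
+package main
+
+import (
+	"bytes"
+	"crypto/sha256"
+	"fmt"
+	"hash"
+	"io"
+)
+
+// hashAligned reads r in chunks of bufSize bytes and writes each full chunk
+// to h. When bufSize is a multiple of h.BlockSize(), only the final Write
+// can be a partial block.
+func hashAligned(h hash.Hash, r io.Reader, bufSize int) ([]byte, error) {
+	buf := make([]byte, bufSize)
+	for {
+		n, err := io.ReadFull(r, buf)
+		if n > 0 {
+			h.Write(buf[:n]) // hash.Hash writes never return an error
+		}
+		if err == io.EOF || err == io.ErrUnexpectedEOF {
+			return h.Sum(nil), nil // end of input, possibly after a short read
+		}
+		if err != nil {
+			return nil, err
+		}
+	}
+}
+
+func main() {
+	data := bytes.Repeat([]byte{0xab}, 1000) // deliberately not a multiple of 64
+	sum, err := hashAligned(sha256.New(), bytes.NewReader(data), 64*sha256.BlockSize)
+	if err != nil {
+		panic(err)
+	}
+	want := sha256.Sum256(data)
+	fmt.Println(bytes.Equal(sum, want[:])) // prints "true"
+}
+```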
+
+More detailed information can be found in
+this [blog](https://blog.minio.io/accelerate-sha256-up-to-8x-over-3-gb-s-per-core-with-avx512-a0b1d64f78f)
+post including scaling across cores.
+
+## Drop-In Replacement
+
+The following code snippet shows how you can use `github.com/minio/sha256-simd`.
+This will automatically select the fastest method for the architecture on which
+it will be executed.
+
+```go
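+package main
+
+import (
+	"fmt"
+	"io"
+	"os"
+
+	sha256 "github.com/minio/sha256-simd"
+)
+
+func main() {
+	file := os.Stdin // any io.Reader works here, e.g. an *os.File
+	shaWriter := sha256.New()
+	if _, err := io.Copy(shaWriter, file); err != nil {
+		panic(err)
+	}
+	fmt.Printf("%x\n", shaWriter.Sum(nil))
+}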
+```
+
+## Performance
+
+Below is the speed in MB/s for a single core (ranked fast to slow) for blocks
+larger than 1 MB.
+
+| Processor                         | SIMD    | Speed (MB/s) |
+|-----------------------------------|---------|-------------:|
+| 3.0 GHz Intel Xeon Platinum 8124M | AVX512  |         3498 |
+| 3.7 GHz AMD Ryzen 7 2700X         | SHA Ext |         1979 |
+| 1.2 GHz ARM Cortex-A53            | ARM64   |          638 |
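+
+Single-core throughput of this kind can be measured with Go's `testing`
+package. The sketch below is not the harness used for the table above: the
+`throughputMBps` helper is hypothetical, and it benchmarks the standard
+library's `crypto/sha256`, so replace the call inside the loop with the hash
+under test.
+
+```go
+package main
+
+import (
+	"crypto/sha256"
+	"fmt"
+	"testing"
+)
+
+// throughputMBps hashes a buffer of the given size repeatedly and returns
+// the measured throughput in MB/s (1 MB = 1e6 bytes).
+func throughputMBps(size int) float64 {
+	buf := make([]byte, size)
+	res := testing.Benchmark(func(b *testing.B) {
+		b.SetBytes(int64(size)) // bytes processed per iteration
+		for i := 0; i < b.N; i++ {
+			sha256.Sum256(buf)
+		}
+	})
+	return float64(res.Bytes) * float64(res.N) / 1e6 / res.T.Seconds()
+}
+
+func main() {
+	// 5 MiB blocks, similar in spirit to the Hash5M benchmarks.
+	fmt.Printf("%.2f MB/s\n", throughputMBps(5<<20))
+}
+```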
+
+## asm2plan9s
+
+In order to be able to work more easily with AVX512/AVX2 instructions, a
+separate tool was developed to convert SIMD instructions into the corresponding
+BYTE sequence as accepted by Go assembly.
+See [asm2plan9s](https://github.com/minio/asm2plan9s) for more information.
+
+## Why and benefits
+
+One of the most performance-sensitive parts of
+the [Minio](https://github.com/minio/minio) object storage server is related to
+SHA256 hash sum calculations. For instance, during multipart uploads each part
+that is uploaded needs to be verified for data integrity by the server.
+
+Other applications that can benefit from enhanced SHA256 performance are
+deduplication in storage systems, intrusion detection, version control systems,
+integrity checking, etc.
+
+## ARM SHA Extensions
+
+The 64-bit ARMv8 core has introduced new instructions for SHA1 and SHA2
+acceleration as part of
+the [Cryptography Extensions](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0501f/CHDFJBCJ.html).
+Below you can see a small excerpt highlighting one of the rounds as is done for
+the SHA256 calculation process (for full code
+see [sha256block_arm64.s](https://github.com/minio/sha256-simd/blob/master/sha256block_arm64.s)).
+
+ ```
+ sha256h    q2, q3, v9.4s
+ sha256h2   q3, q4, v9.4s
+ sha256su0  v5.4s, v6.4s
+ rev32      v8.16b, v8.16b
+ add        v9.4s, v7.4s, v18.4s
+ mov        v4.16b, v2.16b
+ sha256h    q2, q3, v10.4s
+ sha256h2   q3, q4, v10.4s
+ sha256su0  v6.4s, v7.4s
+ sha256su1  v5.4s, v7.4s, v8.4s
+ ```
+
+### Detailed benchmarks
+
+Benchmarks generated on a 1.2 GHz quad-core ARM Cortex-A53
+equipped [Pine64](https://www.pine64.com/).
+
+```
+minio@minio-arm:$ benchcmp golang.txt arm64.txt
+benchmark                 golang         arm64        speedup
+BenchmarkHash8Bytes-4     0.68 MB/s      5.70 MB/s      8.38x
+BenchmarkHash1K-4         5.65 MB/s    326.30 MB/s     57.75x
+BenchmarkHash8K-4         6.00 MB/s    570.63 MB/s     95.11x
+BenchmarkHash1M-4         6.05 MB/s    638.23 MB/s    105.49x
+```
+
+## License
+
+Released under the Apache License v2.0. You can find the complete text in the
+file LICENSE.
+
+## Contributing
+
+Contributions are welcome; please send PRs for any enhancements.