This changes the mmap strategy used in the compiler backend.
Previously, we issued one mmap syscall per function, allocating
executable pages each time. mmap can only allocate in multiples of the
underlying OS's page size, so even when the requested executable code is
smaller than a page, the entire page is marked as executable and cannot
be reused by the Go runtime. Therefore, we wasted roughly
`(osPageSize - len(body)%osPageSize)` bytes per function.
Even though we still need to align each function on a 16-byte boundary
when mmapping per module, the wasted space is much smaller than before,
as the sketch below illustrates.
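To make the arithmetic concrete, here is a rough, self-contained sketch
(illustrative only, not wazero's actual code) comparing the bytes wasted
by per-function mmaps against a single per-module mmap; the page size and
function body sizes are hypothetical.
```
// Rough illustration of page waste: per-function vs per-module mmap.
package main

import "fmt"

// wastePerFunction: each function gets its own mapping, so every body that
// does not end exactly on a page boundary wastes the tail of its last page.
func wastePerFunction(bodyLens []int, pageSize int) (waste int) {
	for _, n := range bodyLens {
		if r := n % pageSize; r != 0 {
			waste += pageSize - r
		}
	}
	return waste
}

// wastePerModule: all bodies share one mapping, each aligned to a 16-byte
// boundary; only the tail of the final page is wasted.
func wastePerModule(bodyLens []int, pageSize int) int {
	total := 0
	for _, n := range bodyLens {
		total += (n + 15) &^ 15 // round up to the 16-byte alignment
	}
	if r := total % pageSize; r != 0 {
		return pageSize - r
	}
	return 0
}

func main() {
	bodies := []int{100, 4096, 300}              // hypothetical machine-code sizes
	fmt.Println(wastePerFunction(bodies, 16384)) // 44656 bytes wasted
	fmt.Println(wastePerModule(bodies, 16384))   // 11872 bytes wasted
}
```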
The following benchmark results show that this improves the overall
compilation performance while increasing heap usage. However, the
increased heap usage is more than offset by the hidden wasted memory
pages, which are not measured by Go's -benchmem.
In fact, in my experiments I observed that roughly 20~30 MB were
previously wasted on arm64, which is larger than the heap usage increase
in this result. More importantly, the increased heap usage is a target
of GC and should be negligible in a long-running program, whereas the
wasted pages persist until the CompiledModule is closed.
Beyond the compilation time itself, the results indicate that this could
also improve the overall performance of the Go runtime, likely because
we no longer abuse runtime.SetFinalizer: you can see that the subsequent
interpreter benchmark results improve as well.
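As a minimal sketch of that pattern (unix-only, with hypothetical names;
wazero's real code differs), mapping the whole module once means a single
runtime.SetFinalizer registration per module rather than one per function:
```
// Hypothetical sketch: one mmap and one finalizer per compiled module.
package modulesketch

import (
	"runtime"
	"syscall"
)

type compiledModule struct {
	executable []byte // one mapping holding all functions' machine code
}

func mmapModule(code []byte) (*compiledModule, error) {
	buf, err := syscall.Mmap(-1, 0, len(code),
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_PRIVATE|syscall.MAP_ANON)
	if err != nil {
		return nil, err
	}
	copy(buf, code)
	// Flip the mapping to read+execute only after copying the code in.
	if err := syscall.Mprotect(buf, syscall.PROT_READ|syscall.PROT_EXEC); err != nil {
		_ = syscall.Munmap(buf)
		return nil, err
	}
	m := &compiledModule{executable: buf}
	// One finalizer per module, instead of one per function as before.
	runtime.SetFinalizer(m, func(m *compiledModule) {
		_ = syscall.Munmap(m.executable)
	})
	return m, nil
}
```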
```
goos: darwin
goarch: arm64
pkg: github.com/tetratelabs/wazero/internal/integration_test/bench
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Compilation_sqlite3/compiler-10 183.4m ± 0% 175.9m ± 2% -4.10% (p=0.001 n=7)
Compilation_sqlite3/interpreter-10 61.59m ± 0% 59.57m ± 0% -3.29% (p=0.001 n=7)
geomean 106.3m 102.4m -3.69%
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
Compilation_sqlite3/compiler-10 42.93Mi ± 0% 54.33Mi ± 0% +26.56% (p=0.001 n=7)
Compilation_sqlite3/interpreter-10 51.75Mi ± 0% 51.75Mi ± 0% -0.01% (p=0.001 n=7)
geomean 47.13Mi 53.02Mi +12.49%
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
Compilation_sqlite3/compiler-10 26.07k ± 0% 26.06k ± 0% ~ (p=0.149 n=7)
Compilation_sqlite3/interpreter-10 13.90k ± 0% 13.90k ± 0% ~ (p=0.421 n=7)
geomean 19.03k 19.03k -0.02%
goos: linux
goarch: amd64
pkg: github.com/tetratelabs/wazero/internal/integration_test/bench
cpu: AMD Ryzen 9 3950X 16-Core Processor
│ old.txt │ new.txt │
│ sec/op │ sec/op vs base │
Compilation_sqlite3/compiler-32 384.4m ± 2% 373.0m ± 4% -2.97% (p=0.001 n=7)
Compilation_sqlite3/interpreter-32 86.09m ± 4% 65.05m ± 2% -24.44% (p=0.001 n=7)
geomean 181.9m 155.8m -14.38%
│ old.txt │ new.txt │
│ B/op │ B/op vs base │
Compilation_sqlite3/compiler-32 49.40Mi ± 0% 59.91Mi ± 0% +21.29% (p=0.001 n=7)
Compilation_sqlite3/interpreter-32 51.77Mi ± 0% 51.76Mi ± 0% -0.02% (p=0.001 n=7)
geomean 50.57Mi 55.69Mi +10.12%
│ old.txt │ new.txt │
│ allocs/op │ allocs/op vs base │
Compilation_sqlite3/compiler-32 28.70k ± 0% 28.70k ± 0% ~ (p=0.925 n=7)
Compilation_sqlite3/interpreter-32 14.00k ± 0% 14.00k ± 0% -0.04% (p=0.010 n=7)
geomean 20.05k 20.04k -0.02%
```
resolves #1060
Signed-off-by: Takeshi Yoneda <takeshi@tetrate.io>
This switches to gofumpt and applies its changes. While working in dapr
(which uses this), I've noticed that gofumpt catches some annoying
things, such as inconsistent block formatting in test tables.
Signed-off-by: Adrian Cole <adrian@tetrate.io>
This simplifies the calling convention and consolidates the call frame
stack and value stack into a single stack.
As a result, the cost of function calls decreases because we no longer
need to check two stack boundaries (one for the value stack, one for the
call frame stack) on each function call.
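Conceptually (a sketch with made-up names, not the actual compiler code),
a unified stack lets the engine guard a call with one bounds check where
two were needed before:
```
// Hypothetical sketch of the bounds check saved by a unified stack.
package enginesketch

type engine struct {
	stack []uint64 // values and call frames now live in one slice
	sp    int      // stack pointer
}

// ensure grows the unified stack when a call needs more room. Previously an
// equivalent check had to run twice per call: once for the value stack and
// once for the call frame stack.
func (e *engine) ensure(needed int) {
	if e.sp+needed > len(e.stack) {
		grown := make([]uint64, 2*(e.sp+needed))
		copy(grown, e.stack[:e.sp])
		e.stack = grown
	}
}
```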
The following benchmark results for the recursive Fibonacci function in
integration_test/bench/testdata/case.go show that this indeed improves
the performance of function calls.
[amd64]
name old time/op new time/op delta
Invocation/compiler/fib_for_5-32 109ns ± 3% 81ns ± 1% -25.86% (p=0.008 n=5+5)
Invocation/compiler/fib_for_10-32 556ns ± 3% 473ns ± 3% -14.99% (p=0.008 n=5+5)
Invocation/compiler/fib_for_20-32 61.4µs ± 2% 55.9µs ± 5% -8.98% (p=0.008 n=5+5)
Invocation/compiler/fib_for_30-32 7.41ms ± 3% 6.83ms ± 3% -7.90% (p=0.008 n=5+5)
[arm64]
name old time/op new time/op delta
Invocation/compiler/fib_for_5-10 67.7ns ± 1% 60.2ns ± 1% -11.12% (p=0.000 n=9+9)
Invocation/compiler/fib_for_10-10 487ns ± 1% 460ns ± 0% -5.56% (p=0.000 n=10+9)
Invocation/compiler/fib_for_20-10 58.0µs ± 1% 54.3µs ± 1% -6.38% (p=0.000 n=10+10)
Invocation/compiler/fib_for_30-10 7.12ms ± 1% 6.67ms ± 1% -6.31% (p=0.000 n=10+9)
Signed-off-by: Takeshi Yoneda <takeshi@tetrate.io>
This removes the embedding of jump table pointers (the uintptr of a
[]byte) used by BrTable operations. That was the last usage of
unsafe.Pointer in the compiler implementations.
Instead, we treat jump tables as asm.StaticConst and emit them into the
constPool, which is already implemented and used in various places.
Notably, the native code produced by the compilers is now reusable
across multiple processes, since it is independent of any runtime
pointers.
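To illustrate the idea (a sketch under assumed names; the real encoding
lives behind asm.StaticConst), a table can be emitted as
position-independent offsets instead of absolute pointers:
```
// Hypothetical sketch: encode br_table targets as offsets relative to the
// table itself, so the emitted bytes contain no process-specific addresses.
package jumptablesketch

import "encoding/binary"

// buildJumpTable returns bytes destined for the constant pool: one 32-bit
// offset per branch target, relative to the table's own position. Generated
// code would add a loaded entry to the table's address at runtime
// (PC-relative), so the same bytes are valid in any process.
func buildJumpTable(targetOffsets []int64, tableOffset int64) []byte {
	table := make([]byte, 4*len(targetOffsets))
	for i, t := range targetOffsets {
		binary.LittleEndian.PutUint32(table[4*i:], uint32(t-tableOffset))
	}
	return table
}
```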
Signed-off-by: Takeshi Yoneda <takeshi@tetrate.io>
This notably changes NewRuntimeJIT to NewRuntimeCompiler, as well as
renaming the packages from jit to compiler.
This clarifies that the implementation is AOT, not JIT, at least with
respect to where compilation occurs (Runtime.CompileModule). In doing
so, we reduce any concern that compilation will happen during function
execution. We also free ourselves to introduce a JIT option in the
future, via CompileConfig or otherwise, without confusion.
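For example (assuming a recent wazero API; exact signatures may differ
across versions, and module.wasm is a placeholder), compilation happens
eagerly at CompileModule, and Call only executes already-compiled code:
```
// Sketch: AOT compilation at CompileModule; no compilation during Call.
package main

import (
	"context"
	"log"
	"os"

	"github.com/tetratelabs/wazero"
)

func main() {
	ctx := context.Background()
	r := wazero.NewRuntimeWithConfig(ctx, wazero.NewRuntimeConfigCompiler())
	defer r.Close(ctx)

	wasmBytes, err := os.ReadFile("module.wasm") // placeholder module path
	if err != nil {
		log.Fatal(err)
	}

	compiled, err := r.CompileModule(ctx, wasmBytes) // AOT happens here
	if err != nil {
		log.Fatal(err)
	}

	mod, err := r.InstantiateModule(ctx, compiled, wazero.NewModuleConfig())
	if err != nil {
		log.Fatal(err)
	}

	// The exported function was compiled ahead of time; Call just runs it.
	if _, err := mod.ExportedFunction("_start").Call(ctx); err != nil {
		log.Fatal(err)
	}
}
```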
Fixes #560
Signed-off-by: Adrian Cole <adrian@tetrate.io>