wazevo(docs): optimizing compiler (#2065)
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
@@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim
 In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT)
 during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute
 natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by order of
-magnitude (10x) or more, and therefore enabled by default whenever available.
+magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's
+[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}).
 
 #### Interpreter
 
131
site/content/docs/how_the_optimizing_compiler_works/_index.md
Normal file
@@ -0,0 +1,131 @@
+++
title = "How the Optimizing Compiler Works"
layout = "single"
+++

wazero supports two modes of execution: interpreter mode and compilation mode.
The interpreter mode is a fallback mode for platforms where compilation is not
supported. Compilation mode is otherwise the default mode of execution: it
translates Wasm modules to native code to get the best run-time performance.

Translating Wasm bytecode into machine code can take multiple forms. wazero
1.0 performs a straightforward translation from a given instruction to a native
instruction. wazero 2.0 introduces an optimizing compiler that is able to
perform nontrivial optimizing transformations, such as constant folding or
dead-code elimination, and it makes better use of the underlying hardware, such
as CPU registers. This document digs deeper into what we mean when we say
"optimizing compiler", and explains how it is implemented in wazero.
This document is intended for maintainers, researchers, developers and in
general anyone interested in understanding the internals of wazero.

What is an Optimizing Compiler?
-------------------------------

wazero supports an _optimizing_ compiler in the style of other optimizing
compilers such as LLVM's or V8's. Traditionally an optimizing
compiler performs compilation in a number of steps.

Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:

```goat
 Input            +---------------+     +---------------+
 Wasm Binary ---->| DecodeModule  |---->| CompileModule |----> wazero IR
                  +---------------+     +---------------+
```

That is, the module is (1) validated then (2) translated to an Intermediate
Representation (IR). The wazero IR can then be executed directly (in the case
of the interpreter) or it can be further processed and translated into native
code by the compiler. This compiler performs a straightforward translation from
the IR to native code, without any further passes. The wazero IR is not intended
for further processing beyond immediate execution or straightforward
translation.

```goat
        +---- wazero IR ----+
        |                   |
        v                   v
 +--------------+    +--------------+
 |   Compiler   |    |  Interpreter |- - - executable
 +--------------+    +--------------+
        |
   +----+----+
   |         |
   v         v
+---------+ +---------+
|  ARM64  | |  AMD64  |
| Backend | | Backend | - - - - - - - - - executable
+---------+ +---------+
```

Validation and translation to an IR in a compiler are usually called the
**front-end** part of a compiler, while code-generation occurs in what we call
the **back-end** of a compiler. The front-end is the part of a compiler that is
closer to the input, and it generally indicates machine-independent processing,
such as parsing and static validation. The back-end is the part of a compiler
that is closer to the output, and it generally includes machine-specific
procedures, such as code-generation.

In the **optimizing** compiler, we still decode and translate Wasm binaries to
an intermediate representation in the front-end, but we use a textbook
representation called an **SSA** or "Static Single-Assignment Form", that is
intended for further transformation.
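As a tiny, hypothetical illustration of the SSA property (every value is defined exactly once), a source fragment such as `x := 1; x = x + 2; return x` could be rendered, in the notation used by the examples later in these docs, as:

```
v1:i32 = Const 0x1
v2:i32 = Const 0x2
v3:i32 = Iadd v1, v2
Return v3
```

Each reassignment of `x` becomes a fresh value (`v1`, `v3`), which is what makes transformations such as constant folding straightforward to apply.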
The benefit of choosing an IR that is meant for transformation is that a lot of
optimization passes can apply directly to the IR, and thus be
machine-independent. Then the back-end can be relatively simpler, in that it
will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization
  - Block Layout
  - Control Flow Analysis

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
 Input            +-------------------+     +-------------------+
 Wasm Binary ---->|   DecodeModule    |---->|   CompileModule   |--+
                  +-------------------+     +-------------------+  |
   +---------------------------------------------------------------+
   |
   |  +---------------+            +---------------+
   +->|   Front-End   |----------->|   Back-End    |
      +---------------+            +---------------+
              |                            |
              v                            v
             SSA                Instruction Selection
              |                            |
              v                            v
        Optimization              Register Allocation
              |                            |
              v                            v
        Block Layout            Finalization/Encoding
```

Like the other engines, the implementation can be found under `engine`, specifically
in the `wazevo` sub-package. The entry-point is found under `internal/engine/wazevo/engine.go`,
where the implementation of the interface `wasm.Engine` is found.

All the passes can be dumped to the console for debugging, by enabling the build-time
flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled
by default and should only be enabled during debugging. These may also change in the future.

In the following we will assume all paths to be relative to `internal/engine/wazevo`,
so we will omit the prefix.

## Index

- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)
185
site/content/docs/how_the_optimizing_compiler_works/appendix.md
Normal file
@@ -0,0 +1,185 @@
+++
title = "Appendix: Trampolines"
layout = "single"
+++

Trampolines are used to interface between the Go runtime and the generated
code, in two cases:

- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
  (written in Go).

In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.

## Entering the Generated Code

At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using the methods `Call()` or `CallWithStack()`. The
implementation of these methods, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:

```go
func entrypoint(
    preambleExecutable, functionExecutable *byte,
    executionContextPtr uintptr, moduleContextPtr *byte,
    paramResultStackPtr *uint64,
    goAllocatedStackSlicePtr uintptr)
```
- `preambleExecutable` is a pointer to the generated code for the preamble (see
  below).
- `functionExecutable` is a pointer to the generated code for the function (as
  described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
  struct. This struct is used to save the state of the Go runtime before
  entering or leaving the generated code. It also holds shared state between the
  Go runtime and the generated code, such as the exit code that is used to
  terminate execution on failure, or suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
  Its contents are basically pointers to the module instance-specific objects
  as well as functions. This is sometimes called "VMContext" in other Wasm
  runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
  results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
  for holding values and call frames. For further details refer to
  [Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).

The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.
For each given architecture, the trampoline:

- moves the arguments to specific registers to match the behavior of the entry
  preamble or trampoline function, and
- jumps into the execution of the generated code for the preamble.

The **preamble** that `entrypoint` jumps into is generated per function
signature.

This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.

The preamble sets the fields in the `wazevo.executionContext`.

At the beginning of the preamble:

- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
  struct.
- Update the stack pointer to point to `paramResultStackPtr`.

The generated code works in concert with the assumption that the preamble has
been entered through the aforementioned trampoline. Thus, it assumes that the
arguments can be found in some specific registers.

The preamble then assigns the arguments pointed at by `paramResultStackPtr` to
the registers and stack locations that the generated code expects.

Finally, it invokes the generated code for the function.

The epilogue reverses part of the process, finally returning control to the
caller of the `entrypoint()` function, and the Go runtime. The caller of
`entrypoint()` is also responsible for completing the clean-up procedure by
invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
backend-specific ASM), which will restore the stack pointers and return
control to the caller of the function.

The arch-specific code can be found in
`backend/isa/<arch>/abi_entry_preamble.go`.

[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132
[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture
[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture
[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing

## Leaving the Generated Code

In "[How do compiler functions work?][how-do-compiler-functions-work]", we
already outlined how _leaving_ the generated code works with the help of a
function. We will complete the picture here by briefly describing the code that
is generated.

When the generated code needs to return control to the Go runtime, the compiler
inserts a meta-instruction called `exitSequence` in both the `amd64` and
`arm64` backends. This meta-instruction sets the `exitCode` in the
`wazevo.executionContext` struct, restores the stack pointers and then returns
control to the caller of the `entrypoint()` function described above.

As described in "[How do compiler functions
work?][how-do-compiler-functions-work]", the mechanism is essentially the same
when invoking a host function or raising an error. However, when a host
function is invoked, the `exitCode` also indicates the identifier of the host
function to be invoked.

The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
method. This method is actually invoked when host modules are being
instantiated. It generates a trampoline that is used to invoke such functions
from the generated code.

This trampoline implements essentially the same prologue as the `entrypoint()`,
but it also reserves space for the arguments and results of the function to be
invoked.

A host function has the signature:

```go
func(ctx context.Context, stack []uint64)
```

The function arguments in the `stack` parameter are copied over to the reserved
slots of the real stack. For instance, on `arm64` the stack layout would look
as follows (on `amd64` it would be similar):

```goat
                        (high address)
    SP ------> +-----------------+ <----+
               |     .......     |      |
               |      ret Y      |      |
               |     .......     |      |
               |      ret 0      |      |
               |      arg X      |      | size_of_arg_ret
               |     .......     |      |
               |      arg 1      |      |
               |      arg 0      | <----+ <-------- originalArg0Reg
               | size_of_arg_ret |
               |  ReturnAddress  |
               +-----------------+ <----+
               |      xxxx       |      |  ;; might be padded to make it 16-byte aligned.
          +--->|  arg[N]/ret[M]  |      |
 sliceSize|    |   ............  |      | goCallStackSize
          |    |  arg[1]/ret[1]  |      |
          +--->|  arg[0]/ret[0]  | <----+ <-------- arg0ret0AddrReg
               |    sliceSize    |
               |   frame_size    |
               +-----------------+
                        (low address)
```

Finally, the trampoline jumps into the execution of the host function using the
`exitSequence` meta-instruction.

Upon return, the process is reversed.
## Code

- The trampoline to enter the generated function is implemented by the
  `backend.Machine.CompileEntryPreamble()` method.
- The trampoline to return traps and invoke host functions is generated by the
  `backend.Machine.CompileGoFunctionTrampoline()` method.

You can find arch-specific implementations in
`backend/isa/<arch>/abi_go_call.go`,
`backend/isa/<arch>/abi_entry_preamble.go`, etc. The trampolines are found
under `backend/isa/<arch>/abi_entry_<arch>.s`.

## Further References

- Go's [internal ABI documentation][abi-internal] details a calling convention
  similar to the one we use in both the arm64 and amd64 backends.
- Raphael Poss's [The Go low-level calling convention on
  x86-64][go-call-conv-x86] is also an excellent reference for `amd64`.

[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal
[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html
[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background
[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/
507
site/content/docs/how_the_optimizing_compiler_works/backend.md
Normal file
@@ -0,0 +1,507 @@
+++
title = "How the Optimizing Compiler Works: Back-End"
layout = "single"
+++

In this section we will discuss the phases in the back-end of the optimizing
compiler:

- [Instruction Selection](#instruction-selection)
- [Register Allocation](#register-allocation)
- [Finalization and Encoding](#finalization-and-encoding)

Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Please note that, since the implementation of
the back-end is architecture-specific, the code might differ for each
architecture.

### Code

The higher-level entry-point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods on the same type:

- `backend.Compiler.Lower()` (instruction selection)
- `backend.Compiler.RegAlloc()` (register allocation)
- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)
## Instruction Selection

The instruction selection phase is responsible for mapping the higher-level SSA
instructions to arch-specific instructions. Each SSA instruction is translated
to one or more machine instructions.

Each target architecture comes with a different number of registers; some of
them are general purpose, others might be specific to certain instructions. In
general, we can expect to have a set of registers for integer computations,
another set for floating-point computations, a set for vector (SIMD)
computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).

In addition, some registers might be reserved by the Go runtime or the
operating system for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, instructions are usually able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
argument of the instruction is mapped, in the simplest case, to an `operand`
holding a virtual register; in other cases, the `operand` might be mapped to a
memory address, or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.

For instance, consider the following SSA instructions:

```
v4:i32 = Const 0x9
v6:i32 = Load v5, 0x4
v7:i32 = Iadd v6, v4
```

In the `amd64` architecture (in AT&T syntax), the `add` instruction adds the
first operand to the second operand, and assigns the result to the second
operand. So assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to
the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the
`Iadd` instruction on `amd64` might look like this:

```asm
;; AT&T syntax
add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).
### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
    mov x130?, x2
    subs wzr, w130?, #0x0
    b.ge L2
L3 (SSA Block: blk1):
    mov x136?, xzr
    sub w134?, w136?, w130?
    mov x135?, x134?
    b L4
L2 (SSA Block: blk2):
    mov x135?, x130?
L4 (SSA Block: blk3):
    mov x0, x135?
    ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the targets for branching instructions such as `b` and `b.ge`.
### Code

`backend.Machine` is the interface to the backend. It has methods to
translate (lower) the IR to machine code. Again, as seen earlier in the
front-end, the term *lowering* is used to indicate translation from a
higher-level representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>` where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.
## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.

For a given function `f` the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered
if its value is overwritten by the function, and it is not saved by the
callee. This information is used in the finalization phase to determine which
registers need to be saved in the prologue and restored in the epilogue.
Strictly speaking, this last step does not belong to register allocation in a
textbook sense, but it is a necessary step for the finalization phase.
### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4 because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from the live-ins of the successor blocks when a block is
allocated.
### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].

Each basic block is allocated registers in a linear scan order, and the
allocation state is propagated from a given basic block to its successors.
Then, each block continues allocation from that initial state.
#### Merge States

Special care has to be taken when a block has multiple predecessors. We call
this *fixing merge states*: for instance, consider the following:

```goat { width="30%" }
.-----.     .-----.
| BB0 |     | BB1 |
'--+--'     '--+--'
   |           |
   +-----+-----+
         |
         v
      .-----.
      | BB2 |
      '-----'
```

If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1` and both are predecessors of a block `BB2`, then we need
to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via the stack; fixing merge states ensures that registers and the stack are
used consistently to pass values across the involved states.
#### Spilling

If the register allocator cannot find a free register for a given virtual
(live) register, it needs to "spill" the value to the stack to get a free
register, *i.e.,* stash it temporarily to the stack. When that virtual register
is used again later, we will have to insert instructions to reload the value
into a real register.

While the allocation procedure proceeds, it also records all
the virtual registers that transition to the "spilled" state, and inserts the
reload instructions when those registers are reused later.

The spill instructions are actually inserted at the end of register
allocation, after all the allocations and the merge states have been fixed. At
this point, all the other potential sources of instability have been resolved,
and we know where all the reloads happen.

We insert the spills in the block that is the lowest common ancestor of all the
blocks that reload the value.
#### Clobbered Registers

At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of the allocated registers and compares them
to the architecture-specific set `CalleeSavedRegisters`. If a register
has been allocated, and it is present in this set, the register is marked as
"clobbered", i.e., we now know that the register allocator will overwrite
that value. Thus, these values will have to be spilled in the prologue.
#### References

Register allocation is a complex problem, possibly the most complicated
part of the backend. The following references were used to implement the
algorithm:

- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for liveness analysis.
- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go

We suggest referring to them to dive deeper into the topic.
### Example

At the end of the register allocation phase, the basic blocks of our `abs`
function look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x2, x2
	subs wzr, w2, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x8, xzr
	sub w8, w8, w2
	mov x8, x8
	b L4
L2 (SSA Block: blk2):
	mov x8, x2
L4 (SSA Block: blk3):
	mov x0, x8
	ret
```


Notice how the virtual registers have all been replaced by real registers, i.e.,
no register identifier is suffixed with `?`. This example is quite simple, and
it does not require any spill.

### Code

The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
interfaces in `regalloc/api.go`.

Essentially:

- each architecture exposes iteration over basic blocks of a function
  (`regalloc.Function` interface)
- each arch-specific basic block exposes iteration over instructions
  (`regalloc.Block` interface)
- each arch-specific instruction exposes the set of registers it defines and
  uses (`regalloc.Instr` interface)

By defining these interfaces, the register allocation algorithm can assign real
registers to virtual registers without dealing specifically with the target
architecture.

In practice, each interface is usually implemented by instantiating a common
generic struct that already comes with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.

`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among others:

- the set of registers that are available for allocation, excluding, for
  instance, those that might be reserved by the runtime or the OS
  (`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
  (`CalleeSavedRegisters`)

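A minimal sketch of how such interfaces fit together follows. The interface and
method names here are simplified stand-ins (see `regalloc/api.go` for the real
signatures); it only shows the kind of arch-independent traversal the allocator
performs:

```go
package main

import "fmt"

// Instr mirrors the role of regalloc.Instr: an instruction exposes the
// virtual registers it defines and uses. (Simplified signatures.)
type Instr interface {
	Defs() []int
	Uses() []int
}

// Block mirrors the role of regalloc.Block: iteration over instructions.
type Block interface {
	Instrs() []Instr
}

// Toy implementations, for illustration only.
type instr struct{ defs, uses []int }

func (i instr) Defs() []int { return i.defs }
func (i instr) Uses() []int { return i.uses }

type blk struct{ instrs []Instr }

func (b blk) Instrs() []Instr { return b.instrs }

// vregsInBlock collects every virtual register mentioned in a block,
// without knowing anything about the target architecture.
func vregsInBlock(b Block) map[int]bool {
	seen := map[int]bool{}
	for _, i := range b.Instrs() {
		for _, r := range i.Defs() {
			seen[r] = true
		}
		for _, r := range i.Uses() {
			seen[r] = true
		}
	}
	return seen
}

func main() {
	b := blk{instrs: []Instr{
		instr{defs: []int{3}, uses: []int{1, 2}}, // v3 = op v1, v2
		instr{defs: []int{4}, uses: []int{3}},    // v4 = op v3
	}}
	fmt.Println(len(vregsInBlock(b))) // number of distinct virtual registers
}
```
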
### Debug Flags

- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
  allocation procedure.
- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
  allocation result.

## Finalization and Encoding

At the end of the register allocation phase, we have enough information to
finally generate machine code (_encoding_). We are only missing the prologue
and epilogue of the function.

### Prologue and Epilogue

As usual, the **prologue** is executed before the main body of the function,
and the **epilogue** is executed at the return. The prologue is responsible for
setting up the stack frame, and the epilogue is responsible for cleaning up the
stack frame and returning control to the caller.

Generally, this means, at the very least:

- saving the return address
- saving a base pointer to the stack; or, equivalently, the height of the
  stack at the beginning of the function

For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
pointer:

```goat {width="100%" height="250"}
            (high address)                  (high address)
RBP ----> +-----------------+             +-----------------+
          |      `...`      |             |      `...`      |
          |      ret Y      |             |      ret Y      |
          |      `...`      |             |      `...`      |
          |      ret 0      |             |      ret 0      |
          |      arg X      |             |      arg X      |
          |      `...`      |    ====>    |      `...`      |
          |      arg 1      |             |      arg 1      |
          |      arg 0      |             |      arg 0      |
          |   Return Addr   |             |   Return Addr   |
RSP ----> +-----------------+             |   Caller_RBP    |
              (low address)               +-----------------+ <----- RSP, RBP
                                              (low address)
```

While, on `arm64`, there is only a stack pointer `SP`:

```goat {width="100%" height="300"}
          (high address)              (high address)
SP ---> +-----------------+         +------------------+ <----+
        |      `...`      |         |      `...`       |      |
        |      ret Y      |         |      ret Y       |      |
        |      `...`      |         |      `...`       |      |
        |      ret 0      |         |      ret 0       |      |
        |      arg X      |         |      arg X       |      |  size_of_arg_ret.
        |      `...`      |  ====>  |      `...`       |      |
        |      arg 1      |         |      arg 1       |      |
        |      arg 0      |         |      arg 0       | <----+
        +-----------------+         | size_of_arg_ret  |
            (low address)           |  return address  |
                                    +------------------+ <---- SP
                                        (low address)
```

However, the prologue and epilogue might also be responsible for saving and
restoring the state of registers that might be overwritten by the function
("clobbered"); and, if spilling occurs, the prologue and epilogue are also
responsible for reserving and releasing the space for the spilled values.

For clarity, we make a distinction between the space reserved for the clobbered
registers and the space reserved for the spilled values:

- Spill slots are used to temporarily store the values that need spilling as
  determined by the register allocator. This section must have a fixed height,
  but its contents will change over time, as registers are spilled and
  reloaded.
- Clobbered registers are, similarly, determined by the register allocator, but
  they are stashed in the prologue and then restored in the epilogue.

The procedure happens after the register allocation phase because at
this point we have collected enough information to know how much space we need
to reserve, and which registers are clobbered.

Regardless of the architecture, after allocating this space, the stack will
look as follows:

```goat {height="350"}
    (high address)
+-----------------+
|      `...`      |
|      ret Y      |
|      `...`      |
|      ret 0      |
|      arg X      |
|      `...`      |
|      arg 1      |
|      arg 0      |
| (arch-specific) |
+-----------------+
|   clobbered M   |
|   ............  |
|   clobbered 1   |
|   clobbered 0   |
|   spill slot N  |
|   ............  |
|   spill slot 0  |
+-----------------+
    (low address)
```

Note: the prologue might also introduce a check of the stack bounds. If there
is not enough space to allocate the stack frame, the function will exit the
execution and will try to grow the stack from the Go runtime.

The epilogue simply reverses the operations of the prologue.

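The bookkeeping behind this layout can be sketched as a small frame-size
computation. The slot size and helper names below are assumptions for the
sketch, not wazero's actual (arch-specific) code:

```go
package main

import "fmt"

const slotSize = 8 // one 64-bit slot; an assumption for this sketch

// frameSize returns the space the prologue must reserve below the
// arch-specific region: one slot per clobbered register plus the
// fixed-height spill area.
func frameSize(numClobbered, numSpillSlots int) int {
	return (numClobbered + numSpillSlots) * slotSize
}

// spillSlotOffset returns the offset of a spill slot from the stack
// pointer after the prologue: spill slots sit below the clobbered ones.
func spillSlotOffset(slot int) int {
	return slot * slotSize
}

func main() {
	fmt.Println(frameSize(2, 3))     // 2 clobbered regs + 3 spill slots -> 40 bytes
	fmt.Println(spillSlotOffset(1))  // second spill slot -> offset 8
}
```
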
### Other Post-RegAlloc Logic

The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.

For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses registers `rdx`, `rax`).
Instead, they are lowered with ad-hoc logic as part of the implementation of
the `backend.Machine.PostRegAlloc` method.

### Encoding

The final stage of the backend encodes the machine instructions into bytes and
writes them to the target buffer. Before proceeding with the encoding, relative
addresses in branching instructions or addressing modes are resolved.

The procedure encodes the instructions in the order they appear in the
function.

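Relative-address resolution is typically a two-pass affair: first lay out the
byte offset of every instruction, then compute each branch's displacement. The
sketch below is a toy model of that idea, not wazero's actual encoder:

```go
package main

import "fmt"

// inst is a toy machine instruction: its size in bytes and, for branches,
// the index of the target instruction.
type inst struct {
	size   int
	branch bool
	target int
}

// resolve computes each instruction's offset and, for every branch, the
// relative displacement from the end of the branch to its target.
func resolve(code []inst) []int {
	offsets := make([]int, len(code)+1)
	for i, in := range code {
		offsets[i+1] = offsets[i] + in.size
	}
	var disps []int
	for i, in := range code {
		if in.branch {
			disps = append(disps, offsets[in.target]-offsets[i+1])
		}
	}
	return disps
}

func main() {
	code := []inst{
		{size: 4},                          // 0: some op
		{size: 4, branch: true, target: 3}, // 1: branch forward to 3
		{size: 4},                          // 2: some op
		{size: 4},                          // 3: branch target
	}
	fmt.Println(resolve(code)) // [4]: one 4-byte instruction is skipped
}
```
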
### Code

- The prologue and epilogue are set up as part of the
  `backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.

### Debug Flags

- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
  function after the finalization phase.
- `wazevoapi.PrintMachineCodeHexPerFunctionUnmodified` prints a hex
  representation of the generated code as it is.
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code that can be disassembled.

The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.


<hr>

* Previous Section: [Front-End](../frontend/)
* Next Section: [Appendix: Trampolines](../appendix/)

[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
371
site/content/docs/how_the_optimizing_compiler_works/frontend.md
Normal file
@@ -0,0 +1,371 @@

+++
title = "How the Optimizing Compiler Works: Front-End"
layout = "single"
+++

In this section we will discuss the phases in the front-end of the optimizing
compiler:

- [Translation to SSA](#translation-to-ssa)
- [Optimization](#optimization)
- [Block Layout](#block-layout)

Each section includes an explanation of the phase; the subsection **Code**
gives high-level pointers to functions and packages; the subsection
**Debug Flags** indicates the flags that can be used to enable advanced
logging of the phase.

## Translation to SSA

We mentioned earlier that wazero uses an internal representation called an "SSA"
form or "Static Single-Assignment" form, but we never explained what that is.

In short, every program, or, in our case, every Wasm function, can be
translated into a control-flow graph. The control-flow graph is a directed
graph where each node is a sequence of statements that do not contain a
control-flow instruction, called a **basic block**. Control-flow instructions,
instead, are translated into edges.

For instance, take the following implementation of the `abs` function:

```wasm
(module
  (func (;0;) (param i32) (result i32)
    (if (result i32) (i32.lt_s (local.get 0) (i32.const 0))
      (then
        (i32.sub (i32.const 0) (local.get 0)))
      (else
        (local.get 0))
    )
  )
  (export "f" (func 0))
)
```

This is translated to the following block diagram:

```goat {width="100%" height="500"}
 +---------------------------------------------+
 |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) |
 |    v3:i32 = Iconst_32 0x0                   |
 |    v4:i32 = Icmp lt_s, v2, v3               |
 |    Brz v4, blk2                             |
 |    Jump blk1                                |
 +---------------------------------------------+
                        |
                        |
        +---`(v4 != 0)`-+-`(v4 == 0)`---+
        |                               |
        v                               v
 +---------------------------+   +---------------------------+
 |blk1: () <-- (blk0)        |   |blk2: () <-- (blk0)        |
 |    v6:i32 = Iconst_32 0x0 |   |    Jump blk3, v2          |
 |    v7:i32 = Isub v6, v2   |   |                           |
 |    Jump blk3, v7          |   |                           |
 +---------------------------+   +---------------------------+
        |                               |
        |                               |
        +-`{v5 := v7}`---+---`{v5 := v2}`
                         |
                         v
        +------------------------------+
        |blk3: (v5:i32) <-- (blk1,blk2)|
        |    Jump blk_ret, v5          |
        +------------------------------+
                         |
                    {return v5}
                         |
                         v
```

We use the ["block argument" variant of SSA][ssa-blocks], which is also the
representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block
takes a list of arguments. Each block ends with a branching instruction
(Branch, Return, Jump, etc.) with an optional list of arguments; these
arguments are assigned to the target block's arguments as in a function call.

Consider the first block `blk0`:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```


You will notice that, compared to the original function, it takes two extra
parameters (`exec_ctx` and `module_ctx`):

1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit
   the execution in the face of traps or for host function calls.
2. `module_ctx` is a pointer to `wazevo.moduleContextOpaque`. This is used,
   among other things, to access memory.

It then takes one parameter `v2`, corresponding to the function parameter, and
it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result
of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches
to `blk2` if `v4` is zero, otherwise it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly to
the original Wasm opcodes. This is because, similarly to the wazero IR used by
the old compiler, this is a custom IR.

You will also notice that, _on the left-hand side of the assignments_ of any
statement, no name occurs _twice_: this is why this form is called
**single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

```
blk1: ()
	...
	Jump blk3, v7

blk2: ()
	Jump blk3, v2

blk3: (v5:i32)
	...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2`
jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or
`v2`, depending on the originating block. If you are familiar with the
traditional representation of an SSA form, you will recognize that the role of
block arguments is equivalent to the role of the *Phi (Φ) function*, a special
function that returns a different value depending on the incoming edge; e.g.,
in this case: `v5 := Φ(v7, v2)`.

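To make the block-argument semantics concrete, here is a small Go re-enactment
of this CFG (purely illustrative; each "jump" passes the value that becomes
`v5`):

```go
package main

import "fmt"

// abs mirrors the SSA control flow above: blk1 and blk2 both "jump" to
// blk3, each passing the value that becomes v5.
func abs(v2 int32) int32 {
	if v2 < 0 { // v4 = Icmp lt_s, v2, v3
		v7 := 0 - v2 // blk1: v7 = Isub v6, v2
		return blk3(v7)
	}
	return blk3(v2) // blk2: Jump blk3, v2
}

// blk3's parameter v5 plays the role of v5 := Φ(v7, v2).
func blk3(v5 int32) int32 { return v5 }

func main() {
	fmt.Println(abs(-5), abs(7)) // 5 7
}
```
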
### Code

The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a
mapping or a translation, because such transformations usually correspond to
targeting a lower abstraction level.

- Basic blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is
  instantiated in the context of `wasm.Engine.CompileModule()`, more
  specifically in the method `frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
  more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block parameters
  are sometimes referred to as "Phi values".

#### Instructions and Values

An `ssa.Instruction` is a single instruction in the SSA form. Each instruction
might consume zero or more `ssa.Value`s, and it usually produces a single
`ssa.Value`; some instructions may not produce any value (for instance, a
`Jump` instruction). An `ssa.Value` is an abstraction that represents a typed
name binding, and it is used to represent the result of an instruction, or the
input to an instruction.

For instance:

```
blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7
```

`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two
input values (`v6`, `v2`) and produces value `v7`; `Jump` takes one input value
(`v7`) and produces no value. All such values have the `i32` type. The wazero
SSA's type system (`ssa.Type`) allows the following types:

- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector

For simplicity, we don't have a dedicated type for pointers. Instead, unlike
traditional compilers such as LLVM, we use the `i64` type to represent pointer
values, since we only support 64-bit architectures.

Values and instructions are both allocated from pools to minimize memory
allocations.

### Debug Flags

- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between
  Wasm opcodes and SSA instructions to the console.

## Optimization

The SSA form makes it easier to perform a number of optimizations. For
instance, we can perform constant propagation, dead code elimination, and
common subexpression elimination. These optimizations either act upon the
instructions within a basic block, or they act upon the control-flow graph as
a whole.

On a high level, consider the following basic block, derived from the previous
example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v2:i32 = Iconst_32 -5
	v3:i32 = Iconst_32 0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by a
constant `1`, because the comparison is between two constant values (-5, 0).
Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v4:i32 = Iconst_32 1
	Brz v4, blk2
	Jump blk1
```

However, we can now also see that the conditional branch to `blk2` is never
taken, because `v4` is never zero: the block `blk2` is never executed, so even
the branch instruction and the constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	Jump blk1
```

This is a simple example of constant propagation and dead code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
there is no other edge in the graph that points to it; thus it can be removed
from the control-flow graph. This is an example of dead-code elimination that
occurs at the control-flow graph level.

In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also work-in-progress and, at the time of writing, further work is
expected to implement more advanced optimizations.

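The in-block rewrite shown above can be sketched as a tiny folding pass over a
list of toy instructions. The `inst` type and `foldCompares` function are
hypothetical simplifications, not wazero's pass implementation:

```go
package main

import "fmt"

// inst is a toy SSA instruction: either a constant definition or a signed
// less-than comparison of two values.
type inst struct {
	op       string // "iconst" or "icmp_lt_s"
	def      int    // value id this instruction defines
	a, b     int    // operand value ids (for icmp_lt_s)
	constVal int32  // constant payload (for iconst)
}

// foldCompares replaces a comparison whose operands are both known
// constants with a constant 0 or 1: the essence of the rewrite above.
func foldCompares(code []inst) []inst {
	consts := map[int]int32{}
	out := make([]inst, 0, len(code))
	for _, in := range code {
		switch in.op {
		case "iconst":
			consts[in.def] = in.constVal
		case "icmp_lt_s":
			av, aok := consts[in.a]
			bv, bok := consts[in.b]
			if aok && bok {
				v := int32(0)
				if av < bv {
					v = 1
				}
				in = inst{op: "iconst", def: in.def, constVal: v}
				consts[in.def] = v
			}
		}
		out = append(out, in)
	}
	return out
}

func main() {
	code := []inst{
		{op: "iconst", def: 2, constVal: -5}, // v2 = -5
		{op: "iconst", def: 3, constVal: 0},  // v3 = 0
		{op: "icmp_lt_s", def: 4, a: 2, b: 3},
	}
	folded := foldCompares(code)
	fmt.Println(folded[2].op, folded[2].constVal) // the comparison became a constant
}
```
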
### Code

Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
optimization pass is just a function that takes an SSA builder as a parameter.

Passes iterate over the basic blocks and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the
instructions it contains, or it might change the entire shape of the
control-flow graph (e.g. by removing blocks).

Currently, there are two dead-code elimination passes:

- `passDeadBlockEliminationOpt`, acting at the block level.
- `passDeadCodeEliminationOpt`, acting at the instruction level.

Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to
each instruction. This is used to determine whether a sequence of instructions
can be replaced by a single machine instruction during the back-end phase. For
more details, see also the relevant documentation in `ssa/instructions.go`.

There are also simple constant folding passes such as `passNopInstElimination`,
which folds and deletes instructions that are essentially no-ops (e.g. shifting
by a 0 amount).

### Debug Flags

`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after
optimization.

## Block Layout

As we have seen earlier, the SSA form instructions are contained within basic
blocks, and the basic blocks are connected by edges of the control-flow graph.
However, machine code is not laid out in a graph, but is just a linear
sequence of instructions.

Thus, the last step of the front-end is to lay out the basic blocks in a linear
sequence. Because each basic block, by design, ends with a control-flow
instruction, one of the goals of the block layout phase is to maximize the
number of **fall-through opportunities**. A fall-through opportunity occurs
when a block ends with a jump instruction whose target is exactly the next
block in the sequence. In order to maximize the number of fall-through
opportunities, the block layout phase might reorder the basic blocks in the
control-flow graph, and transform the control-flow instructions. For instance,
it might _invert_ some branching conditions.

The end goal is to effectively minimize the number of jumps and branches in
the machine code that will be generated later.

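Branch inversion for fall-through can be sketched like this (a toy model with
made-up names, not wazero's actual pass):

```go
package main

import "fmt"

// branch is a toy two-way terminator: jump to taken if the condition
// holds, otherwise fall to next.
type branch struct {
	cond        string
	taken, next string
}

// invertCond maps a condition to its negation (a subset, for the sketch).
var invertCond = map[string]string{"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt"}

// layoutFix inverts the branch when the taken target is the block that
// will be emitted next, turning the taken edge into a fall-through.
func layoutFix(b branch, nextBlock string) branch {
	if b.taken == nextBlock {
		b.cond = invertCond[b.cond]
		b.taken, b.next = b.next, b.taken
	}
	return b
}

func main() {
	b := branch{cond: "lt", taken: "blk2", next: "blk1"}
	// blk2 is emitted next, so the branch is inverted and blk2 falls through.
	fmt.Println(layoutFix(b, "blk2"))
}
```
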
### Critical Edges

Special attention must be paid when a basic block has multiple predecessors,
i.e., when it has multiple incoming edges. In particular, an edge between two
basic blocks is called a **critical edge** when, at the same time:

- the predecessor has multiple successors **and**
- the successor has multiple predecessors.

For instance, in the example below the edge between `BB0` and `BB3`
is a critical edge.

```goat { width="300" }
┌───────┐        ┌───────┐
│  BB0  │━┓      │  BB1  │
└───────┘ ┃      └───────┘
    │     ┃          │
    ▼     ┃          ▼
┌───────┐ ┃      ┌───────┐
│  BB2  │ ┗━━━━▶ │  BB3  │
└───────┘        └───────┘
```

In these cases the critical edge is split by introducing a new basic block,
called a **trampoline**, where the critical edge was.

```goat { width="300" }
┌───────┐                  ┌───────┐
│  BB0  │──────┐           │  BB1  │
└───────┘      ▼           └───────┘
    │     ┌──────────┐         │
    │     │trampoline│         │
    ▼     └──────────┘         ▼
┌───────┐      │           ┌───────┐
│  BB2  │      └─────────▶ │  BB3  │
└───────┘                  └───────┘
```

For more details on critical edges, see:

- https://en.wikipedia.org/wiki/Control-flow_graph
- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/

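The definition above translates to a few lines of Go. This is a sketch over a
plain adjacency-list CFG (not wazero's representation):

```go
package main

import "fmt"

// criticalEdges returns the edges (pred, succ) where the predecessor has
// multiple successors and the successor has multiple predecessors; these
// are the edges the layout phase splits with a trampoline block.
func criticalEdges(succs map[string][]string) [][2]string {
	preds := map[string]int{}
	for _, ss := range succs {
		for _, s := range ss {
			preds[s]++
		}
	}
	var out [][2]string
	for p, ss := range succs {
		if len(ss) < 2 {
			continue
		}
		for _, s := range ss {
			if preds[s] >= 2 {
				out = append(out, [2]string{p, s})
			}
		}
	}
	return out
}

func main() {
	// The example above: BB0 -> BB2, BB0 -> BB3, BB1 -> BB3.
	g := map[string][]string{"BB0": {"BB2", "BB3"}, "BB1": {"BB3"}}
	fmt.Println(criticalEdges(g)) // [[BB0 BB3]]
}
```
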
### Example

At the end of the block layout phase, the laid-out SSA for the `abs` function
looks as follows:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump fallthrough

blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7

blk2: () <-- (blk0)
	Jump fallthrough, v2

blk3: (v5:i32) <-- (blk1,blk2)
	Jump blk_ret, v5
```

### Code

`passLayoutBlocks` implements the block layout phase.

### Debug Flags

- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after
  block layout.
- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied
  during this phase, such as inverting branching conditions or splitting
  critical edges.

<hr>

* Previous Section: [How the Optimizing Compiler Works](../)
* Next Section: [Back-End](../backend/)

[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes