wazevo(docs): optimizing compiler (#2065)

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
This commit is contained in:
Edoardo Vacchi
2024-03-09 08:39:11 +08:00
committed by GitHub
parent 15cc0c59f3
commit b7b54d5967
5 changed files with 1196 additions and 1 deletions


@@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim
In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT)
during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute
natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by an order of
magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's
[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}).
#### Interpreter


@@ -0,0 +1,131 @@
+++
title = "How the Optimizing Compiler Works"
layout = "single"
+++
wazero supports two modes of execution: interpreter mode and compilation mode.
The interpreter mode is a fallback mode for platforms where compilation is not
supported. Compilation mode is otherwise the default mode of execution: it
translates Wasm modules to native code to get the best run-time performance.
Translating Wasm bytecode into machine code can take multiple forms. wazero
1.0 performs a straightforward translation from a given instruction to a native
instruction. wazero 2.0 introduces an optimizing compiler that is able to
perform nontrivial optimizing transformations, such as constant folding or
dead-code elimination, and it makes better use of the underlying hardware, such
as CPU registers. This document digs deeper into what we mean when we say
"optimizing compiler", and explains how it is implemented in wazero.
This document is intended for maintainers, researchers, developers and in
general anyone interested in understanding the internals of wazero.
What is an Optimizing Compiler?
-------------------------------
wazero supports an _optimizing_ compiler in the style of other optimizing
compilers, such as LLVM's or V8's. Traditionally, an optimizing
compiler performs compilation in a number of steps.
Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:
```goat
Input +---------------+ +---------------+
Wasm Binary ---->| DecodeModule |---->| CompileModule |----> wazero IR
+---------------+ +---------------+
```
That is, the module is (1) validated then (2) translated to an Intermediate
Representation (IR). The wazero IR can then be executed directly (in the case
of the interpreter) or it can be further processed and translated into native
code by the compiler. This compiler performs a straightforward translation from
the IR to native code, without any further passes. The wazero IR is not intended
for further processing beyond immediate execution or straightforward
translation.
```goat
+---- wazero IR ----+
| |
v v
+--------------+ +--------------+
| Compiler | | Interpreter |- - - executable
+--------------+ +--------------+
|
+----------+---------+
| |
v v
+---------+ +---------+
| ARM64 | | AMD64 |
| Backend | | Backend | - - - - - - - - - executable
+---------+ +---------+
```
Validation and translation to an IR in a compiler are usually called the
**front-end** part of a compiler, while code-generation occurs in what we call
the **back-end** of a compiler. The front-end is the part of a compiler that is
closer to the input, and it generally indicates machine-independent processing,
such as parsing and static validation. The back-end is the part of a compiler
that is closer to the output, and it generally includes machine-specific
procedures, such as code-generation.
In the **optimizing** compiler, we still decode and translate Wasm binaries to
an intermediate representation in the front-end, but we use a textbook
representation called an **SSA** or "Static Single-Assignment Form", that is
intended for further transformation.
The benefit of choosing an IR that is meant for transformation is that a lot of
optimization passes can apply directly to the IR, and thus be
machine-independent. Then the back-end can be relatively simpler, in that it
will only have to deal with machine-specific concerns.
The wazero optimizing compiler implements the following compilation passes:
* Front-End:
- Translation to SSA
- Optimization
- Block Layout
- Control Flow Analysis
* Back-End:
- Instruction Selection
- Register Allocation
- Finalization and Encoding
```goat
Input +-------------------+ +-------------------+
Wasm Binary --->| DecodeModule |----->| CompileModule |--+
+-------------------+ +-------------------+ |
+----------------------------------------------------------+
|
| +---------------+ +---------------+
+->| Front-End |----------->| Back-End |
+---------------+ +---------------+
| |
v v
SSA Instruction Selection
| |
v v
Optimization Register Allocation
| |
v v
Block Layout Finalization/Encoding
```
Like the other engines, the implementation can be found under `engine`, specifically
in the `wazevo` sub-package. The entry-point is `internal/engine/wazevo/engine.go`,
where the implementation of the `wasm.Engine` interface lives.
All the passes can be dumped to the console for debugging by enabling the build-time
flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled
by default and should only be enabled during debugging; they may also change in the future.
In the following, we will assume all paths to be relative to `internal/engine/wazevo`,
and omit the prefix.
## Index
- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)


@@ -0,0 +1,185 @@
+++
title = "Appendix: Trampolines"
layout = "single"
+++
Trampolines are used to interface between the Go runtime and the generated
code, in two cases:
- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
(written in Go).
In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.
## Entering the Generated Code
At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using methods `Call()` or `CallWithStack()`. The
implementation of these methods, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:
```go
func entrypoint(
preambleExecutable, functionExecutable *byte,
executionContextPtr uintptr, moduleContextPtr *byte,
paramResultStackPtr *uint64,
goAllocatedStackSlicePtr uintptr)
```
- `preambleExecutable` is a pointer to the generated code for the preamble (see
below).
- `functionExecutable` is a pointer to the generated code for the function (as
described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
struct. This struct is used to save the state of the Go runtime before
entering or leaving the generated code. It also holds shared state between the
Go runtime and the generated code, such as the exit code that is used to
terminate execution on failure, or suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
Its contents are essentially pointers to the module instance, module-specific
objects, and functions. This is sometimes called "VMContext" in other Wasm
runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
for holding values and call frames. For further details refer to
[Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).
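To make the calling convention concrete, here is a hedged sketch of how user
code ultimately supplies the parameter/result slice through the public API;
the module bytes and the exported function name are hypothetical:
```go
package main

import (
	"context"
	"log"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/api"
)

func main() {
	ctx := context.Background()
	r := wazero.NewRuntime(ctx) // the compiler is the default where available
	defer r.Close(ctx)

	var wasmBytes []byte // hypothetical: a module exporting `abs (param i32) (result i32)`
	mod, err := r.Instantiate(ctx, wasmBytes)
	if err != nil {
		log.Fatal(err)
	}

	// Parameters and results share one slice, sized to
	// max(len(params), len(results)); this is the slice that ultimately
	// backs `paramResultStackPtr` when `entrypoint` is invoked.
	stack := []uint64{api.EncodeI32(-5)}
	if err := mod.ExportedFunction("abs").CallWithStack(ctx, stack); err != nil {
		log.Fatal(err)
	}
	log.Printf("abs(-5) = %d", api.DecodeI32(stack[0])) // the result overwrites slot 0
}
```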
The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.
For each given architecture, the trampoline:
- moves the arguments to specific registers, matching the behavior expected by the entry preamble, and
- finally, jumps into the execution of the generated code for the preamble.
The **preamble** that the `entrypoint` function jumps to is generated per function signature.
This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.
The preamble sets the fields in the `wazevo.executionContext`.
At the beginning of the preamble:
- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
struct.
- Update the stack pointer to point to `paramResultStackPtr`.
The generated code works in concert with the assumption that the preamble has
been entered through the aforementioned trampoline. Thus, it assumes that the
arguments can be found in some specific registers.
The preamble then assigns the arguments pointed at by `paramResultStackPtr` to
the registers and stack location that the generated code expects.
Finally, it invokes the generated code for the function.
The epilogue reverses part of the process, finally returning control to the
caller of the `entrypoint()` function, and thus to the Go runtime. The caller of
`entrypoint()` is also responsible for completing the clean-up procedure by
invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
backend-specific ASM), which will restore the stack pointers and return
control to the caller of the function.
The arch-specific code can be found in
`backend/isa/<arch>/abi_entry_preamble.go`.
## Leaving the Generated Code
In "[How do compiler functions work?][how-do-compiler-functions-work]", we
already outlined how _leaving_ the generated code works with the help of a
function. We will complete here the picture by briefly describing the code that
is generated.
When the generated code needs to return control to the Go runtime, it inserts a
meta-instruction that is called `exitSequence` in both `amd64` and `arm64`
backends. This meta-instruction sets the `exitCode` in the
`wazevo.executionContext` struct, restores the stack pointers, and then returns
control to the caller of the `entrypoint()` function described above.
As described in "[How do compiler functions
work?][how-do-compiler-functions-work]", the mechanism is essentially the same
when invoking a host function or raising an error. However, when a host
function is invoked, the `exitCode` also indicates the identifier of the host
function to be invoked.
The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
method. This method is actually invoked when host modules are being
instantiated. It generates a trampoline that is used to invoke such functions
from the generated code.
This trampoline implements essentially the same prologue as the `entrypoint()`,
but it also reserves space for the arguments and results of the function to be
invoked.
A host function has the signature:
```go
func(ctx context.Context, stack []uint64)
```
The function arguments in the `stack` parameter are copied over to the reserved
slots of the real stack. For instance, on `arm64` the stack layout would look
as follows (on `amd64` it would be similar):
```goat
(high address)
SP ------> +-----------------+ <----+
| ....... | |
| ret Y | |
| ....... | |
| ret 0 | |
| arg X | | size_of_arg_ret
| ....... | |
| arg 1 | |
| arg 0 | <----+ <-------- originalArg0Reg
| size_of_arg_ret |
| ReturnAddress |
+-----------------+ <----+
| xxxx | | ;; might be padded to make it 16-byte aligned.
+--->| arg[N]/ret[M] | |
sliceSize| | ............ | | goCallStackSize
| | arg[1]/ret[1] | |
+--->| arg[0]/ret[0] | <----+ <-------- arg0ret0AddrReg
| sliceSize |
| frame_size |
+-----------------+
(low address)
```
Finally, the trampoline jumps into the execution of the host function using the
`exitSequence` meta-instruction.
Upon return, the process is reversed.
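As a reference for the shape of such functions, here is a hedged sketch that
defines and exports a host function through the public API; the module name
`env` and the function name `add_one` are hypothetical:
```go
package main

import (
	"context"
	"log"

	"github.com/tetratelabs/wazero"
	"github.com/tetratelabs/wazero/api"
)

// addOne is a hypothetical host function with one i32 parameter and one i32
// result: parameters arrive in `stack`, and results are written back in
// place. The generated trampoline copies these slots to and from the
// reserved area of the real stack shown above.
func addOne(_ context.Context, stack []uint64) {
	x := api.DecodeI32(stack[0])
	stack[0] = api.EncodeI32(x + 1)
}

func main() {
	ctx := context.Background()
	r := wazero.NewRuntime(ctx)
	defer r.Close(ctx)

	// A Wasm module importing "env"."add_one" reaches addOne through the
	// trampoline generated by backend.Machine.CompileGoFunctionTrampoline().
	if _, err := r.NewHostModuleBuilder("env").
		NewFunctionBuilder().
		WithGoFunction(api.GoFunc(addOne),
			[]api.ValueType{api.ValueTypeI32},
			[]api.ValueType{api.ValueTypeI32}).
		Export("add_one").
		Instantiate(ctx); err != nil {
		log.Fatal(err)
	}
}
```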
## Code
- The trampoline to enter the generated function is implemented by the
`backend.Machine.CompileEntryPreamble()` method.
- The trampoline to return traps and invoke host functions is generated by
the `backend.Machine.CompileGoFunctionTrampoline()` method.
You can find arch-specific implementations in
`backend/isa/<arch>/abi_go_call.go`,
`backend/isa/<arch>/abi_entry_preamble.go`, etc. The trampolines are found
under `backend/isa/<arch>/abi_entry_<arch>.s`.
## Further References
- Go's [internal ABI documentation][abi-internal] details a calling convention similar to the one we use in both the arm64 and amd64 backends.
- Raphael Poss's [The Go low-level calling convention on
x86-64][go-call-conv-x86] is also an excellent reference for `amd64`.
[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal
[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html
[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/


@@ -0,0 +1,507 @@
+++
title = "How the Optimizing Compiler Works: Back-End"
layout = "single"
+++
In this section we will discuss the phases in the back-end of the optimizing
compiler:
- [Instruction Selection](#instruction-selection)
- [Register Allocation](#register-allocation)
- [Finalization and Encoding](#finalization-and-encoding)
Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Note that, since the implementation of the
back-end is architecture-specific, the code differs for each
architecture.
### Code
The higher-level entry-point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods on the same type:
- `backend.Compiler.Lower()` (instruction selection)
- `backend.Compiler.RegAlloc()` (register allocation)
- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)
## Instruction Selection
The instruction selection phase is responsible for mapping the higher-level SSA
instructions to arch-specific instructions. Each SSA instruction is translated
to one or more machine instructions.
Each target architecture comes with a different number of registers: some are
general-purpose, while others are specific to certain instructions. In
general, we can expect to have a set of registers for integer computations,
another set for floating-point computations, a set for vector (SIMD)
computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).
In addition, some registers might be reserved by the Go runtime or the
Operating System for specific purposes, so they should be handled with special
care.
At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.
### Operands and Addressing Modes
As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, usually instructions are able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).
For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.
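As an illustration only (the real arch-specific `operand` types are richer
than this), an operand can be sketched as a small tagged union:
```go
package sketch

// Illustrative sketch, not wazero's actual definitions: an operand is a
// tagged union over the kinds of values a machine instruction can address.
type vreg int // stand-in for regalloc.VReg

type operandKind byte

const (
	operandKindReg operandKind = iota // a virtual register
	operandKindMem                    // a memory address, e.g. [base + offset]
	operandKindImm                    // an immediate encoded in the instruction
)

type operand struct {
	kind operandKind
	reg  vreg   // register for operandKindReg; base register for operandKindMem
	off  int64  // displacement for operandKindMem
	imm  uint64 // constant for operandKindImm
}
```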
During lowering of an `ssa.Instruction`, for each `ssa.Value` that is used as
an argument of the instruction, in the simplest case, the `operand` might be
mapped to a virtual register, in other cases, the `operand` might be mapped to
a memory address, or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.
For instance, consider the following SSA instructions:
```
v4:i32 = Const 0x9
v6:i32 = Load v5, 0x4
v7:i32 = Iadd v6, v4
```
In the `amd64` architecture, the `add` instruction, in AT&T syntax, adds the
first (source) operand to the second (destination) operand, and assigns the
result to the destination. So, assuming that `v4`, `v5`, `v6`, and `v7` are
mapped respectively to the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`,
the lowering of the `Iadd` instruction on `amd64` might look like this:
```asm
;; AT&T syntax
add 4(%r5?), %r4? ;; add the value at memory address [`r5?` + 4] to `r4?`
mov %r4?, %r7?    ;; move the result from `r4?` to `r7?`
```
Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).
### Example
At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):
```asm
L1 (SSA Block: blk0):
mov x130?, x2
subs wzr, w130?, #0x0
b.ge L2
L3 (SSA Block: blk1):
mov x136?, xzr
sub w134?, w136?, w130?
mov x135?, x134?
b L4
L2 (SSA Block: blk2):
mov x135?, x130?
L4 (SSA Block: blk3):
mov x0, x135?
ret
```
Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the target for branching instructions such as `b` and `b.ge`.
### Code
`backend.Machine` is the interface to the backend. It has methods to
translate (lower) the IR to machine code. Again, as seen earlier in the
front-end, the term *lowering* is used to indicate translation from a
higher-level representation to a lower-level representation.
`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>` where `<arch>` is either
`amd64` or `arm64`.
### Debug Flags
`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.
## Register Allocation
The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.
For a given function `f` the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:
- `livenessAnalysis(f)` collects the "liveness" information for each virtual
register. The algorithm is described in [Chapter 9.2 of The SSA
Book][ssa-book].
- `alloc(f)` allocates registers for the given function. The algorithm is
derived from [the Go compiler's
allocator][go-regalloc].
At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered
if its value is overwritten by the function and it is not saved by the
callee. This information is used in the finalization phase to determine which
registers need to be saved in the prologue and restored in the epilogue.
Strictly speaking, this last step does not belong to register allocation in
the textbook sense, but it is a necessary step for the finalization phase.
### Liveness Analysis
Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.
For instance:
```
1| int f(int x) {
2| int y = 2 + x;
3| int z = x + y;
4| return z;
5| }
```
Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on that line; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4, because they are not used anywhere after line 3.
Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).
In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:
- the _live-in_ set for a given basic block is the set of all bindings that are
live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
are live at the exit of that block. A binding is live at the exit of a block
if it is live at the entry of a successor.
Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from live-ins of the successor blocks when a block is
allocated.
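As a hedged, self-contained sketch (not the actual `regalloc` code), live-in
sets can be computed as a backward fixed-point over the control-flow graph:
```go
package sketch

// block is a toy basic block: the names it uses before (re)defining them,
// the names it defines, and its successors in the CFG.
type block struct {
	uses, defs map[string]bool
	succs      []*block
}

// liveIn computes, for every block b:
//   liveIn(b) = uses(b) ∪ (liveOut(b) \ defs(b)),
// where liveOut(b) is the union of liveIn(s) over all successors s,
// iterating until a fixed point is reached.
func liveIn(blocks []*block) map[*block]map[string]bool {
	in := make(map[*block]map[string]bool, len(blocks))
	for _, b := range blocks {
		in[b] = map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		for i := len(blocks) - 1; i >= 0; i-- { // backward order converges faster
			b := blocks[i]
			for _, s := range b.succs {
				for v := range in[s] { // v is live-out of b
					if !b.defs[v] && !in[b][v] {
						in[b][v] = true
						changed = true
					}
				}
			}
			for v := range b.uses {
				if !in[b][v] {
					in[b][v] = true
					changed = true
				}
			}
		}
	}
	return in
}
```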
### Allocation
We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].
Each basic block is allocated registers in a linear scan order, and the
allocation state is propagated from a given basic block to its successors.
Then, each block continues allocation from that initial state.
#### Merge States
Special care has to be taken when a block has multiple predecessors. We call
this *fixing merge states*: for instance, consider the following:
```goat { width="30%" }
.---. .---.
| BB0 | | BB1 |
'-+-' '-+-'
+----+----+
|
v
.---.
| BB2 |
'---'
```
If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1` and both are predecessors of a block `BB2`, then we need
to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via stack; fixing merge states ensures that registers and stack are used
consistently to pass values across the involved states.
#### Spilling
If the register allocator cannot find a free register for a given virtual
(live) register, it needs to "spill" the value to the stack to get a free
register, *i.e.,* stash it temporarily on the stack. When that virtual register is
reused later, we will have to insert instructions to reload the value into a
real register.
As the allocation procedure proceeds, it also records all
the virtual registers that transition to the "spilled" state, and inserts the
reload instructions when those registers are reused later.
The spill instructions are actually inserted at the end of the register
allocation, after all the allocations and the merge states have been fixed. At
this point, all the other potential sources of instability have been resolved,
and we know where all the reloads happen.
We insert the spills in the block that is the lowest common ancestor of all the
blocks that reload the value.
#### Clobbered Registers
At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of allocated registers and compares them
to the architecture-specific set `CalleeSavedRegisters`. If a register
has been allocated and it is present in this set, the register is marked as
"clobbered", i.e., we now know that the register allocator will overwrite
that value. Thus, these values will have to be stashed in the prologue and
restored in the epilogue.
#### References
Register allocation is a complex problem, possibly the most complicated
part of the backend. The following references were used to implement the
algorithm:
- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9. for liveness analysis.
- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
We suggest referring to them to dive deeper into the topic.
### Example
At the end of the register allocation phase, the basic blocks of our `abs`
function look as follows (for `arm64`):
```asm
L1 (SSA Block: blk0):
mov x2, x2
subs wzr, w2, #0x0
b.ge L2
L3 (SSA Block: blk1):
mov x8, xzr
sub w8, w8, w2
mov x8, x8
b L4
L2 (SSA Block: blk2):
mov x8, x2
L4 (SSA Block: blk3):
mov x0, x8
ret
```
Notice how the virtual registers have all been replaced by real registers, i.e.
no register identifier is suffixed with `?`. This example is quite simple, and
it does not require any spills.
### Code
The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
interfaces in `regalloc/api.go`.
Essentially:
- each architecture exposes iteration over basic blocks of a function
(`regalloc.Function` interface)
- each arch-specific basic block exposes iteration over instructions
(`regalloc.Block` interface)
- each arch-specific instruction exposes the set of registers it defines and
uses (`regalloc.Instr` interface)
By defining these interfaces, the register allocation algorithm can assign real
registers to virtual registers without dealing specifically with the target
architecture.
In practice, each interface is usually implemented by instantiating a common
generic struct that comes already with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
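A heavily simplified sketch of this layering might look as follows; this is
illustrative only, and the real interfaces in `regalloc/api.go` carry many
more methods:
```go
package sketch

// Illustrative only: simplified shapes of the interfaces that decouple the
// register allocator from the target ISA.
type VReg int // a virtual (or, once assigned, real) register identifier

type Instr interface {
	Uses() []VReg // registers read by this instruction
	Defs() []VReg // registers written by this instruction
}

type Block interface {
	Instrs() []Instr // instructions in this basic block, in order
}

type Function interface {
	Blocks() []Block // basic blocks, in layout order
}
```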
`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among other things:
- the set of registers that are available for allocation, excluding, for
instance, those that might be reserved by the runtime or the OS
(`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
(`CalleeSavedRegisters`)
### Debug Flags
- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
allocation procedure.
- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
allocation result.
## Finalization and Encoding
At the end of the register allocation phase, we have enough information to
finally generate machine code (_encoding_). We are only missing the prologue
and epilogue of the function.
### Prologue and Epilogue
As usual, the **prologue** is executed before the main body of the function,
and the **epilogue** is executed at the return. The prologue is responsible for
setting up the stack frame, and the epilogue is responsible for cleaning up the
stack frame and returning control to the caller.
Generally, this means, at the very least:
- saving the return address, and
- saving a base pointer to the stack or, equivalently, the height of the stack
at the beginning of the function.
For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
pointer:
```goat {width="100%" height="250"}
(high address) (high address)
RBP ----> +-----------------+ +-----------------+
| `...` | | `...` |
| ret Y | | ret Y |
| `...` | | `...` |
| ret 0 | | ret 0 |
| arg X | | arg X |
| `...` | ====> | `...` |
| arg 1 | | arg 1 |
| arg 0 | | arg 0 |
| Return Addr | | Return Addr |
RSP ----> +-----------------+ | Caller_RBP |
(low address) +-----------------+ <----- RSP, RBP
```
While, on `arm64`, there is only a stack pointer `SP`:
```goat {width="100%" height="300"}
(high address) (high address)
SP ---> +-----------------+ +------------------+ <----+
| `...` | | `...` | |
| ret Y | | ret Y | |
| `...` | | `...` | |
| ret 0 | | ret 0 | |
| arg X | | arg X | | size_of_arg_ret.
| `...` | ====> | `...` | |
| arg 1 | | arg 1 | |
| arg 0 | | arg 0 | <----+
+-----------------+ | size_of_arg_ret |
| return address |
+------------------+ <---- SP
(low address) (low address)
```
However, the prologue and epilogue might also be responsible for saving and
restoring the state of registers that might be overwritten by the function
("clobbered"); and, if spilling occurs, prologue and epilogue are also
responsible for reserving and releasing the space for the spilled values.
For clarity, we make a distinction between the space reserved for the clobbered
registers and the space reserved for the spilled values:
- Spill slots are used to temporarily store the values that need spilling as
determined by the register allocator. This section must have a fixed height,
but its contents will change over time, as registers are spilled and
reloaded.
- Clobbered registers are, similarly, determined by the register allocator, but
they are stashed in the prologue and then restored in the epilogue.
The procedure happens after the register allocation phase because at
this point we have collected enough information to know how much space we need
to reserve, and which registers are clobbered.
Regardless of the architecture, after allocating this space, the stack will
look as follows:
```goat {height="350"}
(high address)
+-----------------+
| `...` |
| ret Y |
| `...` |
| ret 0 |
| arg X |
| `...` |
| arg 1 |
| arg 0 |
| (arch-specific) |
+-----------------+
| clobbered M |
| ............ |
| clobbered 1 |
| clobbered 0 |
| spill slot N |
| ............ |
| spill slot 0 |
+-----------------+
(low address)
```
Note: the prologue might also introduce a check of the stack bounds. If there
is insufficient space to allocate the stack frame, the function exits the
execution and tries to grow the stack from the Go runtime.
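In pseudo-Go, the bounds check can be sketched as follows; the real check is
emitted as machine instructions, and the names below are hypothetical:
```go
package sketch

// stackOK is a hedged sketch of the prologue's stack-bounds check;
// `stackBottom` stands for a hypothetical limit held in the execution
// context.
func stackOK(sp, frameSize, stackBottom uintptr) bool {
	if sp < stackBottom+frameSize {
		// Not enough room: the generated code exits with a dedicated exit
		// code so that the Go runtime can grow the stack and retry.
		return false
	}
	return true
}
```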
The epilogue simply reverses the operations of the prologue.
### Other Post-RegAlloc Logic
The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.
For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses registers `rdx`, `rax`).
Instead, they are lowered with ad-hoc logic as part of the implementation of
the `backend.Machine.PostRegAlloc` method.
### Encoding
The final stage of the backend encodes the machine instructions into bytes and
writes them to the target buffer. Before proceeding with the encoding, relative
addresses in branching instructions or addressing modes are resolved.
The procedure encodes the instructions in the order they appear in the
function.
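A generic two-pass scheme for this resolution (illustrative, not wazero's
exact code) first assigns every label its final byte offset, then patches
each branch with a relative displacement:
```go
package sketch

// instr is a toy encoded instruction: its size in bytes, an optional label
// it starts, and an optional branch target label.
type instr struct {
	size   int
	label  string // non-empty if this instruction starts a labeled block
	target string // non-empty if this instruction branches to a label
	rel    int    // resolved displacement, filled in by resolve
}

// resolve lays out offsets in a first pass, then rewrites each branch
// target as a displacement relative to the end of the branch instruction.
func resolve(prog []instr) {
	offsets := map[string]int{}
	off := 0
	for _, in := range prog { // pass 1: assign label offsets
		if in.label != "" {
			offsets[in.label] = off
		}
		off += in.size
	}
	off = 0
	for i := range prog { // pass 2: patch branches
		off += prog[i].size
		if prog[i].target != "" {
			prog[i].rel = offsets[prog[i].target] - off
		}
	}
}
```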
### Code
- The prologue and epilogue are set up as part of the
`backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.
### Debug Flags
- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
function after the finalization phase.
- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex
representation of the function generated code as it is.
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
representation of the function generated code that can be disassembled.
The reason for the distinction between the last two flags is that, in some
cases, the generated code might not be disassemblable.
The `PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.
<hr>
* Previous Section: [Front-End](../frontend/)
* Next Section: [Appendix: Trampolines](../appendix/)
[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go


@@ -0,0 +1,371 @@
+++
title = "How the Optimizing Compiler Works: Front-End"
layout = "single"
+++
In this section we will discuss the phases in the front-end of the optimizing compiler:
- [Translation to SSA](#translation-to-ssa)
- [Optimization](#optimization)
- [Block Layout](#block-layout)
Each section includes an explanation of the phase; the **Code** subsection
gives high-level pointers to functions and packages; the **Debug Flags**
subsection lists the flags that can be used to enable advanced logging of the phase.
## Translation to SSA
We mentioned earlier that wazero uses an internal representation called an "SSA"
form or "Static Single-Assignment" form, but we never explained what that is.
In short, every program, or, in our case, every Wasm function, can be
translated into a control-flow graph: a directed graph where each node is a
sequence of statements free of control-flow instructions, called a **basic
block**; control-flow instructions are instead translated into edges.
For instance, take the following implementation of the `abs` function:
```wasm
(module
(func (;0;) (param i32) (result i32)
(if (result i32) (i32.lt_s (local.get 0) (i32.const 0))
(then
(i32.sub (i32.const 0) (local.get 0)))
(else
(local.get 0))
)
)
(export "f" (func 0))
)
```
This is translated to the following block diagram:
```goat {width="100%" height="500"}
+---------------------------------------------+
|blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) |
| v3:i32 = Iconst_32 0x0 |
| v4:i32 = Icmp lt_s, v2, v3 |
| Brz v4, blk2 |
| Jump blk1 |
+---------------------------------------------+
|
|
+---`(v4 != 0)`-+-`(v4 == 0)`---+
| |
v v
+---------------------------+ +---------------------------+
|blk1: () <-- (blk0) | |blk2: () <-- (blk0) |
| v6:i32 = Iconst_32 0x0 | | Jump blk3, v2 |
| v7:i32 = Isub v6, v2 | | |
| Jump blk3, v7 | | |
+---------------------------+ +---------------------------+
| |
| |
+-`{v5 := v7}`--+--`{v5 := v2}`-+
|
v
+------------------------------+
|blk3: (v5:i32) <-- (blk1,blk2)|
| Jump blk_ret, v5 |
+------------------------------+
|
{return v5}
|
v
```
We use the ["block argument" variant of SSA][ssa-blocks], which is also the same
representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block
takes a list of arguments. Each block ends with a branching instruction (Branch, Return,
Jump, etc...) with an optional list of arguments; these arguments are assigned
to the target block's arguments like a function.
Consider the first block `blk0`.
```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
v3:i32 = Iconst_32 0x0
v4:i32 = Icmp lt_s, v2, v3
Brz v4, blk2
Jump blk1
```
You will notice that, compared to the original function, it takes two extra
parameters (`exec_ctx` and `module_ctx`):
1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit the execution
in the face of traps or for host function calls.
2. `module_ctx`: pointer to `wazevo.moduleContextOpaque`. This is used, among other things,
to access memory.
It then takes one parameter `v2`, corresponding to the function parameter, and
it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result of
comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches to
`blk2` if `v4` is zero, otherwise it jumps to `blk1`.
You might also have noticed that the instructions do not correspond strictly to
the original Wasm opcodes. This is because, similarly to the wazero IR used by
the old compiler, this is a custom IR.
You will also notice that, _on the left-hand side of the assignments_,
no name occurs _twice_: this is why this form is called **single-assignment**.
Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.
```
blk1: ()
...
Jump blk3, v7
blk2: ()
Jump blk3, v2
blk3: (v5:i32)
...
```
`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2` jumps
to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or `v2`,
depending on the originating block. If you are familiar with the traditional
representation of an SSA form, you will recognize that the role of block
arguments is equivalent to the role of the *Phi (Φ) function*, a special
function that returns a different value depending on the incoming edge; e.g., in
this case: `v5 := Φ(v7, v2)`.
### Code
The relevant APIs can be found under sub-package `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a mapping or a translation,
because such transformations usually correspond to targeting a lower abstraction level.
- Basic Blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is instantiated
in the context of `wasm.Engine.CompileModule()`, more specifically in the method
`frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block parameters
are sometimes referred to as "Phi values".
#### Instructions and Values
An `ssa.Instruction` is a single instruction in the SSA form. Each instruction might
consume zero or more `ssa.Value`s, and it usually produces a single `ssa.Value`; some
instructions may not produce any value (for instance, a `Jump` instruction).
An `ssa.Value` is an abstraction that represents a typed name binding, and it is used
to represent the result of an instruction, or the input to an instruction.
For instance:
```
blk1: () <-- (blk0)
v6:i32 = Iconst_32 0x0
v7:i32 = Isub v6, v2
Jump blk3, v7
```
`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two input values (`v6`, `v2`)
and produces value `v7`; `Jump` takes one input value (`v7`) and produces no value. All
such values have the `i32` type. The wazero SSA's type system (`ssa.Type`) allows the following types:
- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector
For simplicity, we don't have a dedicated type for pointers. Instead, we use
the `i64` type to represent pointer values, since, unlike traditional
compilers such as LLVM, we only support 64-bit architectures.
Values and instructions are both allocated from pools to minimize memory allocations.
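The pooling pattern can be sketched as follows; this is a generic
illustration, not the actual `ssa` package internals:
```go
package sketch

const chunkSize = 128

type instruction struct{ opcode, arg1, arg2 int32 }

// pool hands out instructions from fixed-size chunks, so pointers stay
// valid as the pool grows; reset recycles the memory between compilations.
type pool struct {
	chunks [][]instruction
	n      int // entries used in the last chunk
}

func (p *pool) allocate() *instruction {
	if len(p.chunks) == 0 || p.n == chunkSize {
		p.chunks = append(p.chunks, make([]instruction, chunkSize))
		p.n = 0
	}
	i := &p.chunks[len(p.chunks)-1][p.n]
	*i = instruction{} // hand out zeroed memory
	p.n++
	return i
}

func (p *pool) reset() {
	if len(p.chunks) > 0 {
		p.chunks, p.n = p.chunks[:1], 0
	}
}
```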
### Debug Flags
- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between Wasm
opcodes and SSA instructions to the console.
## Optimization
The SSA form makes it easier to perform a number of optimizations. For instance,
we can perform constant propagation, dead code elimination, and common
subexpression elimination. These optimizations either act upon the instructions
within a basic block, or they act upon the control-flow graph as a whole.
On a high level, consider the following basic block, derived from the previous
example:
```
blk0: (exec_ctx:i64, module_ctx:i64)
v2:i32 = Iconst_32 -5
v3:i32 = Iconst_32 0
v4:i32 = Icmp lt_s, v2, v3
Brz v4, blk2
Jump blk1
```
It is pretty easy to see that the comparison in `v4` can be replaced by a
constant `1`, because the comparison is between two constant values (-5, 0).
Therefore, the block can be rewritten as such:
```
blk0: (exec_ctx:i64, module_ctx:i64)
v4:i32 = Iconst_32 1
Brz v4, blk2
Jump blk1
```
However, we can now also see that, because `v4` is always non-zero, the
conditional branch `Brz v4, blk2` is never taken, so block `blk2` is never
executed, and both the branch instruction and the constant definition `v4`
can be removed:
```
blk0: (exec_ctx:i64, module_ctx:i64)
Jump blk1
```
This is a simple example of constant propagation and dead code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
there is no other edge in the graph that points to it; thus, it can be removed
from the control-flow graph. This is an example of dead-code elimination that
occurs at the control-flow graph level.
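As a hedged illustration of the folding step shown above, over a toy
instruction form rather than the actual `ssa` types:
```go
package sketch

// instr is a toy three-address instruction: either a constant, or an
// operation over previously defined instructions.
type instr struct {
	op       string // "const", "icmp_lt_s", ...
	a, b     *instr // operands; nil for constants
	constVal int32  // valid when op == "const"
}

// fold rewrites an instruction in place when all its operands are
// constants, e.g. `Icmp lt_s` over two constants becomes `Iconst 0|1`.
func fold(i *instr) {
	if i.op == "icmp_lt_s" && i.a.op == "const" && i.b.op == "const" {
		v := int32(0)
		if i.a.constVal < i.b.constVal {
			v = 1
		}
		*i = instr{op: "const", constVal: v}
	}
}
```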
In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also work-in-progress and, at the time of writing, further work is
expected to implement more advanced optimizations.
### Code
Optimization passes are implemented by `ssa.Builder.RunPasses()`. An optimization
pass is just a function that takes an `ssa.Builder` as a parameter.
Passes iterate over the basic blocks, and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the instructions
it contains, or it might change the entire shape of the control-flow graph (e.g. by removing
blocks).
Currently, there are two dead-code elimination passes:
- `passDeadBlockEliminationOpt` acting at the block-level.
- `passDeadCodeEliminationOpt` acting at instruction-level.
Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to each
instruction. This is used to determine whether a sequence of instructions can be
replaced by a single machine instruction during the back-end phase. For more details,
see also the relevant documentation in `ssa/instructions.go`.
There are also simple constant-folding passes, such as `passNopInstElimination`, which
folds and deletes instructions that are essentially no-ops (e.g. shifting by a 0 amount).
### Debug Flags
`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after optimization.
## Block Layout
As we have seen earlier, the SSA form instructions are contained within basic
blocks, and the basic blocks are connected by edges of the control-flow graph.
However, machine code is not laid out in a graph, but it is just a linear
sequence of instructions.
Thus, the last step of the front-end is to lay out the basic blocks in a linear
sequence. Because each basic block, by design, ends with a control-flow
instruction, one of the goals of the block layout phase is to maximize the number of
**fall-through opportunities**. A fall-through opportunity occurs when a block ends
with a jump instruction whose target is exactly the next block in the
sequence. In order to maximize the number of fall-through opportunities, the
block layout phase might reorder the basic blocks in the control-flow graph,
and transform the control-flow instructions. For instance, it might _invert_
some branching conditions.
The end goal is to effectively minimize the number of jumps and branches in
the machine code that will be generated later.
### Critical Edges
Special attention must be taken when a basic block has multiple predecessors,
i.e., when it has multiple incoming edges. In particular, an edge between two
basic blocks is called a **critical edge** when, at the same time:
- the predecessor has multiple successors **and**
- the successor has multiple predecessors.
For instance, in the example below the edge between `BB0` and `BB3`
is a critical edge.
```goat { width="300" }
┌───────┐ ┌───────┐
│ BB0 │━┓ │ BB1 │
└───────┘ ┃ └───────┘
│ ┃ │
▼ ┃ ▼
┌───────┐ ┃ ┌───────┐
│ BB2 │ ┗━▶│ BB3 │
└───────┘ └───────┘
```
In these cases the critical edge is split by introducing a new basic block,
called a **trampoline**, where the critical edge was.
```goat { width="300" }
┌───────┐ ┌───────┐
│ BB0 │──────┐ │ BB1 │
└───────┘ ▼ └───────┘
│ ┌──────────┐ │
│ │trampoline│ │
▼ └──────────┘ ▼
┌───────┐ │ ┌───────┐
│ BB2 │ └────▶│ BB3 │
└───────┘ └───────┘
```
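Detecting and splitting such an edge is mechanical once predecessor and
successor counts are known; here is a hedged sketch:
```go
package sketch

type blk struct {
	succs, preds []*blk
}

// isCriticalEdge reports whether the edge p→s is critical: p has multiple
// successors and s has multiple predecessors.
func isCriticalEdge(p, s *blk) bool {
	return len(p.succs) > 1 && len(s.preds) > 1
}

// splitCriticalEdge rewires p→s as p→t→s through a new trampoline block t.
func splitCriticalEdge(p, s *blk) *blk {
	t := &blk{succs: []*blk{s}, preds: []*blk{p}}
	for i, x := range p.succs {
		if x == s {
			p.succs[i] = t
		}
	}
	for i, x := range s.preds {
		if x == p {
			s.preds[i] = t
		}
	}
	return t
}
```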
For more details on critical edges, see:
- https://en.wikipedia.org/wiki/Control-flow_graph
- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/
### Example
At the end of the block layout phase, the laid out SSA for the `abs` function
looks as follows:
```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
v3:i32 = Iconst_32 0x0
v4:i32 = Icmp lt_s, v2, v3
Brz v4, blk2
Jump fallthrough
blk1: () <-- (blk0)
v6:i32 = Iconst_32 0x0
v7:i32 = Isub v6, v2
Jump blk3, v7
blk2: () <-- (blk0)
Jump fallthrough, v2
blk3: (v5:i32) <-- (blk1,blk2)
Jump blk_ret, v5
```
### Code
`passLayoutBlocks` implements the block layout phase.
### Debug Flags
- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after block layout.
- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied during this phase,
such as inverting branching conditions or splitting critical edges.
<hr>
* Previous Section: [How the Optimizing Compiler Works](../)
* Next Section: [Back-End](../backend/)
[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes