wazevo(docs): optimizing compiler (#2065)

Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>

@@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim
 In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT)
 during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute
 natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by order of
-magnitude (10x) or more, and therefore enabled by default whenever available.
+magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's
+[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}).
 
 #### Interpreter

131 site/content/docs/how_the_optimizing_compiler_works/_index.md Normal file

@@ -0,0 +1,131 @@

+++
title = "How the Optimizing Compiler Works"
layout = "single"
+++

wazero supports two modes of execution: interpreter mode and compilation mode.
The interpreter mode is a fallback mode for platforms where compilation is not
supported. Compilation mode is otherwise the default mode of execution: it
translates Wasm modules to native code to get the best run-time performance.

Translating Wasm bytecode into machine code can take multiple forms. wazero
1.0 performs a straightforward translation from a given instruction to a native
instruction. wazero 2.0 introduces an optimizing compiler that is able to
perform nontrivial optimizing transformations, such as constant folding or
dead-code elimination, and it makes better use of the underlying hardware, such
as CPU registers. This document digs deeper into what we mean when we say
"optimizing compiler", and explains how it is implemented in wazero.

This document is intended for maintainers, researchers, developers, and in
general anyone interested in understanding the internals of wazero.

What is an Optimizing Compiler?
-------------------------------

wazero supports an _optimizing_ compiler in the style of other optimizing
compilers, such as LLVM's or V8's. Traditionally, an optimizing
compiler performs compilation in a number of steps.

Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:

```goat
 Input          +---------------+     +---------------+
 Wasm Binary -->| DecodeModule  |---->| CompileModule |----> wazero IR
                +---------------+     +---------------+
```
|
||||
That is, the module is (1) validated then (2) translated to an Intermediate
|
||||
Representation (IR). The wazero IR can then be executed directly (in the case
|
||||
of the interpreter) or it can be further processed and translated into native
|
||||
code by the compiler. This compiler performs a straightforward translation from
|
||||
the IR to native code, without any further passes. The wazero IR is not intended
|
||||
for further processing beyond immediate execution or straightforward
|
||||
translation.
|
||||
|
||||
```goat
|
||||
+---- wazero IR ----+
|
||||
| |
|
||||
v v
|
||||
+--------------+ +--------------+
|
||||
| Compiler | | Interpreter |- - - executable
|
||||
+--------------+ +--------------+
|
||||
|
|
||||
+----------+---------+
|
||||
| |
|
||||
v v
|
||||
+---------+ +---------+
|
||||
| ARM64 | | AMD64 |
|
||||
| Backend | | Backend | - - - - - - - - - executable
|
||||
+---------+ +---------+
|
||||
```
|
||||
|
||||
|
||||
Validation and translation to an IR in a compiler are usually called the
|
||||
**front-end** part of a compiler, while code-generation occurs in what we call
|
||||
the **back-end** of a compiler. The front-end is the part of a compiler that is
|
||||
closer to the input, and it generally indicates machine-independent processing,
|
||||
such as parsing and static validation. The back-end is the part of a compiler
|
||||
that is closer to the output, and it generally includes machine-specific
|
||||
procedures, such as code-generation.
|
||||
|
||||
In the **optimizing** compiler, we still decode and translate Wasm binaries to
|
||||
an intermediate representation in the front-end, but we use a textbook
|
||||
representation called an **SSA** or "Static Single-Assignment Form", that is
|
||||
intended for further transformation.
|
||||
|
||||
The benefit of choosing an IR that is meant for transformation is that a lot of
|
||||
optimization passes can apply directly to the IR, and thus be
|
||||
machine-independent. Then the back-end can be relatively simpler, in that it
|
||||
will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization
  - Block Layout
  - Control Flow Analysis

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
 Input           +-------------------+     +-------------------+
 Wasm Binary --->|   DecodeModule    |---->|   CompileModule   |--+
                 +-------------------+     +-------------------+  |
  +---------------------------------------------------------------+
  |
  |  +---------------+               +---------------+
  +->|   Front-End   |-------------->|    Back-End   |
     +---------------+               +---------------+
             |                               |
             v                               v
            SSA                   Instruction Selection
             |                               |
             v                               v
        Optimization                Register Allocation
             |                               |
             v                               v
        Block Layout               Finalization/Encoding
```

Like the other engines, the implementation can be found under `engine`, specifically
in the `wazevo` sub-package. The entry-point is found under `internal/engine/wazevo/engine.go`,
where the implementation of the interface `wasm.Engine` is found.

All the passes can be dumped to the console for debugging by enabling the build-time
flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled
by default and should only be enabled during debugging. These may also change in the future.

In the following we will assume all paths to be relative to `internal/engine/wazevo`,
so we will omit the prefix.

## Index

- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)

185 site/content/docs/how_the_optimizing_compiler_works/appendix.md Normal file

@@ -0,0 +1,185 @@

+++
title = "Appendix: Trampolines"
layout = "single"
+++

Trampolines are used to interface between the Go runtime and the generated
code, in two cases:

- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
  (written in Go).

In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.

## Entering the Generated Code

At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using methods `Call()` or `CallWithStack()`. The
implementation of this method, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:

```go
func entrypoint(
	preambleExecutable, functionExecutable *byte,
	executionContextPtr uintptr, moduleContextPtr *byte,
	paramResultStackPtr *uint64,
	goAllocatedStackSlicePtr uintptr)
```

- `preambleExecutable` is a pointer to the generated code for the preamble (see
  below).
- `functionExecutable` is a pointer to the generated code for the function (as
  described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
  struct. This struct is used to save the state of the Go runtime before
  entering or leaving the generated code. It also holds shared state between the
  Go runtime and the generated code, such as the exit code that is used to
  terminate execution on failure, or suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
  Its contents are basically the pointers to the module instance-specific
  objects, as well as functions. This is sometimes called "VMContext" in
  other Wasm runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
  results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
  for holding values and call frames. For further details refer to
  [Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).

The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.

For each given architecture, the trampoline:

- moves the arguments to specific registers to match the behavior of the entry preamble or trampoline function, and
- finally, it jumps into the execution of the generated code for the preamble.

The **preamble** that is jumped to from the `entrypoint` function is generated per function signature.

This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.

The preamble sets the fields in the `wazevo.executionContext`.

At the beginning of the preamble:

- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
  struct.
- Update the stack pointer to point to `paramResultStackPtr`.

The generated code works in concert with the assumption that the preamble has
been entered through the aforementioned trampoline. Thus, it assumes that the
arguments can be found in some specific registers.

The preamble then assigns the arguments pointed at by `paramResultStackPtr` to
the registers and stack location that the generated code expects.

Finally, it invokes the generated code for the function.

The epilogue reverses part of the process, finally returning control to the
caller of the `entrypoint()` function, and the Go runtime. The caller of
`entrypoint()` is also responsible for completing the clean-up procedure by
invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
backend-specific ASM), which will restore the stack pointers and return
control to the caller of the function.

The arch-specific code can be found in
`backend/isa/<arch>/abi_entry_preamble.go`.

[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132
[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture
[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture
[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing

## Leaving the Generated Code

In "[How do compiler functions work?][how-do-compiler-functions-work]", we
already outlined how _leaving_ the generated code works with the help of a
function. We will complete the picture here by briefly describing the code that
is generated.

When the generated code needs to return control to the Go runtime, it inserts a
meta-instruction that is called `exitSequence` in both the `amd64` and `arm64`
backends. This meta-instruction sets the `exitCode` in the
`wazevo.executionContext` struct, restores the stack pointers, and then returns
control to the caller of the `entrypoint()` function described above.

As described in "[How do compiler functions
work?][how-do-compiler-functions-work]", the mechanism is essentially the same
when invoking a host function or raising an error. However, when a host
function is invoked, the `exitCode` also indicates the identifier of the host
function to be invoked.

The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
method. This method is actually invoked when host modules are being
instantiated. It generates a trampoline that is used to invoke such functions
from the generated code.

This trampoline implements essentially the same prologue as the `entrypoint()`,
but it also reserves space for the arguments and results of the function to be
invoked.

A host function has the signature:

```go
func(ctx context.Context, stack []uint64)
```

The function arguments in the `stack` parameter are copied over to the reserved
slots of the real stack. For instance, on `arm64` the stack layout would look
as follows (on `amd64` it would be similar):

```goat
                    (high address)
     SP ------> +-----------------+ <----+
                |     .......     |      |
                |      ret Y      |      |
                |     .......     |      |
                |      ret 0      |      |
                |      arg X      |      | size_of_arg_ret
                |     .......     |      |
                |      arg 1      |      |
                |      arg 0      | <----+ <-------- originalArg0Reg
                | size_of_arg_ret |
                |  ReturnAddress  |
                +-----------------+ <----+
                |      xxxx       |      |  ;; might be padded to make it 16-byte aligned.
           +--->|  arg[N]/ret[M]  |      |
           |    |   ............  |      | goCallStackSize
 sliceSize |    |  arg[1]/ret[1]  |      |
           +--->|  arg[0]/ret[0]  | <----+ <-------- arg0ret0AddrReg
                |    sliceSize    |
                |   frame_size    |
                +-----------------+
                    (low address)
```

Finally, the trampoline jumps into the execution of the host function using the
`exitSequence` meta-instruction.

Upon return, the process is reversed.

## Code

- The trampoline to enter the generated function is implemented by the
  `backend.Machine.CompileEntryPreamble()` method.
- The trampoline to return traps and invoke host functions is generated by
  the `backend.Machine.CompileGoFunctionTrampoline()` method.

You can find arch-specific implementations in
`backend/isa/<arch>/abi_go_call.go`,
`backend/isa/<arch>/abi_entry_preamble.go`, etc. The trampolines are found
under `backend/isa/<arch>/abi_entry_<arch>.s`.

## Further References

- Go's [internal ABI documentation][abi-internal] details the calling
  convention similar to the one we use in both the arm64 and amd64 backends.
- Raphael Poss's [The Go low-level calling convention on
  x86-64][go-call-conv-x86] is also an excellent reference for `amd64`.

[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal
[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html
[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background
[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/

507 site/content/docs/how_the_optimizing_compiler_works/backend.md Normal file

@@ -0,0 +1,507 @@

+++
title = "How the Optimizing Compiler Works: Back-End"
layout = "single"
+++

In this section we will discuss the phases in the back-end of the optimizing
compiler:

- [Instruction Selection](#instruction-selection)
- [Register Allocation](#register-allocation)
- [Finalization and Encoding](#finalization-and-encoding)

Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Note that, since the implementation of
the back-end is architecture-specific, the code might be different for each
architecture.

### Code

The higher-level entry-point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods in the same type:

- `backend.Compiler.Lower()` (instruction selection)
- `backend.Compiler.RegAlloc()` (register allocation)
- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)

## Instruction Selection

The instruction selection phase is responsible for mapping the higher-level SSA
instructions to arch-specific instructions. Each SSA instruction is translated
to one or more machine instructions.

Each target architecture comes with a different number of registers: some of
them are general-purpose, others might be specific to certain instructions. In
general, we can expect to have a set of registers for integer computations,
another set for floating-point computations, a set for vector (SIMD)
computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).

In addition, some registers might be reserved by the Go runtime or the
Operating System for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, instructions are usually able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
argument of the instruction is mapped to an `operand`: in the simplest case,
the `operand` is a virtual register; in other cases, it might be a memory
address or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.

For instance, consider the following SSA instructions:

```
v4:i32 = Const 0x9
v6:i32 = Load v5, 0x4
v7:i32 = Iadd v6, v4
```

In the `amd64` architecture, the `add` instruction adds the first (source)
operand to the second (destination) operand, and assigns the result to the
destination operand. So, assuming that `v4`, `v5`, `v6`, and `v7` are mapped
respectively to the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the
lowering of the `Iadd` instruction on `amd64` might look like this:

```asm
;; AT&T syntax
add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).

### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x130?, x2
	subs wzr, w130?, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x136?, xzr
	sub w134?, w136?, w130?
	mov x135?, x134?
	b L4
L2 (SSA Block: blk2):
	mov x135?, x130?
L4 (SSA Block: blk3):
	mov x0, x135?
	ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the target for branching instructions such as `b` and `b.ge`.

### Code

`backend.Machine` is the interface to the backend. It has methods to
translate (lower) the IR to machine code. Again, as seen earlier in the
front-end, the term *lowering* is used to indicate translation from a
higher-level representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>`, where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.

## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.

For a given function `f`, the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's
  allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered
if its value is overwritten by the function, and it is not saved by the
callee. This information is used in the finalization phase to determine which
registers need to be saved in the prologue and restored in the epilogue. This
last step is not strictly related to register allocation in the textbook
meaning, but it is a necessary step for the finalization phase.

### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4, because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from the live-ins of the successor blocks when a block is
allocated.

### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].

Each basic block is allocated registers in a linear scan order, and the
allocation state is propagated from a given basic block to its successors.
Then, each block continues allocation from that initial state.

#### Merge States

Special care has to be taken when a block has multiple predecessors. We call
this *fixing merge states*: for instance, consider the following:

```goat { width="30%" }
 .---.     .---.
| BB0 |   | BB1 |
 '-+-'     '-+-'
   +----+----+
        |
        v
      .---.
     | BB2 |
      '---'
```

If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1`, and both are predecessors of a block `BB2`, then we need
to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via the stack; fixing merge states ensures that registers and stack are used
consistently to pass values across the involved states.

#### Spilling

If the register allocator cannot find a free register for a given virtual
(live) register, it needs to "spill" the value to the stack to get a free
register, *i.e.,* stash it temporarily to the stack. When that virtual register
is reused later, we will have to insert instructions to reload the value into a
real register.

While the procedure proceeds with allocation, it also records all
the virtual registers that transition to the "spilled" state, and inserts the
reload instructions when those registers are reused later.

The spill instructions are actually inserted at the end of register
allocation, after all the allocations and the merge states have been fixed. At
this point, all the other potential sources of instability have been resolved,
and we know where all the reloads happen.

We insert the spills in the block that is the lowest common ancestor of all the
blocks that reload the value.

#### Clobbered Registers

At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of the allocated registers and compares them
against the architecture-specific `CalleeSavedRegisters` set. If a register
has been allocated, and it is present in this set, the register is marked as
"clobbered", i.e., we now know that the register allocator will overwrite
that value. Thus, these values will have to be spilled in the prologue.

#### References

Register allocation is a complex problem, possibly the most complicated
part of the backend. The following references were used to implement the
algorithm:

- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for liveness analysis.
- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go

We suggest referring to them to dive deeper into the topic.

### Example

At the end of the register allocation phase, the basic blocks of our `abs`
function look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x2, x2
	subs wzr, w2, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x8, xzr
	sub w8, w8, w2
	mov x8, x8
	b L4
L2 (SSA Block: blk2):
	mov x8, x2
L4 (SSA Block: blk3):
	mov x0, x8
	ret
```

Notice how the virtual registers have all been replaced by real registers,
i.e. no register identifier is suffixed with `?`. This example is quite
simple, and it does not require any spills.

### Code

The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
interfaces in `regalloc/api.go`.

Essentially:

- each architecture exposes iteration over the basic blocks of a function
  (`regalloc.Function` interface)
- each arch-specific basic block exposes iteration over instructions
  (`regalloc.Block` interface)
- each arch-specific instruction exposes the set of registers it defines and
  uses (`regalloc.Instr` interface)

By defining these interfaces, the register allocation algorithm can assign real
registers to virtual registers without dealing specifically with the target
architecture.

In practice, each interface is usually implemented by instantiating a common
generic struct that comes already with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.

`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among other things:

- the set of registers that are available for allocation, excluding, for
  instance, those that might be reserved by the runtime or the OS
  (`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
  (`CalleeSavedRegisters`)

### Debug Flags

- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
  allocation procedure.
- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
  allocation result.
## Finalization and Encoding

At the end of the register allocation phase, we have enough information to
finally generate machine code (_encoding_). We are only missing the prologue
and epilogue of the function.

### Prologue and Epilogue

As usual, the **prologue** is executed before the main body of the function,
and the **epilogue** is executed upon return. The prologue is responsible for
setting up the stack frame, and the epilogue is responsible for cleaning up
the stack frame and returning control to the caller.

Generally, this means, at the very least:

- saving the return address
- saving a base pointer to the stack or, equivalently, the height of the
  stack at the beginning of the function

For instance, on `amd64`, `RBP` is the base pointer and `RSP` is the stack
pointer:

```goat {width="100%" height="250"}
             (high address)                   (high address)
RBP ----> +-----------------+             +-----------------+
          |      `...`      |             |      `...`      |
          |      ret Y      |             |      ret Y      |
          |      `...`      |             |      `...`      |
          |      ret 0      |             |      ret 0      |
          |      arg X      |             |      arg X      |
          |      `...`      |    ====>    |      `...`      |
          |      arg 1      |             |      arg 1      |
          |      arg 0      |             |      arg 0      |
          |   Return Addr   |             |   Return Addr   |
RSP ----> +-----------------+             |   Caller_RBP    |
             (low address)                +-----------------+ <----- RSP, RBP
```

On `arm64`, by contrast, there is only a stack pointer, `SP`:

```goat {width="100%" height="300"}
           (high address)              (high address)
SP ---> +-----------------+       +------------------+ <----+
        |      `...`      |       |      `...`       |      |
        |      ret Y      |       |      ret Y       |      |
        |      `...`      |       |      `...`       |      |
        |      ret 0      |       |      ret 0       |      |
        |      arg X      |       |      arg X       |      | size_of_arg_ret.
        |      `...`      | ====> |      `...`       |      |
        |      arg 1      |       |      arg 1       |      |
        |      arg 0      |       |      arg 0       | <----+
        +-----------------+       | size_of_arg_ret  |
                                  |  return address  |
                                  +------------------+ <---- SP
           (low address)              (low address)
```

However, the prologue and epilogue might also be responsible for saving and
restoring the state of registers that might be overwritten by the function
("clobbered"); and, if spilling occurs, the prologue and epilogue are also
responsible for reserving and releasing the space for the spilled values.

For clarity, we make a distinction between the space reserved for the
clobbered registers and the space reserved for the spilled values:

- Spill slots are used to temporarily store the values that need spilling as
  determined by the register allocator. This section must have a fixed
  height, but its contents will change over time, as registers are spilled
  and reloaded.
- Clobbered registers are, similarly, determined by the register allocator,
  but they are stashed in the prologue and then restored in the epilogue.

This procedure happens after the register allocation phase because only at
this point have we collected enough information to know how much space we
need to reserve, and which registers are clobbered.
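
As a rough sketch (not wazero's actual code), the space to reserve can be
derived from the allocator's results as follows, assuming 8-byte slots and the
16-byte stack alignment that common 64-bit ABIs require:

```go
package main

import "fmt"

// frameSpace computes the stack space to reserve below the arch-specific
// area: one slot per clobbered callee-saved register plus the spill slots
// requested by the register allocator, rounded up to 16-byte alignment.
func frameSpace(clobberedRegs, spillSlots int) int {
	const slotSize = 8 // bytes per slot on a 64-bit target
	n := (clobberedRegs + spillSlots) * slotSize
	return (n + 15) &^ 15 // round up to a multiple of 16
}

func main() {
	// e.g. 3 clobbered registers and 2 spill slots:
	fmt.Println(frameSpace(3, 2)) // (3+2)*8 = 40 bytes, aligned up to 48
}
```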

Regardless of the architecture, after allocating this space, the stack will
look as follows:

```goat {height="350"}
    (high address)
+-----------------+
|      `...`      |
|      ret Y      |
|      `...`      |
|      ret 0      |
|      arg X      |
|      `...`      |
|      arg 1      |
|      arg 0      |
| (arch-specific) |
+-----------------+
|   clobbered M   |
|   ............  |
|   clobbered 1   |
|   clobbered 0   |
|   spill slot N  |
|   ............  |
|   spill slot 0  |
+-----------------+
    (low address)
```

Note: the prologue might also introduce a check of the stack bounds. If there
is not enough space to allocate the stack frame, the function exits the
execution and the stack is grown from the Go runtime.

The epilogue simply reverses the operations of the prologue.

### Other Post-RegAlloc Logic

The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.

For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses the registers `rdx` and
`rax`). Instead, they are lowered with ad-hoc logic as part of the
implementation of the `backend.Machine.PostRegAlloc` method.

### Encoding

The final stage of the backend encodes the machine instructions into bytes
and writes them to the target buffer. Before proceeding with the encoding,
relative addresses in branching instructions or addressing modes are
resolved.

The procedure encodes the instructions in the order they appear in the
function.
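
The resolution step can be sketched as a classic two-pass scheme. The
instruction model below is hypothetical, not wazero's:

```go
package main

import "fmt"

// inst is a toy machine instruction: a first pass records the byte offset
// of every label, and a second pass turns branch targets into relative
// displacements.
type inst struct {
	size   int    // encoded size in bytes
	label  string // non-empty if this instruction defines a label
	target string // non-empty if this is a branch to a label
	rel    int    // resolved relative displacement
}

func resolve(prog []inst) {
	offsets := map[string]int{}
	off := 0
	for _, in := range prog { // pass 1: record label offsets
		if in.label != "" {
			offsets[in.label] = off
		}
		off += in.size
	}
	off = 0
	for i, in := range prog { // pass 2: patch branch displacements
		if in.target != "" {
			// displacement relative to the end of the branch instruction
			prog[i].rel = offsets[in.target] - (off + in.size)
		}
		off += in.size
	}
}

func main() {
	prog := []inst{
		{size: 4, target: "L1"}, // branch forward over one instruction
		{size: 4},
		{size: 4, label: "L1"},
	}
	resolve(prog)
	fmt.Println(prog[0].rel) // 4: skip one 4-byte instruction
}
```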

### Code

- The prologue and epilogue are set up as part of the
  `backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.

### Debug Flags

- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
  function after the finalization phase.
- `wazevoapi.PrintMachineCodeHexPerFunctionUnmodified` prints a hex
  representation of the generated code for a function, as it is.
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code for a function that can be
  disassembled.

The reason for the distinction between the last two flags is that the
generated code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.

<hr>

* Previous Section: [Front-End](../frontend/)
* Next Section: [Appendix: Trampolines](../appendix/)

[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
371 site/content/docs/how_the_optimizing_compiler_works/frontend.md Normal file
@@ -0,0 +1,371 @@
+++
title = "How the Optimizing Compiler Works: Front-End"
layout = "single"
+++

In this section we will discuss the phases in the front-end of the optimizing
compiler:

- [Translation to SSA](#translation-to-ssa)
- [Optimization](#optimization)
- [Block Layout](#block-layout)

Every section includes an explanation of the phase; the subsection **Code**
includes high-level pointers to functions and packages; the subsection
**Debug Flags** indicates the flags that can be used to enable advanced
logging of the phase.

## Translation to SSA

We mentioned earlier that wazero uses an internal representation called an
"SSA" form or "Static Single-Assignment" form, but we never explained what
that is.

In short, every program, or, in our case, every Wasm function, can be
translated into a control-flow graph. The control-flow graph is a directed
graph where each node is a sequence of statements that do not contain a
control-flow instruction, called a **basic block**. Control-flow
instructions, instead, are translated into edges.

For instance, take the following implementation of the `abs` function:

```wasm
(module
  (func (;0;) (param i32) (result i32)
    (if (result i32) (i32.lt_s (local.get 0) (i32.const 0))
      (then
        (i32.sub (i32.const 0) (local.get 0)))
      (else
        (local.get 0))
    )
  )
  (export "f" (func 0))
)
```

This is translated to the following block diagram:

```goat {width="100%" height="500"}
          +---------------------------------------------+
          |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) |
          |    v3:i32 = Iconst_32 0x0                   |
          |    v4:i32 = Icmp lt_s, v2, v3               |
          |    Brz v4, blk2                             |
          |    Jump blk1                                |
          +---------------------------------------------+
                              |
                              |
            +---`(v4 != 0)`---+---`(v4 == 0)`---+
            |                                   |
            v                                   v
+---------------------------+       +---------------------------+
|blk1: () <-- (blk0)        |       |blk2: () <-- (blk0)        |
|    v6:i32 = Iconst_32 0x0 |       |    Jump blk3, v2          |
|    v7:i32 = Isub v6, v2   |       |                           |
|    Jump blk3, v7          |       |                           |
+---------------------------+       +---------------------------+
            |                                   |
            |                                   |
            +--`{v5 := v7}`---+---`{v5 := v2}`--+
                              |
                              v
            +------------------------------+
            |blk3: (v5:i32) <-- (blk1,blk2)|
            |    Jump blk_ret, v5          |
            +------------------------------+
                              |
                         {return v5}
                              |
                              v
```

We use the ["block argument" variant of SSA][ssa-blocks], which is also the
representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block
takes a list of arguments. Each block ends with a branching instruction
(Branch, Return, Jump, etc.) with an optional list of arguments; these
arguments are assigned to the target block's arguments, like a function call.

Consider the first block `blk0`:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```

You will notice that, compared to the original function, it takes two extra
parameters (`exec_ctx` and `module_ctx`):

1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit
   the execution in the face of traps or for host function calls.
2. `module_ctx` is a pointer to `wazevo.moduleContextOpaque`. This is used,
   among other things, to access memory.

It then takes one parameter `v2`, corresponding to the function parameter,
and it defines two variables, `v3` and `v4`. `v3` is the constant 0 and `v4`
is the result of comparing `v2` to `v3` using the `i32.lt_s` instruction.
Then, it branches to `blk2` if `v4` is zero; otherwise, it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly
to the original Wasm opcodes. This is because, similarly to the wazero IR
used by the old compiler, this is a custom IR.

You will also notice that, _on the left-hand side of the assignments_ of any
statement, no name occurs _twice_: this is why this form is called
**single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block,
`blk3`:

```
blk1: ()
	...
	Jump blk3, v7

blk2: ()
	Jump blk3, v2

blk3: (v5:i32)
	...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7`, and `blk2`
jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or
`v2`, depending on the originating block. If you are familiar with the
traditional representation of an SSA form, you will recognize that the role
of block arguments is equivalent to the role of the *Phi (Φ) function*, a
special function that returns a different value depending on the incoming
edge; e.g., in this case: `v5 := Φ(v7, v2)`.
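
To see the block-argument semantics concretely, here is a sketch in Go where
each jump-with-arguments is modeled as a call that binds the target block's
parameter. The names mirror the example; the encoding is illustrative, not
how the compiler actually executes blocks:

```go
package main

import "fmt"

// abs hand-encodes the CFG of the example above.
func abs(v2 int32) int32 {
	// blk0
	v3 := int32(0)
	v4 := v2 < v3
	if !v4 { // Brz v4, blk2: branch when v4 is zero
		// blk2: Jump blk3, v2
		return blk3(v2)
	}
	// blk1
	v6 := int32(0)
	v7 := v6 - v2
	return blk3(v7) // Jump blk3, v7
}

// blk3's parameter v5 plays the role of the phi: v5 := Φ(v7, v2).
func blk3(v5 int32) int32 { return v5 }

func main() {
	fmt.Println(abs(-5), abs(7)) // 5 7
}
```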

### Code

The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a
mapping or a translation, because such transformations usually correspond to
targeting a lower abstraction level.

- Basic blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is
  instantiated in the context of `wasm.Engine.CompileModule()`, more
  specifically in the method `frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
  more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block
  parameters are sometimes referred to as "Phi values".

#### Instructions and Values

An `ssa.Instruction` is a single instruction in the SSA form. Each
instruction might consume zero or more `ssa.Value`s, and it usually produces
a single `ssa.Value`; some instructions may not produce any value (for
instance, a `Jump` instruction). An `ssa.Value` is an abstraction that
represents a typed name binding, and it is used to represent the result of
an instruction, or the input to an instruction.

For instance:

```
blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7
```

`Iconst_32` takes no input value and produces the value `v6`; `Isub` takes
two input values (`v6`, `v2`) and produces the value `v7`; `Jump` takes one
input value (`v7`) and produces no value. All such values have the `i32`
type. The wazero SSA's type system (`ssa.Type`) allows the following types:

- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector

For simplicity, we don't have a dedicated type for pointers. Instead, we use
the `i64` type to represent pointer values, since, unlike traditional
compilers such as LLVM, we only support 64-bit architectures.

Values and instructions are both allocated from pools to minimize memory
allocations.
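
A sketch of the pooling idea (wazero's actual pools differ in detail): nodes
are handed out from fixed-size pages, so previously returned pointers stay
valid as the pool grows, and the whole pool is reset between compilations:

```go
package main

import "fmt"

// Instruction is a stand-in for an IR node.
type Instruction struct {
	opcode string
}

const pageSize = 64

// instrPool hands out Instructions from fixed-size pages. Using pages
// (rather than one growing slice) keeps previously returned pointers valid
// when the pool grows.
type instrPool struct {
	pages [][]Instruction
	next  int // index of the next free slot in the last page
}

func (p *instrPool) Allocate() *Instruction {
	if len(p.pages) == 0 || p.next == pageSize {
		p.pages = append(p.pages, make([]Instruction, pageSize))
		p.next = 0
	}
	in := &p.pages[len(p.pages)-1][p.next]
	p.next++
	return in
}

// Reset makes the pool reusable for the next function without freeing the
// first page, so its memory is recycled.
func (p *instrPool) Reset() {
	if len(p.pages) > 0 {
		p.pages = p.pages[:1]
	}
	p.next = 0
}

func main() {
	var p instrPool
	first := p.Allocate()
	first.opcode = "Iconst_32"
	for i := 0; i < 100; i++ {
		p.Allocate() // grows by whole pages; `first` remains valid
	}
	fmt.Println(first.opcode, len(p.pages)) // Iconst_32 2
}
```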

### Debug Flags

- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps the progress of the translation
  between Wasm opcodes and SSA instructions to the console.

## Optimization

The SSA form makes it easier to perform a number of optimizations. For
instance, we can perform constant propagation, dead-code elimination, and
common subexpression elimination. These optimizations either act upon the
instructions within a basic block, or they act upon the control-flow graph
as a whole.

At a high level, consider the following basic block, derived from the
previous example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v2:i32 = Iconst_32 -5
	v3:i32 = Iconst_32 0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by the
constant `1`, because it is a comparison between two constant values (-5, 0).
Therefore, the block can be rewritten as follows:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v4:i32 = Iconst_32 1
	Brz v4, blk2
	Jump blk1
```

However, we can now also see that, since `v4` is the nonzero constant 1, the
`Brz` (branch-if-zero) is never taken and `blk2` is never entered from here,
so both the branch instruction and the constant definition `v4` can be
removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	Jump blk1
```

This is a simple example of constant propagation and dead-code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
no other edge in the graph points to it; thus, it can be removed from the
control-flow graph. This is an example of dead-code elimination that occurs
at the control-flow graph level.
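
The constant-folding step above can be sketched as follows. The tiny IR here
is illustrative, not wazero's `ssa` package:

```go
package main

import "fmt"

// value is a toy SSA value that may be a known constant.
type value struct {
	isConst bool
	c       int32
}

// foldIcmpLtS folds a signed less-than comparison when both operands are
// constants; otherwise it reports that folding is not possible.
func foldIcmpLtS(a, b value) (value, bool) {
	if !a.isConst || !b.isConst {
		return value{}, false
	}
	if a.c < b.c {
		return value{isConst: true, c: 1}, true
	}
	return value{isConst: true, c: 0}, true
}

func main() {
	v2 := value{isConst: true, c: -5}
	v3 := value{isConst: true, c: 0}
	if v4, ok := foldIcmpLtS(v2, v3); ok {
		fmt.Println(v4.c) // 1: the branch condition is known at compile time
	}
}
```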

In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also a work in progress and, at the time of writing, further work
is expected to implement more advanced optimizations.

### Code

Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
optimization pass is just a function that takes an SSA builder as a
parameter.

Passes iterate over the basic blocks and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the
instructions it contains, or it might change the entire shape of the
control-flow graph (e.g. by removing blocks).

Currently, there are two dead-code elimination passes:

- `passDeadBlockEliminationOpt`, acting at the block level.
- `passDeadCodeEliminationOpt`, acting at the instruction level.

Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID`
to each instruction. This is used to determine whether a sequence of
instructions can be replaced by a single machine instruction during the
back-end phase. For more details, see also the relevant documentation in
`ssa/instructions.go`.

There are also simple constant-folding passes such as
`passNopInstElimination`, which folds and deletes instructions that are
essentially no-ops (e.g. shifting by an amount of 0).
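
As a sketch of that kind of pass (with a hypothetical IR, not wazero's actual
types): a shift whose amount is a constant 0 is a no-op, and every use of its
result can be rewritten to use the shifted value directly:

```go
package main

import "fmt"

// irInstr is a toy IR instruction.
type irInstr struct {
	op       string // "Ishl", "Iconst", ...
	operands []int  // value IDs
	constVal int64  // meaningful when op == "Iconst"
}

// nopShift reports whether in is `x << 0`, i.e. its shift-amount operand is
// defined by a constant zero, and returns the value ID that replaces it.
func nopShift(in irInstr, defs map[int]irInstr) (replacement int, ok bool) {
	if in.op != "Ishl" || len(in.operands) != 2 {
		return 0, false
	}
	amt, isDef := defs[in.operands[1]]
	if isDef && amt.op == "Iconst" && amt.constVal == 0 {
		return in.operands[0], true
	}
	return 0, false
}

func main() {
	defs := map[int]irInstr{
		10: {op: "Iconst", constVal: 0}, // v10 = Iconst 0
	}
	shift := irInstr{op: "Ishl", operands: []int{7, 10}} // v7 << v10
	if v, ok := nopShift(shift, defs); ok {
		fmt.Println(v) // 7: uses of the shift are rewritten to use v7
	}
}
```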

### Debug Flags

- `wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after
  optimization.

## Block Layout

As we have seen earlier, the SSA-form instructions are contained within
basic blocks, and the basic blocks are connected by the edges of the
control-flow graph. However, machine code is not laid out as a graph: it is
just a linear sequence of instructions.

Thus, the last step of the front-end is to lay out the basic blocks in a
linear sequence. Because each basic block, by design, ends with a
control-flow instruction, one of the goals of the block layout phase is to
maximize the number of **fall-through opportunities**. A fall-through
opportunity occurs when a block ends with a jump instruction whose target is
exactly the next block in the sequence. In order to maximize the number of
fall-through opportunities, the block layout phase might reorder the basic
blocks in the control-flow graph and transform the control-flow
instructions. For instance, it might _invert_ some branching conditions.

The end goal is to effectively minimize the number of jumps and branches in
the machine code that will be generated later.
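
The metric being optimized can be sketched as follows: given a linear order
of blocks, count the jumps that land exactly on the next block. This is a
simplified model with only unconditional jumps; real blocks also end with
conditional branches:

```go
package main

import "fmt"

// blk is a toy basic block ending in an unconditional jump.
type blk struct {
	id     int
	target int // jump target block ID (-1 for return)
}

// fallThroughs counts jumps whose target is exactly the next block in the
// order; each such jump can be dropped from the generated code.
func fallThroughs(order []blk) int {
	n := 0
	for i := 0; i < len(order)-1; i++ {
		if order[i].target == order[i+1].id {
			n++
		}
	}
	return n
}

func main() {
	// blk0 -> blk1 -> blk3, and blk2 -> blk3, in two different layouts:
	a := []blk{{0, 1}, {2, 3}, {1, 3}, {3, -1}}
	b := []blk{{0, 1}, {1, 3}, {2, 3}, {3, -1}}
	fmt.Println(fallThroughs(a), fallThroughs(b)) // 1 2: order b is better
}
```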

### Critical Edges

Special attention must be paid when a basic block has multiple predecessors,
i.e., when it has multiple incoming edges. In particular, an edge between
two basic blocks is called a **critical edge** when, at the same time:

- the predecessor has multiple successors, **and**
- the successor has multiple predecessors.

For instance, in the example below, the edge between `BB0` and `BB3` is a
critical edge:

```goat { width="300" }
┌───────┐      ┌───────┐
│  BB0  │━┓    │  BB1  │
└───────┘ ┃    └───────┘
    │     ┃        │
    ▼     ┃        ▼
┌───────┐ ┃    ┌───────┐
│  BB2  │ ┗━━▶ │  BB3  │
└───────┘      └───────┘
```

In these cases, the critical edge is split by introducing a new basic block,
called a **trampoline**, where the critical edge was:

```goat { width="300" }
┌───────┐          ┌───────┐
│  BB0  │─────┐    │  BB1  │
└───────┘     ▼    └───────┘
    │   ┌──────────┐   │
    │   │trampoline│   │
    ▼   └──────────┘   ▼
┌───────┐     │    ┌───────┐
│  BB2  │     └──▶ │  BB3  │
└───────┘          └───────┘
```
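
Critical-edge detection itself is a simple graph check; here is a sketch over
the example graph above (the graph representation is illustrative):

```go
package main

import "fmt"

// criticalEdges returns every edge u->v where u has more than one successor
// and v has more than one predecessor.
func criticalEdges(succs map[string][]string) [][2]string {
	preds := map[string]int{}
	for _, vs := range succs {
		for _, v := range vs {
			preds[v]++
		}
	}
	var out [][2]string
	for u, vs := range succs {
		if len(vs) < 2 {
			continue // u has a single successor: none of its edges qualify
		}
		for _, v := range vs {
			if preds[v] >= 2 {
				out = append(out, [2]string{u, v})
			}
		}
	}
	return out
}

func main() {
	succs := map[string][]string{
		"BB0": {"BB2", "BB3"},
		"BB1": {"BB3"},
	}
	fmt.Println(criticalEdges(succs)) // [[BB0 BB3]]
}
```

Each edge this function returns is where the layout phase would insert a
trampoline block.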

For more details on critical edges, see:

- https://en.wikipedia.org/wiki/Control-flow_graph
- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/

### Example

At the end of the block layout phase, the laid-out SSA for the `abs`
function looks as follows:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump fallthrough

blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7

blk2: () <-- (blk0)
	Jump fallthrough, v2

blk3: (v5:i32) <-- (blk1,blk2)
	Jump blk_ret, v5
```

### Code

`passLayoutBlocks` implements the block layout phase.

### Debug Flags

- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after
  block layout.
- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied
  during this phase, such as inverting branching conditions or splitting
  critical edges.

<hr>

* Previous Section: [How the Optimizing Compiler Works](../)
* Next Section: [Back-End](../backend/)

[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes