wazevo(docs): optimizing compiler (#2065)
Signed-off-by: Edoardo Vacchi <evacchi@users.noreply.github.com>
@@ -143,7 +143,8 @@ Notably, the interpreter and compiler in wazero's [Runtime configuration][Runtim
 In wazero, a compiler is a runtime configured to compile modules to platform-specific machine code ahead of time (AOT)
 during the creation of [CompiledModule][CompiledModule]. This means your WebAssembly functions execute
 natively at runtime of the embedding Go program. Compiler is faster than Interpreter, often by order of
-magnitude (10x) or more, and therefore enabled by default whenever available.
+magnitude (10x) or more, and therefore enabled by default whenever available. You can read more about wazero's
+[optimizing compiler in the detailed documentation]({{< relref "/how_the_optimizing_compiler_works" >}}).
 
 #### Interpreter
 
131
site/content/docs/how_the_optimizing_compiler_works/_index.md
Normal file
@@ -0,0 +1,131 @@
+++
title = "How the Optimizing Compiler Works"
layout = "single"
+++

wazero supports two modes of execution: interpreter mode and compilation mode.
The interpreter mode is a fallback mode for platforms where compilation is not
supported. Compilation mode is otherwise the default mode of execution: it
translates Wasm modules to native code to get the best run-time performance.

Translating Wasm bytecode into machine code can take multiple forms. wazero
1.0 performs a straightforward translation from a given instruction to a native
instruction. wazero 2.0 introduces an optimizing compiler that is able to
perform nontrivial optimizing transformations, such as constant folding or
dead-code elimination, and it makes better use of the underlying hardware, such
as CPU registers. This document digs deeper into what we mean when we say
"optimizing compiler", and explains how it is implemented in wazero.
This document is intended for maintainers, researchers, developers and in
general anyone interested in understanding the internals of wazero.

What is an Optimizing Compiler?
-------------------------------

wazero supports an _optimizing_ compiler in the style of other optimizing
compilers such as LLVM's or V8's. Traditionally an optimizing
compiler performs compilation in a number of steps.

Compare this to the **old compiler**, where compilation happens in one step or
two, depending on how you count:

```goat
 Input            +---------------+     +---------------+
 Wasm Binary ---->| DecodeModule  |---->| CompileModule |----> wazero IR
                  +---------------+     +---------------+
```

That is, the module is (1) validated then (2) translated to an Intermediate
Representation (IR). The wazero IR can then be executed directly (in the case
of the interpreter) or it can be further processed and translated into native
code by the compiler. This compiler performs a straightforward translation from
the IR to native code, without any further passes. The wazero IR is not intended
for further processing beyond immediate execution or straightforward
translation.

```goat
        +---- wazero IR ----+
        |                   |
        v                   v
 +--------------+    +--------------+
 |   Compiler   |    |  Interpreter |- - - executable
 +--------------+    +--------------+
        |
   +----+----+
   |         |
   v         v
+---------+ +---------+
|  ARM64  | |  AMD64  |
| Backend | | Backend | - - - - - - - - - executable
+---------+ +---------+
```

Validation and translation to an IR in a compiler are usually called the
**front-end** part of a compiler, while code-generation occurs in what we call
the **back-end** of a compiler. The front-end is the part of a compiler that is
closer to the input, and it generally indicates machine-independent processing,
such as parsing and static validation. The back-end is the part of a compiler
that is closer to the output, and it generally includes machine-specific
procedures, such as code-generation.

In the **optimizing** compiler, we still decode and translate Wasm binaries to
an intermediate representation in the front-end, but we use a textbook
representation called an **SSA** or "Static Single-Assignment Form", that is
intended for further transformation.
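As a tiny, hypothetical illustration of the SSA property (every value is defined exactly once), a source fragment such as `x := 1; x = x + 2; return x` could be rendered, in the notation used by the examples later in these docs, as:

```
v1:i32 = Const 0x1
v2:i32 = Const 0x2
v3:i32 = Iadd v1, v2
Return v3
```

Each reassignment of `x` becomes a fresh value (`v1`, `v3`), which is what makes transformations such as constant folding straightforward to apply.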
The benefit of choosing an IR that is meant for transformation is that a lot of
optimization passes can apply directly to the IR, and thus be
machine-independent. Then the back-end can be relatively simpler, in that it
will only have to deal with machine-specific concerns.

The wazero optimizing compiler implements the following compilation passes:

* Front-End:
  - Translation to SSA
  - Optimization
  - Block Layout
  - Control Flow Analysis

* Back-End:
  - Instruction Selection
  - Register Allocation
  - Finalization and Encoding

```goat
 Input            +-------------------+     +-------------------+
 Wasm Binary ---->|   DecodeModule    |---->|   CompileModule   |--+
                  +-------------------+     +-------------------+  |
   +---------------------------------------------------------------+
   |
   |  +---------------+            +---------------+
   +->|   Front-End   |----------->|   Back-End    |
      +---------------+            +---------------+
              |                            |
              v                            v
             SSA                Instruction Selection
              |                            |
              v                            v
        Optimization              Register Allocation
              |                            |
              v                            v
        Block Layout            Finalization/Encoding
```

Like the other engines, the implementation can be found under `engine`, specifically
in the `wazevo` sub-package. The entry-point is found under `internal/engine/wazevo/engine.go`,
where the implementation of the interface `wasm.Engine` is found.

All the passes can be dumped to the console for debugging, by enabling the build-time
flags under `internal/engine/wazevo/wazevoapi/debug_options.go`. The flags are disabled
by default and should only be enabled during debugging. These may also change in the future.

In the following we will assume all paths to be relative to `internal/engine/wazevo`,
so we will omit the prefix.

## Index

- [Front-End](frontend/)
- [Back-End](backend/)
- [Appendix](appendix/)
185
site/content/docs/how_the_optimizing_compiler_works/appendix.md
Normal file
@@ -0,0 +1,185 @@
+++
title = "Appendix: Trampolines"
layout = "single"
+++

Trampolines are used to interface between the Go runtime and the generated
code, in two cases:

- when we need to **enter the generated code** from the Go runtime.
- when we need to **leave the generated code** to invoke a host function
  (written in Go).

In this section we want to complete the picture of how a Wasm function gets
translated from Wasm to executable code in the optimizing compiler, by
describing how to jump into the execution of the generated code at run-time.

## Entering the Generated Code

At run-time, user space invokes a Wasm function through the public
`api.Function` interface, using the methods `Call()` or `CallWithStack()`. The
implementation of these methods, in turn, eventually invokes an ASM
**trampoline**. The signature of this trampoline in Go code is:

```go
func entrypoint(
    preambleExecutable, functionExecutable *byte,
    executionContextPtr uintptr, moduleContextPtr *byte,
    paramResultStackPtr *uint64,
    goAllocatedStackSlicePtr uintptr)
```
- `preambleExecutable` is a pointer to the generated code for the preamble (see
  below).
- `functionExecutable` is a pointer to the generated code for the function (as
  described in the previous sections).
- `executionContextPtr` is a raw pointer to the `wazevo.executionContext`
  struct. This struct is used to save the state of the Go runtime before
  entering or leaving the generated code. It also holds shared state between the
  Go runtime and the generated code, such as the exit code that is used to
  terminate execution on failure, or suspend it to invoke host functions.
- `moduleContextPtr` is a pointer to the `wazevo.moduleContextOpaque` struct.
  Its contents are basically pointers to the module instance-specific objects
  as well as functions. This is sometimes called "VMContext" in other Wasm
  runtimes.
- `paramResultStackPtr` is a pointer to the slice where the arguments and
  results of the function are passed.
- `goAllocatedStackSlicePtr` is an aligned pointer to the Go-allocated stack
  for holding values and call frames. For further details refer to
  [Backend § Prologue and Epilogue](../backend/#prologue-and-epilogue).

The trampoline can be found in `backend/isa/<arch>/abi_entry_<arch>.s`.
For each given architecture, the trampoline:

- moves the arguments to specific registers to match the behavior of the entry
  preamble or trampoline function, and
- jumps into the execution of the generated code for the preamble.

The **preamble** that `entrypoint` jumps into is generated per function
signature.

This is implemented in `machine.CompileEntryPreamble(*ssa.Signature)`.

The preamble sets the fields in the `wazevo.executionContext`.

At the beginning of the preamble:

- Set a register to point to the `*wazevo.executionContext` struct.
- Save the stack pointers, frame pointers, return addresses, etc. to that
  struct.
- Update the stack pointer to point to `paramResultStackPtr`.

The generated code works in concert with the assumption that the preamble has
been entered through the aforementioned trampoline. Thus, it assumes that the
arguments can be found in some specific registers.

The preamble then assigns the arguments pointed at by `paramResultStackPtr` to
the registers and stack locations that the generated code expects.

Finally, it invokes the generated code for the function.

The epilogue reverses part of the process, finally returning control to the
caller of the `entrypoint()` function, and the Go runtime. The caller of
`entrypoint()` is also responsible for completing the clean-up procedure by
invoking `afterGoFunctionCallEntrypoint()` (again, implemented in
backend-specific ASM), which will restore the stack pointers and return
control to the caller of the function.

The arch-specific code can be found in
`backend/isa/<arch>/abi_entry_preamble.go`.

[wazero-engine-stack]: https://github.com/tetratelabs/wazero/blob/095b49f74a5e36ce401b899a0c16de4eeb46c054/internal/engine/compiler/engine.go#L77-L132
[abi-arm64]: https://tip.golang.org/src/cmd/compile/abi-internal#arm64-architecture
[abi-amd64]: https://tip.golang.org/src/cmd/compile/abi-internal#amd64-architecture
[abi-cc]: https://tip.golang.org/src/cmd/compile/abi-internal#function-call-argument-and-result-passing

## Leaving the Generated Code

In "[How do compiler functions work?][how-do-compiler-functions-work]", we
already outlined how _leaving_ the generated code works with the help of a
function. We will complete the picture here by briefly describing the code that
is generated.

When the generated code needs to return control to the Go runtime, the compiler
inserts a meta-instruction called `exitSequence` in both the `amd64` and
`arm64` backends. This meta-instruction sets the `exitCode` in the
`wazevo.executionContext` struct, restores the stack pointers and then returns
control to the caller of the `entrypoint()` function described above.

As described in "[How do compiler functions
work?][how-do-compiler-functions-work]", the mechanism is essentially the same
when invoking a host function or raising an error. However, when a host
function is invoked, the `exitCode` also indicates the identifier of the host
function to be invoked.

The magic really happens in the `backend.Machine.CompileGoFunctionTrampoline()`
method. This method is actually invoked when host modules are being
instantiated. It generates a trampoline that is used to invoke such functions
from the generated code.

This trampoline implements essentially the same prologue as the `entrypoint()`,
but it also reserves space for the arguments and results of the function to be
invoked.

A host function has the signature:

```go
func(ctx context.Context, stack []uint64)
```

The function arguments in the `stack` parameter are copied over to the reserved
slots of the real stack. For instance, on `arm64` the stack layout would look
as follows (on `amd64` it would be similar):

```goat
                        (high address)
    SP ------> +-----------------+ <----+
               |     .......     |      |
               |      ret Y      |      |
               |     .......     |      |
               |      ret 0      |      |
               |      arg X      |      | size_of_arg_ret
               |     .......     |      |
               |      arg 1      |      |
               |      arg 0      | <----+ <-------- originalArg0Reg
               | size_of_arg_ret |
               |  ReturnAddress  |
               +-----------------+ <----+
               |      xxxx       |      |  ;; might be padded to make it 16-byte aligned.
          +--->|  arg[N]/ret[M]  |      |
 sliceSize|    |   ............  |      | goCallStackSize
          |    |  arg[1]/ret[1]  |      |
          +--->|  arg[0]/ret[0]  | <----+ <-------- arg0ret0AddrReg
               |    sliceSize    |
               |   frame_size    |
               +-----------------+
                        (low address)
```

Finally, the trampoline jumps into the execution of the host function using the
`exitSequence` meta-instruction.

Upon return, the process is reversed.
## Code

- The trampoline to enter the generated function is implemented by the
  `backend.Machine.CompileEntryPreamble()` method.
- The trampoline to return traps and invoke host functions is generated by the
  `backend.Machine.CompileGoFunctionTrampoline()` method.

You can find arch-specific implementations in
`backend/isa/<arch>/abi_go_call.go`,
`backend/isa/<arch>/abi_entry_preamble.go`, etc. The trampolines are found
under `backend/isa/<arch>/abi_entry_<arch>.s`.

## Further References

- Go's [internal ABI documentation][abi-internal] details a calling convention
  similar to the one we use in both the arm64 and amd64 backends.
- Raphael Poss's [The Go low-level calling convention on
  x86-64][go-call-conv-x86] is also an excellent reference for `amd64`.

[abi-internal]: https://tip.golang.org/src/cmd/compile/abi-internal
[go-call-conv-x86]: https://dr-knz.net/go-calling-convention-x86-64.html
[proposal-register-cc]: https://go.googlesource.com/proposal/+/master/design/40724-register-calling.md#background
[how-do-compiler-functions-work]: ../../how_do_compiler_functions_work/
507
site/content/docs/how_the_optimizing_compiler_works/backend.md
Normal file
@@ -0,0 +1,507 @@
+++
title = "How the Optimizing Compiler Works: Back-End"
layout = "single"
+++

In this section we will discuss the phases in the back-end of the optimizing
compiler:

- [Instruction Selection](#instruction-selection)
- [Register Allocation](#register-allocation)
- [Finalization and Encoding](#finalization-and-encoding)

Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Please note that, since the implementation of
the back-end is architecture-specific, the code might differ for each
architecture.

### Code

The higher-level entry-point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods on the same type:

- `backend.Compiler.Lower()` (instruction selection)
- `backend.Compiler.RegAlloc()` (register allocation)
- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)
## Instruction Selection

The instruction selection phase is responsible for mapping the higher-level SSA
instructions to arch-specific instructions. Each SSA instruction is translated
to one or more machine instructions.

Each target architecture comes with a different number of registers; some of
them are general purpose, others might be specific to certain instructions. In
general, we can expect to have a set of registers for integer computations,
another set for floating-point computations, a set for vector (SIMD)
computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).

In addition, some registers might be reserved by the Go runtime or the
operating system for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, instructions are usually able to
address more than just registers: an *operand* might be able to represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
argument of the instruction is mapped, in the simplest case, to an `operand`
holding a virtual register; in other cases, the `operand` might be mapped to a
memory address, or an immediate value. Sometimes this makes it possible to
replace several SSA instructions with a single machine instruction, by folding
the addressing mode into the instruction itself.

For instance, consider the following SSA instructions:

```
v4:i32 = Const 0x9
v6:i32 = Load v5, 0x4
v7:i32 = Iadd v6, v4
```

In the `amd64` architecture (in AT&T syntax), the `add` instruction adds the
first operand to the second operand, and assigns the result to the second
operand. So assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to
the virtual registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the
`Iadd` instruction on `amd64` might look like this:

```asm
;; AT&T syntax
add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).
### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
    mov x130?, x2
    subs wzr, w130?, #0x0
    b.ge L2
L3 (SSA Block: blk1):
    mov x136?, xzr
    sub w134?, w136?, w130?
    mov x135?, x134?
    b L4
L2 (SSA Block: blk2):
    mov x135?, x130?
L4 (SSA Block: blk3):
    mov x0, x135?
    ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that are used to mark the beginning of each basic block, and
they are the targets for branching instructions such as `b` and `b.ge`.
### Code

`backend.Machine` is the interface to the backend. It has methods to
translate (lower) the IR to machine code. Again, as seen earlier in the
front-end, the term *lowering* is used to indicate translation from a
higher-level representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>` where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.
## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content, and then load them back into a register
when they are needed.

For a given function `f` the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered
if its value is overwritten by the function, and it is not saved by the
callee. This information is used in the finalization phase to determine which
registers need to be saved in the prologue and restored in the epilogue.
Strictly speaking, this last step does not belong to register allocation in a
textbook sense, but it is a necessary step for the finalization phase.
### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program, if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4 because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from the live-ins of the successor blocks when a block is
allocated.
### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].

Each basic block is allocated registers in a linear scan order, and the
allocation state is propagated from a given basic block to its successors.
Then, each block continues allocation from that initial state.
#### Merge States

Special care has to be taken when a block has multiple predecessors. We call
this *fixing merge states*: for instance, consider the following:

```goat { width="30%" }
.-----.     .-----.
| BB0 |     | BB1 |
'--+--'     '--+--'
   |           |
   +-----+-----+
         |
         v
      .-----.
      | BB2 |
      '-----'
```

If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1` and both are predecessors of a block `BB2`, then we need
to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via the stack; fixing merge states ensures that registers and the stack are
used consistently to pass values across the involved states.
#### Spilling

If the register allocator cannot find a free register for a given virtual
(live) register, it needs to "spill" the value to the stack to get a free
register, *i.e.,* stash it temporarily to the stack. When that virtual register
is used again later, we will have to insert instructions to reload the value
into a real register.

While the allocation procedure proceeds, it also records all
the virtual registers that transition to the "spilled" state, and inserts the
reload instructions when those registers are reused later.

The spill instructions are actually inserted at the end of register
allocation, after all the allocations and the merge states have been fixed. At
this point, all the other potential sources of instability have been resolved,
and we know where all the reloads happen.

We insert the spills in the block that is the lowest common ancestor of all the
blocks that reload the value.
#### Clobbered Registers

At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of the allocated registers and compares them
to the architecture-specific set `CalleeSavedRegisters`. If a register
has been allocated, and it is present in this set, the register is marked as
"clobbered", i.e., we now know that the register allocator will overwrite
that value. Thus, these values will have to be spilled in the prologue.
#### References

Register allocation is a complex problem, possibly the most complicated
part of the backend. The following references were used to implement the
algorithm:

- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for liveness analysis.
- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go

We suggest referring to them to dive deeper into the topic.
### Example

At the end of the register allocation phase, the basic blocks of our `abs`
function look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
	mov x2, x2
	subs wzr, w2, #0x0
	b.ge L2
L3 (SSA Block: blk1):
	mov x8, xzr
	sub w8, w8, w2
	mov x8, x8
	b L4
L2 (SSA Block: blk2):
	mov x8, x2
L4 (SSA Block: blk3):
	mov x0, x8
	ret
```


Notice how the virtual registers have all been replaced by real registers, i.e.,
no register identifier is suffixed with `?`. This example is quite simple, and
it does not require any spill.

### Code

The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
interfaces in `regalloc/api.go`.

Essentially:

- each architecture exposes iteration over basic blocks of a function
  (`regalloc.Function` interface)
- each arch-specific basic block exposes iteration over instructions
  (`regalloc.Block` interface)
- each arch-specific instruction exposes the set of registers it defines and
  uses (`regalloc.Instr` interface)

By defining these interfaces, the register allocation algorithm can assign real
registers to virtual registers without dealing specifically with the target
architecture.

In practice, each interface is usually implemented by instantiating a common
generic struct that already comes with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.

`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among others:

- the set of registers that are available for allocation, excluding, for
  instance, those that might be reserved by the runtime or the OS
  (`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
  (`CalleeSavedRegisters`)

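A minimal sketch of how such interfaces fit together follows. The interface and
method names here are simplified stand-ins (see `regalloc/api.go` for the real
signatures); it only shows the kind of arch-independent traversal the allocator
performs:

```go
package main

import "fmt"

// Instr mirrors the role of regalloc.Instr: an instruction exposes the
// virtual registers it defines and uses. (Simplified signatures.)
type Instr interface {
	Defs() []int
	Uses() []int
}

// Block mirrors the role of regalloc.Block: iteration over instructions.
type Block interface {
	Instrs() []Instr
}

// Toy implementations, for illustration only.
type instr struct{ defs, uses []int }

func (i instr) Defs() []int { return i.defs }
func (i instr) Uses() []int { return i.uses }

type blk struct{ instrs []Instr }

func (b blk) Instrs() []Instr { return b.instrs }

// vregsInBlock collects every virtual register mentioned in a block,
// without knowing anything about the target architecture.
func vregsInBlock(b Block) map[int]bool {
	seen := map[int]bool{}
	for _, i := range b.Instrs() {
		for _, r := range i.Defs() {
			seen[r] = true
		}
		for _, r := range i.Uses() {
			seen[r] = true
		}
	}
	return seen
}

func main() {
	b := blk{instrs: []Instr{
		instr{defs: []int{3}, uses: []int{1, 2}}, // v3 = op v1, v2
		instr{defs: []int{4}, uses: []int{3}},    // v4 = op v3
	}}
	fmt.Println(len(vregsInBlock(b))) // number of distinct virtual registers
}
```
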
### Debug Flags

- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
  allocation procedure.
- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
  allocation result.

## Finalization and Encoding

At the end of the register allocation phase, we have enough information to
finally generate machine code (_encoding_). We are only missing the prologue
and epilogue of the function.

### Prologue and Epilogue

As usual, the **prologue** is executed before the main body of the function,
and the **epilogue** is executed at the return. The prologue is responsible for
setting up the stack frame, and the epilogue is responsible for cleaning up the
stack frame and returning control to the caller.

Generally, this means, at the very least:

- saving the return address
- saving a base pointer to the stack; or, equivalently, the height of the
  stack at the beginning of the function

For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
pointer:

```goat {width="100%" height="250"}
            (high address)                  (high address)
RBP ----> +-----------------+             +-----------------+
          |      `...`      |             |      `...`      |
          |      ret Y      |             |      ret Y      |
          |      `...`      |             |      `...`      |
          |      ret 0      |             |      ret 0      |
          |      arg X      |             |      arg X      |
          |      `...`      |    ====>    |      `...`      |
          |      arg 1      |             |      arg 1      |
          |      arg 0      |             |      arg 0      |
          |   Return Addr   |             |   Return Addr   |
RSP ----> +-----------------+             |   Caller_RBP    |
              (low address)               +-----------------+ <----- RSP, RBP
                                              (low address)
```

While, on `arm64`, there is only a stack pointer `SP`:

```goat {width="100%" height="300"}
          (high address)              (high address)
SP ---> +-----------------+         +------------------+ <----+
        |      `...`      |         |      `...`       |      |
        |      ret Y      |         |      ret Y       |      |
        |      `...`      |         |      `...`       |      |
        |      ret 0      |         |      ret 0       |      |
        |      arg X      |         |      arg X       |      |  size_of_arg_ret.
        |      `...`      |  ====>  |      `...`       |      |
        |      arg 1      |         |      arg 1       |      |
        |      arg 0      |         |      arg 0       | <----+
        +-----------------+         | size_of_arg_ret  |
            (low address)           |  return address  |
                                    +------------------+ <---- SP
                                        (low address)
```

However, the prologue and epilogue might also be responsible for saving and
restoring the state of registers that might be overwritten by the function
("clobbered"); and, if spilling occurs, the prologue and epilogue are also
responsible for reserving and releasing the space for the spilled values.

For clarity, we make a distinction between the space reserved for the clobbered
registers and the space reserved for the spilled values:

- Spill slots are used to temporarily store the values that need spilling as
  determined by the register allocator. This section must have a fixed height,
  but its contents will change over time, as registers are spilled and
  reloaded.
- Clobbered registers are, similarly, determined by the register allocator, but
  they are stashed in the prologue and then restored in the epilogue.

The procedure happens after the register allocation phase because at
this point we have collected enough information to know how much space we need
to reserve, and which registers are clobbered.

Regardless of the architecture, after allocating this space, the stack will
look as follows:

```goat {height="350"}
    (high address)
+-----------------+
|      `...`      |
|      ret Y      |
|      `...`      |
|      ret 0      |
|      arg X      |
|      `...`      |
|      arg 1      |
|      arg 0      |
| (arch-specific) |
+-----------------+
|   clobbered M   |
|   ............  |
|   clobbered 1   |
|   clobbered 0   |
|   spill slot N  |
|   ............  |
|   spill slot 0  |
+-----------------+
    (low address)
```

Note: the prologue might also introduce a check of the stack bounds. If there
is not enough space to allocate the stack frame, the function will exit the
execution and will try to grow the stack from the Go runtime.

The epilogue simply reverses the operations of the prologue.

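The bookkeeping behind this layout can be sketched as a small frame-size
computation. The slot size and helper names below are assumptions for the
sketch, not wazero's actual (arch-specific) code:

```go
package main

import "fmt"

const slotSize = 8 // one 64-bit slot; an assumption for this sketch

// frameSize returns the space the prologue must reserve below the
// arch-specific region: one slot per clobbered register plus the
// fixed-height spill area.
func frameSize(numClobbered, numSpillSlots int) int {
	return (numClobbered + numSpillSlots) * slotSize
}

// spillSlotOffset returns the offset of a spill slot from the stack
// pointer after the prologue: spill slots sit below the clobbered ones.
func spillSlotOffset(slot int) int {
	return slot * slotSize
}

func main() {
	fmt.Println(frameSize(2, 3))     // 2 clobbered regs + 3 spill slots -> 40 bytes
	fmt.Println(spillSlotOffset(1))  // second spill slot -> offset 8
}
```
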
### Other Post-RegAlloc Logic

The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.

For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses registers `rdx`, `rax`).
Instead, they are lowered with ad-hoc logic as part of the implementation of
the `backend.Machine.PostRegAlloc` method.

### Encoding

The final stage of the backend encodes the machine instructions into bytes and
writes them to the target buffer. Before proceeding with the encoding, relative
addresses in branching instructions or addressing modes are resolved.

The procedure encodes the instructions in the order they appear in the
function.

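Relative-address resolution is typically a two-pass affair: first lay out the
byte offset of every instruction, then compute each branch's displacement. The
sketch below is a toy model of that idea, not wazero's actual encoder:

```go
package main

import "fmt"

// inst is a toy machine instruction: its size in bytes and, for branches,
// the index of the target instruction.
type inst struct {
	size   int
	branch bool
	target int
}

// resolve computes each instruction's offset and, for every branch, the
// relative displacement from the end of the branch to its target.
func resolve(code []inst) []int {
	offsets := make([]int, len(code)+1)
	for i, in := range code {
		offsets[i+1] = offsets[i] + in.size
	}
	var disps []int
	for i, in := range code {
		if in.branch {
			disps = append(disps, offsets[in.target]-offsets[i+1])
		}
	}
	return disps
}

func main() {
	code := []inst{
		{size: 4},                          // 0: some op
		{size: 4, branch: true, target: 3}, // 1: branch forward to 3
		{size: 4},                          // 2: some op
		{size: 4},                          // 3: branch target
	}
	fmt.Println(resolve(code)) // [4]: one 4-byte instruction is skipped
}
```
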
### Code

- The prologue and epilogue are set up as part of the
  `backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.

### Debug Flags

- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
  function after the finalization phase.
- `wazevoapi.PrintMachineCodeHexPerFunctionUnmodified` prints a hex
  representation of the generated code as it is.
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code that can be disassembled.

The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.


<hr>

* Previous Section: [Front-End](../frontend/)
* Next Section: [Appendix: Trampolines](../appendix/)

[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
371
site/content/docs/how_the_optimizing_compiler_works/frontend.md
Normal file
@@ -0,0 +1,371 @@

+++
title = "How the Optimizing Compiler Works: Front-End"
layout = "single"
+++

In this section we will discuss the phases in the front-end of the optimizing
compiler:

- [Translation to SSA](#translation-to-ssa)
- [Optimization](#optimization)
- [Block Layout](#block-layout)

Each section includes an explanation of the phase; the subsection **Code**
gives high-level pointers to functions and packages; the subsection
**Debug Flags** indicates the flags that can be used to enable advanced
logging of the phase.

## Translation to SSA

We mentioned earlier that wazero uses an internal representation called an "SSA"
form or "Static Single-Assignment" form, but we never explained what that is.

In short, every program, or, in our case, every Wasm function, can be
translated into a control-flow graph. The control-flow graph is a directed
graph where each node is a sequence of statements that do not contain a
control-flow instruction, called a **basic block**. Control-flow instructions,
instead, are translated into edges.

For instance, take the following implementation of the `abs` function:

```wasm
(module
  (func (;0;) (param i32) (result i32)
    (if (result i32) (i32.lt_s (local.get 0) (i32.const 0))
      (then
        (i32.sub (i32.const 0) (local.get 0)))
      (else
        (local.get 0))
    )
  )
  (export "f" (func 0))
)
```

This is translated to the following block diagram:

```goat {width="100%" height="500"}
 +---------------------------------------------+
 |blk0: (exec_ctx:i64, module_ctx:i64, v2:i32) |
 |    v3:i32 = Iconst_32 0x0                   |
 |    v4:i32 = Icmp lt_s, v2, v3               |
 |    Brz v4, blk2                             |
 |    Jump blk1                                |
 +---------------------------------------------+
                        |
                        |
        +---`(v4 != 0)`-+-`(v4 == 0)`---+
        |                               |
        v                               v
 +---------------------------+   +---------------------------+
 |blk1: () <-- (blk0)        |   |blk2: () <-- (blk0)        |
 |    v6:i32 = Iconst_32 0x0 |   |    Jump blk3, v2          |
 |    v7:i32 = Isub v6, v2   |   |                           |
 |    Jump blk3, v7          |   |                           |
 +---------------------------+   +---------------------------+
        |                               |
        |                               |
        +-`{v5 := v7}`---+---`{v5 := v2}`
                         |
                         v
        +------------------------------+
        |blk3: (v5:i32) <-- (blk1,blk2)|
        |    Jump blk_ret, v5          |
        +------------------------------+
                         |
                    {return v5}
                         |
                         v
```

We use the ["block argument" variant of SSA][ssa-blocks], which is also the
representation [used in LLVM's MLIR][llvm-mlir]. In this variant, each block
takes a list of arguments. Each block ends with a branching instruction
(Branch, Return, Jump, etc.) with an optional list of arguments; these
arguments are assigned to the target block's arguments as in a function call.

Consider the first block `blk0`:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```


You will notice that, compared to the original function, it takes two extra
parameters (`exec_ctx` and `module_ctx`):

1. `exec_ctx` is a pointer to `wazevo.executionContext`. This is used to exit
   the execution in the face of traps or for host function calls.
2. `module_ctx` is a pointer to `wazevo.moduleContextOpaque`. This is used,
   among other things, to access memory.

It then takes one parameter `v2`, corresponding to the function parameter, and
it defines two variables `v3`, `v4`. `v3` is the constant 0, `v4` is the result
of comparing `v2` to `v3` using the `i32.lt_s` instruction. Then, it branches
to `blk2` if `v4` is zero, otherwise it jumps to `blk1`.

You might also have noticed that the instructions do not correspond strictly to
the original Wasm opcodes. This is because, similarly to the wazero IR used by
the old compiler, this is a custom IR.

You will also notice that, _on the left-hand side of the assignments_ of any
statement, no name occurs _twice_: this is why this form is called
**single-assignment**.

Finally, notice how `blk1` and `blk2` end with a jump to the last block `blk3`.

```
blk1: ()
	...
	Jump blk3, v7

blk2: ()
	Jump blk3, v2

blk3: (v5:i32)
	...
```

`blk3` takes an argument `v5`: `blk1` jumps to `blk3` with `v7` and `blk2`
jumps to `blk3` with `v2`, meaning `v5` is effectively a rename of `v7` or
`v2`, depending on the originating block. If you are familiar with the
traditional representation of an SSA form, you will recognize that the role of
block arguments is equivalent to the role of the *Phi (Φ) function*, a special
function that returns a different value depending on the incoming edge; e.g.,
in this case: `v5 := Φ(v7, v2)`.

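To make the block-argument semantics concrete, here is a small Go re-enactment
of this CFG (purely illustrative; each "jump" passes the value that becomes
`v5`):

```go
package main

import "fmt"

// abs mirrors the SSA control flow above: blk1 and blk2 both "jump" to
// blk3, each passing the value that becomes v5.
func abs(v2 int32) int32 {
	if v2 < 0 { // v4 = Icmp lt_s, v2, v3
		v7 := 0 - v2 // blk1: v7 = Isub v6, v2
		return blk3(v7)
	}
	return blk3(v2) // blk2: Jump blk3, v2
}

// blk3's parameter v5 plays the role of v5 := Φ(v7, v2).
func blk3(v5 int32) int32 { return v5 }

func main() {
	fmt.Println(abs(-5), abs(7)) // 5 7
}
```
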
### Code

The relevant APIs can be found under the sub-packages `ssa` and `frontend`.
In the code, the terms *lower* or *lowering* are often used to indicate a
mapping or a translation, because such transformations usually correspond to
targeting a lower abstraction level.

- Basic blocks are represented by the type `ssa.Block`.
- The SSA form is constructed using an `ssa.Builder`. The `ssa.Builder` is
  instantiated in the context of `wasm.Engine.CompileModule()`, more
  specifically in the method `frontend.Compiler.LowerToSSA()`.
- The mapping between Wasm opcodes and the IR happens in `frontend/lower.go`,
  more specifically in the method `frontend.Compiler.lowerCurrentOpcode()`.
- Because they are semantically equivalent, in the code, basic block parameters
  are sometimes referred to as "Phi values".

#### Instructions and Values

An `ssa.Instruction` is a single instruction in the SSA form. Each instruction
might consume zero or more `ssa.Value`s, and it usually produces a single
`ssa.Value`; some instructions may not produce any value (for instance, a
`Jump` instruction). An `ssa.Value` is an abstraction that represents a typed
name binding, and it is used to represent the result of an instruction, or the
input to an instruction.

For instance:

```
blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7
```

`Iconst_32` takes no input value and produces value `v6`; `Isub` takes two
input values (`v6`, `v2`) and produces value `v7`; `Jump` takes one input value
(`v7`) and produces no value. All such values have the `i32` type. The wazero
SSA's type system (`ssa.Type`) allows the following types:

- `i32`: 32-bit integer
- `i64`: 64-bit integer
- `f32`: 32-bit floating point
- `f64`: 64-bit floating point
- `v128`: 128-bit SIMD vector

For simplicity, we don't have a dedicated type for pointers. Instead, unlike
traditional compilers such as LLVM, we use the `i64` type to represent pointer
values, since we only support 64-bit architectures.

Values and instructions are both allocated from pools to minimize memory
allocations.

### Debug Flags

- `wazevoapi.PrintSSA` dumps the SSA form to the console.
- `wazevoapi.FrontEndLoggingEnabled` dumps progress of the translation between
  Wasm opcodes and SSA instructions to the console.

## Optimization

The SSA form makes it easier to perform a number of optimizations. For
instance, we can perform constant propagation, dead code elimination, and
common subexpression elimination. These optimizations either act upon the
instructions within a basic block, or they act upon the control-flow graph as
a whole.

On a high level, consider the following basic block, derived from the previous
example:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v2:i32 = Iconst_32 -5
	v3:i32 = Iconst_32 0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump blk1
```

It is pretty easy to see that the comparison in `v4` can be replaced by a
constant `1`, because the comparison is between two constant values (-5, 0).
Therefore, the block can be rewritten as such:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	v4:i32 = Iconst_32 1
	Brz v4, blk2
	Jump blk1
```

However, we can now also see that the conditional branch to `blk2` is never
taken, because `v4` is never zero: the block `blk2` is never executed, so even
the branch instruction and the constant definition `v4` can be removed:

```
blk0: (exec_ctx:i64, module_ctx:i64)
	Jump blk1
```

This is a simple example of constant propagation and dead code elimination
occurring within a basic block. However, now `blk2` is unreachable, because
there is no other edge in the graph that points to it; thus it can be removed
from the control-flow graph. This is an example of dead-code elimination that
occurs at the control-flow graph level.

In practice, because WebAssembly is a compilation target, these simple
optimizations are often unnecessary. The optimization passes implemented in
wazero are also work-in-progress and, at the time of writing, further work is
expected to implement more advanced optimizations.

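The in-block rewrite shown above can be sketched as a tiny folding pass over a
list of toy instructions. The `inst` type and `foldCompares` function are
hypothetical simplifications, not wazero's pass implementation:

```go
package main

import "fmt"

// inst is a toy SSA instruction: either a constant definition or a signed
// less-than comparison of two values.
type inst struct {
	op       string // "iconst" or "icmp_lt_s"
	def      int    // value id this instruction defines
	a, b     int    // operand value ids (for icmp_lt_s)
	constVal int32  // constant payload (for iconst)
}

// foldCompares replaces a comparison whose operands are both known
// constants with a constant 0 or 1: the essence of the rewrite above.
func foldCompares(code []inst) []inst {
	consts := map[int]int32{}
	out := make([]inst, 0, len(code))
	for _, in := range code {
		switch in.op {
		case "iconst":
			consts[in.def] = in.constVal
		case "icmp_lt_s":
			av, aok := consts[in.a]
			bv, bok := consts[in.b]
			if aok && bok {
				v := int32(0)
				if av < bv {
					v = 1
				}
				in = inst{op: "iconst", def: in.def, constVal: v}
				consts[in.def] = v
			}
		}
		out = append(out, in)
	}
	return out
}

func main() {
	code := []inst{
		{op: "iconst", def: 2, constVal: -5}, // v2 = -5
		{op: "iconst", def: 3, constVal: 0},  // v3 = 0
		{op: "icmp_lt_s", def: 4, a: 2, b: 3},
	}
	folded := foldCompares(code)
	fmt.Println(folded[2].op, folded[2].constVal) // the comparison became a constant
}
```
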
### Code

Optimization passes are implemented by `ssa.Builder.RunPasses()`. An
optimization pass is just a function that takes an SSA builder as a parameter.

Passes iterate over the basic blocks and, for each basic block, they iterate
over the instructions. Each pass may mutate the basic block by modifying the
instructions it contains, or it might change the entire shape of the
control-flow graph (e.g. by removing blocks).

Currently, there are two dead-code elimination passes:

- `passDeadBlockEliminationOpt`, acting at the block level.
- `passDeadCodeEliminationOpt`, acting at the instruction level.

Notably, `passDeadCodeEliminationOpt` also assigns an `InstructionGroupID` to
each instruction. This is used to determine whether a sequence of instructions
can be replaced by a single machine instruction during the back-end phase. For
more details, see also the relevant documentation in `ssa/instructions.go`.

There are also simple constant folding passes such as `passNopInstElimination`,
which folds and deletes instructions that are essentially no-ops (e.g. shifting
by a 0 amount).

### Debug Flags

`wazevoapi.PrintOptimizedSSA` dumps the SSA form to the console after
optimization.

## Block Layout

As we have seen earlier, the SSA form instructions are contained within basic
blocks, and the basic blocks are connected by edges of the control-flow graph.
However, machine code is not laid out in a graph, but is just a linear
sequence of instructions.

Thus, the last step of the front-end is to lay out the basic blocks in a linear
sequence. Because each basic block, by design, ends with a control-flow
instruction, one of the goals of the block layout phase is to maximize the
number of **fall-through opportunities**. A fall-through opportunity occurs
when a block ends with a jump instruction whose target is exactly the next
block in the sequence. In order to maximize the number of fall-through
opportunities, the block layout phase might reorder the basic blocks in the
control-flow graph, and transform the control-flow instructions. For instance,
it might _invert_ some branching conditions.

The end goal is to effectively minimize the number of jumps and branches in
the machine code that will be generated later.

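Branch inversion for fall-through can be sketched like this (a toy model with
made-up names, not wazero's actual pass):

```go
package main

import "fmt"

// branch is a toy two-way terminator: jump to taken if the condition
// holds, otherwise fall to next.
type branch struct {
	cond        string
	taken, next string
}

// invertCond maps a condition to its negation (a subset, for the sketch).
var invertCond = map[string]string{"eq": "ne", "ne": "eq", "lt": "ge", "ge": "lt"}

// layoutFix inverts the branch when the taken target is the block that
// will be emitted next, turning the taken edge into a fall-through.
func layoutFix(b branch, nextBlock string) branch {
	if b.taken == nextBlock {
		b.cond = invertCond[b.cond]
		b.taken, b.next = b.next, b.taken
	}
	return b
}

func main() {
	b := branch{cond: "lt", taken: "blk2", next: "blk1"}
	// blk2 is emitted next, so the branch is inverted and blk2 falls through.
	fmt.Println(layoutFix(b, "blk2"))
}
```
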
### Critical Edges

Special attention must be paid when a basic block has multiple predecessors,
i.e., when it has multiple incoming edges. In particular, an edge between two
basic blocks is called a **critical edge** when, at the same time:

- the predecessor has multiple successors **and**
- the successor has multiple predecessors.

For instance, in the example below the edge between `BB0` and `BB3`
is a critical edge.

```goat { width="300" }
┌───────┐        ┌───────┐
│  BB0  │━┓      │  BB1  │
└───────┘ ┃      └───────┘
    │     ┃          │
    ▼     ┃          ▼
┌───────┐ ┃      ┌───────┐
│  BB2  │ ┗━━━━▶ │  BB3  │
└───────┘        └───────┘
```

In these cases the critical edge is split by introducing a new basic block,
called a **trampoline**, where the critical edge was.

```goat { width="300" }
┌───────┐                  ┌───────┐
│  BB0  │──────┐           │  BB1  │
└───────┘      ▼           └───────┘
    │     ┌──────────┐         │
    │     │trampoline│         │
    ▼     └──────────┘         ▼
┌───────┐      │           ┌───────┐
│  BB2  │      └─────────▶ │  BB3  │
└───────┘                  └───────┘
```

For more details on critical edges, see:

- https://en.wikipedia.org/wiki/Control-flow_graph
- https://nickdesaulniers.github.io/blog/2023/01/27/critical-edge-splitting/

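The definition above translates to a few lines of Go. This is a sketch over a
plain adjacency-list CFG (not wazero's representation):

```go
package main

import "fmt"

// criticalEdges returns the edges (pred, succ) where the predecessor has
// multiple successors and the successor has multiple predecessors; these
// are the edges the layout phase splits with a trampoline block.
func criticalEdges(succs map[string][]string) [][2]string {
	preds := map[string]int{}
	for _, ss := range succs {
		for _, s := range ss {
			preds[s]++
		}
	}
	var out [][2]string
	for p, ss := range succs {
		if len(ss) < 2 {
			continue
		}
		for _, s := range ss {
			if preds[s] >= 2 {
				out = append(out, [2]string{p, s})
			}
		}
	}
	return out
}

func main() {
	// The example above: BB0 -> BB2, BB0 -> BB3, BB1 -> BB3.
	g := map[string][]string{"BB0": {"BB2", "BB3"}, "BB1": {"BB3"}}
	fmt.Println(criticalEdges(g)) // [[BB0 BB3]]
}
```
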
### Example

At the end of the block layout phase, the laid-out SSA for the `abs` function
looks as follows:

```
blk0: (exec_ctx:i64, module_ctx:i64, v2:i32)
	v3:i32 = Iconst_32 0x0
	v4:i32 = Icmp lt_s, v2, v3
	Brz v4, blk2
	Jump fallthrough

blk1: () <-- (blk0)
	v6:i32 = Iconst_32 0x0
	v7:i32 = Isub v6, v2
	Jump blk3, v7

blk2: () <-- (blk0)
	Jump fallthrough, v2

blk3: (v5:i32) <-- (blk1,blk2)
	Jump blk_ret, v5
```

### Code

`passLayoutBlocks` implements the block layout phase.

### Debug Flags

- `wazevoapi.PrintBlockLaidOutSSA` dumps the SSA form to the console after
  block layout.
- `wazevoapi.SSALoggingEnabled` logs the transformations that are applied
  during this phase, such as inverting branching conditions or splitting
  critical edges.

<hr>

* Previous Section: [How the Optimizing Compiler Works](../)
* Next Section: [Back-End](../backend/)

[ssa-blocks]: https://en.wikipedia.org/wiki/Static_single-assignment_form#Block_arguments
[llvm-mlir]: https://mlir.llvm.org/docs/Rationale/Rationale/#block-arguments-vs-phi-nodes