Context
meld fuses P3 async cross-component calls using a callback-driving adapter that calls the [async-lift] entry, drives the [callback] loop via waitable-set-poll, and reads results from a task.return shim. This is correct but leaves optimization opportunities that loom should exploit.
Relates to #68 (cross-component optimization passes).
Current adapter overhead (unoptimized)
For synchronous-completing async functions (the common case — pure compute like fibonacci, collatz-steps), the callback loop runs zero iterations. The overhead is:
call [async-lift] ;; 1 function call (wraps start_task → callback → impl)
unpack EXIT ;; ~5 i32 ops (and, shr_u, eqz, br_if)
read result from shim ;; 1 global.get or i32.load
return ;; 0 (value already on stack)
~10 instructions overhead vs a direct call $impl_function.
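The unpack step above can be modelled in a few lines. This is only a sketch: it assumes the callback code sits in the low 4 bits (as in the adapter pseudocode later in this issue), that EXIT is 0 (inferred from the eqz in the op list), and that the upper bits carry some payload such as a waitable-set index. The YIELD and WAIT values and the function names are hypothetical.

```python
# Assumed constants: EXIT = 0 is inferred from the eqz above;
# YIELD and WAIT are placeholder values for illustration only.
EXIT, YIELD, WAIT = 0, 1, 2

def unpack(packed: int) -> tuple[int, int]:
    """Split a packed i32 into (callback code, payload)."""
    packed &= 0xFFFFFFFF      # treat the value as an unsigned i32
    code = packed & 0xF       # i32.and
    payload = packed >> 4     # i32.shr_u
    return code, payload

def completed_synchronously(packed: int) -> bool:
    """True when the callee finished without needing the callback loop."""
    code, _ = unpack(packed)
    return code == EXIT       # i32.eqz + br_if
```

For a sync-completing function the packed value is a compile-time constant once the callee is inlined, which is what makes the later passes possible.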
Optimization passes needed (in priority order)
1. Inline adapter → [async-lift] entry
The adapter calls [async-lift] with a direct call instruction. Loom should inline this. After inlining, the adapter body contains the start_task call and the unpack logic.
Prerequisite: basic function inlining.
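A minimal sketch of what basic inlining could look like on a toy stack IR (tuples of opcode plus immediates). The Func type and inline_calls helper are hypothetical and not loom's API. The sketch assumes the callee body is straight-line code with no early return or branches, and that any extra callee locals are written before they are read.

```python
from dataclasses import dataclass

@dataclass
class Func:
    params: int   # number of parameters (locals 0..params-1)
    locals: int   # number of additional locals
    body: list    # list of instruction tuples, e.g. ("local.get", 0)

def inline_calls(caller: Func, funcs: dict, target: str) -> Func:
    """Return a copy of caller with each ("call", target) replaced by the callee body."""
    callee = funcs[target]
    out = []
    n_locals = caller.locals
    for ins in caller.body:
        if ins != ("call", target):
            out.append(ins)
            continue
        base = caller.params + n_locals           # first fresh local index
        n_locals += callee.params + callee.locals
        # Arguments are on the operand stack, last argument on top:
        # pop them into the fresh locals in reverse order.
        for i in reversed(range(callee.params)):
            out.append(("local.set", base + i))
        # Splice the callee body, shifting its local indices.
        for op, *args in callee.body:
            if op in ("local.get", "local.set", "local.tee"):
                out.append((op, args[0] + base))
            else:
                out.append((op, *args))
    return Func(caller.params, n_locals, out)
```

Each call site gets its own fresh locals, which keeps the rewrite simple at the cost of leaving cleanup to later passes.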
2. Inline start_task → callback dispatch
Inside [async-lift], start_task is called, which invokes the callback. The callback reference is stored in the indirect call table (element segment) at a CONSTANT index. Loom should:
- Trace the call_indirect to the element segment
- Determine the target is a constant function reference
- Replace call_indirect with a direct call
- Then inline the direct call
Prerequisite: indirect call devirtualization via element segment analysis.
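The devirtualization steps could be sketched as follows on a toy instruction list. The helper names are hypothetical. The sketch assumes the table is never mutated at runtime (no table.set, table.init, or external writers) and that the call_indirect type matches the target's signature; a real pass would have to verify both.

```python
def build_table(elem_segments, table_size):
    """Materialize initial table contents from (offset, [func names]) segments."""
    table = [None] * table_size
    for offset, func_names in elem_segments:
        for i, name in enumerate(func_names):
            table[offset + i] = name
    return table

def devirtualize(body, table, table_is_immutable=True):
    """Rewrite `i32.const K; call_indirect` into `call table[K]`."""
    if not table_is_immutable:
        return list(body)   # cannot trust the initial contents, leave as-is
    out = []
    for ins in body:
        if ins[0] == "call_indirect" and out and out[-1][0] == "i32.const":
            idx = out[-1][1]
            if 0 <= idx < len(table) and table[idx] is not None:
                out.pop()                         # drop the constant index push
                out.append(("call", table[idx]))  # direct call to the known target
                continue
        out.append(ins)
    return out
```

Once the call is direct, the inlining pass from step 1 can be applied to it unchanged.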
3. Inline callback → actual computation
The [callback] function dispatches to the real computation based on the event code. For the STARTED event (first call), it runs the computation and calls task.return. Loom should inline through this dispatch.
After passes 1-3, the adapter body contains the actual computation inline. The async wrapper is eliminated.
4. Eliminate dead callback loop
After inlining, the adapter's loop structure is dead code — the inlined [async-lift] always returns EXIT for sync-completing functions. Loom should:
- Determine the EXIT branch is always taken (constant propagation through the inlined code)
- Eliminate the loop and WAIT/YIELD branches
- Result: straight-line code
Prerequisite: constant propagation + dead branch elimination.
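A toy illustration of how constant propagation and dead branch elimination combine here. The expression and statement shapes are invented for this sketch; strings stand for values that are unknown at compile time, and EXIT is assumed to be 0.

```python
def fold(expr):
    """Constant-fold a tiny expression tree: ints, names, and (op, args...) tuples."""
    if isinstance(expr, (int, str)):
        return expr                       # constant or unknown variable
    op, *args = expr
    args = [fold(a) for a in args]
    if all(isinstance(a, int) for a in args):
        if op == "and":
            return args[0] & args[1]
        if op == "shr_u":
            return (args[0] & 0xFFFFFFFF) >> (args[1] % 32)
        if op == "eqz":
            return int(args[0] == 0)
    return (op, *args)

def prune(stmts):
    """Replace each `if` whose condition folds to a constant by the taken branch."""
    out = []
    for s in stmts:
        if s[0] == "if":
            _, cond, then_body, else_body = s
            cond = fold(cond)
            if isinstance(cond, int):
                out.extend(prune(then_body if cond != 0 else else_body))
            else:
                out.append(("if", cond, prune(then_body), prune(else_body)))
        else:
            out.append(s)
    return out
```

With the inlined return value known to be EXIT, the condition `eqz(packed & 0xF)` folds to 1 and only the straight-line branch survives.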
5. Eliminate task.return shim round-trip
The task.return shim writes the result to a global (scalar) or memory (compound). The adapter reads it immediately after. Loom should:
- For globals: eliminate the global.set + global.get pair, keep value on stack
- For memory: eliminate the store + load pair (memory-to-register promotion)
Prerequisite: store-load forwarding / copy propagation.
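For the scalar (global) case, the forwarding can be sketched as a peephole over a toy instruction list. The function name and the other_readers flag are hypothetical. The sketch assumes the pair is adjacent after inlining and that nothing else reads the shim global; otherwise the store has to stay.

```python
def forward_global(body, name, other_readers=False):
    """Cancel adjacent `global.set name; global.get name` pairs."""
    if other_readers:
        return list(body)   # the store is observable elsewhere, keep it
    out = []
    for ins in body:
        if ins == ("global.get", name) and out and out[-1] == ("global.set", name):
            out.pop()       # set pops the value, get pushes it back: net no-op
            continue
        out.append(ins)
    return out
```

Removing the pair is sound because global.set consumes the value and global.get immediately reproduces it, so the operand stack is identical with both gone.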
6. Eliminate start_task overhead
After inlining, the start_task setup code (waitable-set allocation, context slot writes) is dead — the task completes synchronously and the waitable set is never polled. Loom should eliminate this dead initialization.
Prerequisite: dead store elimination + escape analysis (the waitable set handle doesn't escape).
Expected result after all passes
Before (meld output):
adapter(n) {
    packed = call [async-lift](n)       // → start_task → callback → impl(n) → task.return(result) → EXIT
    code = packed & 0xF
    if code != EXIT { ... loop ... }    // dead branch
    result = global.get $shim_result    // shim wrote this
    return result
}
After (loom optimized):
adapter(n) {
    return collatz_impl(n)    // direct call, zero overhead
}
Passes that benefit ALL fused adapters (not just async)
Notes for implementation
- Passes 1-3 require an inlining framework with indirect call devirtualization
- Passes 4-6 are standard compiler optimizations (DCE, constant prop, store-load forwarding)
- The async pattern is deterministic — wit-bindgen always generates the same structure
- Testing: compare fused+optimized output against unfused wasmtime FACT adapter performance
- All passes must preserve multi-memory semantics (memory indices are static immediates)