Optimize meld-fused async callback adapter patterns #70

@avrabe

Context

meld fuses P3 async cross-component calls through a callback-driving adapter. The adapter calls the [async-lift] entry, drives the [callback] loop via waitable-set-poll, and reads the result from a task.return shim. This is correct, but it leaves optimization opportunities that loom should exploit.

Relates to #68 (cross-component optimization passes).

Current adapter overhead (unoptimized)

For async functions that complete synchronously (the common case: pure compute such as fibonacci or collatz-steps), the callback loop runs zero iterations. The remaining overhead is:

call [async-lift]         ;; 1 function call (wraps start_task → callback → impl)
unpack EXIT               ;; ~5 i32 ops (and, shr_u, eqz, br_if)
read result from shim     ;; 1 global.get or i32.load
return                    ;; 0 (value already on stack)

That is ~10 instructions of overhead compared with a direct call $impl_function.
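To make the "unpack EXIT" step concrete, here is a minimal Rust sketch of the split the adapter performs on the packed return value. The low-nibble mask and the shift mirror the `and` / `shr_u` ops listed above; the numeric code values (EXIT = 0, YIELD = 1, WAIT = 2) and the names `CallbackCode` and `unpack` are assumptions for illustration and should be checked against the canonical ABI version meld targets.

```rust
/// Callback codes as assumed for this sketch (EXIT = 0, YIELD = 1, WAIT = 2).
/// These numeric values are an assumption, not taken from the issue.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum CallbackCode {
    Exit,
    Yield,
    Wait,
    /// Any code this sketch does not model.
    Other(u32),
}

/// Split a packed i32 into (code, payload).
/// The low 4 bits carry the code; the remaining bits carry the payload
/// (for WAIT, the waitable-set index).
fn unpack(packed: u32) -> (CallbackCode, u32) {
    let code = match packed & 0xF {
        0 => CallbackCode::Exit,
        1 => CallbackCode::Yield,
        2 => CallbackCode::Wait,
        other => CallbackCode::Other(other),
    };
    (code, packed >> 4)
}
```

For a sync-completing function the packed value decodes to `Exit`, so everything downstream of this check is a candidate for constant folding.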

Optimization passes needed (in priority order)

1. Inline adapter → [async-lift] entry

The adapter calls [async-lift] with a direct call instruction. Loom should inline this. After inlining, the adapter body contains the start_task call and the unpack logic.

Prerequisite: basic function inlining.

2. Inline start_task → callback dispatch

Inside [async-lift], start_task is called, which in turn invokes the callback through call_indirect. The callback reference is stored in the indirect call table (populated by an element segment) at a constant index. Loom should:

  • Trace the call_indirect to the element segment
  • Determine the target is a constant function reference
  • Replace call_indirect with a direct call
  • Then inline the direct call

Prerequisite: indirect call devirtualization via element segment analysis.
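The devirtualization steps above can be sketched on a toy instruction list. This is not loom's IR: `ElemSegment`, `Instr`, `resolve`, and `devirtualize` are hypothetical names, and the sketch assumes the table is never mutated at runtime (no table.set, not imported or exported) and that the resolved function's signature matches the call_indirect type.

```rust
/// Hypothetical, simplified view of an active element segment.
struct ElemSegment {
    table: u32,
    /// Constant offset expression, already evaluated.
    offset: u32,
    /// Function indices placed at offset, offset + 1, ...
    funcs: Vec<u32>,
}

#[derive(Debug, PartialEq, Eq, Clone)]
enum Instr {
    I32Const(u32),
    CallIndirect { table: u32 },
    Call(u32),
}

/// Look up which function a constant table index refers to.
/// Later segments overwrite earlier ones, so search in reverse.
fn resolve(segments: &[ElemSegment], table: u32, index: u32) -> Option<u32> {
    segments.iter().rev().find_map(|seg| {
        if seg.table != table {
            return None;
        }
        let slot = index.checked_sub(seg.offset)? as usize;
        seg.funcs.get(slot).copied()
    })
}

/// Rewrite `i32.const N; call_indirect` into `call F` when N resolves
/// to a known function. Unresolved calls are left untouched.
fn devirtualize(code: &[Instr], segments: &[ElemSegment]) -> Vec<Instr> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        if let (Instr::I32Const(idx), Some(Instr::CallIndirect { table })) =
            (&code[i], code.get(i + 1))
        {
            if let Some(func) = resolve(segments, *table, *idx) {
                out.push(Instr::Call(func));
                i += 2;
                continue;
            }
        }
        out.push(code[i].clone());
        i += 1;
    }
    out
}
```

Once the call is direct, the ordinary inliner from pass 1 can handle it without any special casing.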

3. Inline callback → actual computation

The [callback] function dispatches to the real computation based on the event code. For the STARTED event (first call), it runs the computation and calls task.return. Loom should inline through this dispatch.

After passes 1-3, the adapter body contains the actual computation inline. The async wrapper is eliminated.

4. Eliminate dead callback loop

After inlining, the adapter's loop structure is dead code — the inlined [async-lift] always returns EXIT for sync-completing functions. Loom should:

  • Determine the EXIT branch is always taken (constant propagation through the inlined code)
  • Eliminate the loop and WAIT/YIELD branches
  • Result: straight-line code

Prerequisite: constant propagation + dead branch elimination.
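As a rough illustration of how constant propagation exposes the dead loop, the sketch below folds a branch whose condition becomes fully constant after inlining. The tiny `Expr` / `Stmt` IR and the `eval` / `fold` helpers are hypothetical stand-ins, not loom's data structures, and expressions are assumed to be side-effect free so a folded condition can simply be dropped.

```rust
#[derive(Debug, PartialEq, Eq, Clone)]
enum Expr {
    Const(u32),
    /// A value not known at compile time.
    Local(u32),
    And(Box<Expr>, Box<Expr>),
    Ne(Box<Expr>, Box<Expr>),
}

#[derive(Debug, PartialEq, Eq, Clone)]
enum Stmt {
    Op(&'static str),
    If { cond: Expr, then: Vec<Stmt>, otherwise: Vec<Stmt> },
}

/// Evaluate an expression if every leaf is a constant.
fn eval(e: &Expr) -> Option<u32> {
    match e {
        Expr::Const(c) => Some(*c),
        Expr::Local(_) => None,
        Expr::And(a, b) => Some(eval(a)? & eval(b)?),
        Expr::Ne(a, b) => Some((eval(a)? != eval(b)?) as u32),
    }
}

/// Replace every `if` whose condition is constant with the taken arm.
fn fold(body: Vec<Stmt>) -> Vec<Stmt> {
    let mut out = Vec::new();
    for stmt in body {
        match stmt {
            Stmt::If { cond, then, otherwise } => match eval(&cond) {
                Some(0) => out.extend(fold(otherwise)),
                Some(_) => out.extend(fold(then)),
                None => out.push(Stmt::If {
                    cond,
                    then: fold(then),
                    otherwise: fold(otherwise),
                }),
            },
            other => out.push(other),
        }
    }
    out
}
```

With the packed value known to be EXIT, a condition of the form `(packed & 0xF) != EXIT` evaluates to 0, so the arm holding the callback loop disappears and only the straight-line tail remains.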

5. Eliminate task.return shim round-trip

The task.return shim writes the result to a global (scalar) or memory (compound). The adapter reads it immediately after. Loom should:

  • For globals: eliminate the global.set + global.get pair, keep value on stack
  • For memory: eliminate the store + load pair (memory-to-register promotion)

Prerequisite: store-load forwarding / copy propagation.
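For the scalar (global) case, the forwarding can be sketched as a peephole over a flat instruction list. The `Insn` type and `forward_global` function are hypothetical; the sketch assumes `code` is the only code that touches the shim global (for example, the global is not exported and no other function reads it), which is what makes dropping the store safe.

```rust
#[derive(Debug, PartialEq, Eq, Clone)]
enum Insn {
    GlobalSet(u32),
    GlobalGet(u32),
    Other(&'static str),
}

/// Remove an adjacent `global.set g; global.get g` pair so the value
/// simply stays on the operand stack. Only applied when the pair's
/// `global.get` is the sole read of `g` in `code`; otherwise the store
/// is still observable and the code is returned unchanged.
fn forward_global(code: &[Insn], global: u32) -> Vec<Insn> {
    let reads = code
        .iter()
        .filter(|insn| **insn == Insn::GlobalGet(global))
        .count();
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        let is_pair = code[i] == Insn::GlobalSet(global)
            && code.get(i + 1) == Some(&Insn::GlobalGet(global));
        if is_pair && reads == 1 {
            i += 2; // skip both; the stored value is already on the stack
            continue;
        }
        out.push(code[i].clone());
        i += 1;
    }
    out
}
```

When other reads of the global exist, a more conservative variant would keep the `global.set` and forward the value through a temporary local instead; the memory (compound) case needs alias information that this sketch does not model.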

6. Eliminate start_task overhead

After inlining, the start_task setup code (waitable-set allocation, context slot writes) is dead — the task completes synchronously and the waitable set is never polled. Loom should eliminate this dead initialization.

Prerequisite: dead store elimination + escape analysis (the waitable set handle doesn't escape).

Expected result after all passes

Before (meld output):
  adapter(n) {
    packed = call [async-lift](n)      // → start_task → callback → impl(n) → task.return(result) → EXIT
    code = packed & 0xF
    if code != EXIT { ... loop ... }   // dead branch
    result = global.get $shim_result   // shim wrote this
    return result
  }

After (loom optimized):
  adapter(n) {
    return collatz_impl(n)             // direct call, zero overhead
  }

Passes that benefit ALL fused adapters (not just async)

Notes for implementation

  • Passes 1-3 require an inlining framework with indirect call devirtualization
  • Passes 4-6 are standard compiler optimizations (DCE, constant prop, store-load forwarding)
  • The async pattern is deterministic — wit-bindgen always generates the same structure
  • Testing: compare fused+optimized output against unfused wasmtime FACT adapter performance
  • All passes must preserve multi-memory semantics (memory indices are static immediates)
