Optimize meld-fused async callback adapter patterns

## Context

meld fuses P3 async cross-component calls using a callback-driving adapter that calls the `[async-lift]` entry, drives the `[callback]` loop via `waitable-set-poll`, and reads results from a `task.return` shim. This is correct, but it leaves optimization opportunities that loom should exploit.

Relates to #68 (cross-component optimization passes).

## Current adapter overhead (unoptimized)

For synchronous-completing async functions (the common case — pure compute like fibonacci, collatz-steps), the callback loop runs zero iterations. The overhead is:

```
call [async-lift]         ;; 1 function call (wraps start_task → callback → impl)
unpack EXIT               ;; ~5 i32 ops (and, shr_u, eqz, br_if)
read result from shim     ;; 1 global.get or i32.load
return                    ;; 0 (value already on stack)
```

That is roughly 10 instructions of overhead compared with a direct `call $impl_function`.
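The unpack step above can be modelled in a few lines of Python. This is a minimal sketch, not meld's actual code: the constant values for `EXIT`, `YIELD` and `WAIT`, and the assumption that the bits above the low nibble carry a waitable-set index, are illustrative and should be checked against the Component Model version meld targets.

```python
# Assumed callback-code values (illustrative, not taken from meld).
EXIT, YIELD, WAIT = 0, 1, 2


def unpack(packed):
    """Split a packed i32 into (callback code, waitable-set index)."""
    code = packed & 0xF          # i32.and: low 4 bits hold the callback code
    waitable_set = packed >> 4   # i32.shr_u: remaining bits (assumed: set index)
    return code, waitable_set


def completed_synchronously(packed):
    """True when the callback loop would run zero iterations."""
    return unpack(packed)[0] == EXIT
```

For a sync-completing function the adapter only ever sees `packed == 0` under these assumptions, which is what makes the later constant-propagation pass possible.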

## Optimization passes needed (in priority order)

### 1. Inline adapter → [async-lift] entry
The adapter calls `[async-lift]` with a direct `call` instruction. Loom should inline this. After inlining, the adapter body contains the `start_task` call and the unpack logic.

**Prerequisite:** basic function inlining.
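As a rough illustration of what basic inlining involves, the sketch below splices a callee body into a caller on a toy tuple-based IR. The IR shape, the `$inlined` label and the function names are hypothetical and not loom's representation. The sketch also ignores block result types and re-zeroing of callee locals, which a real pass has to handle.

```python
LOCAL_OPS = ("local.get", "local.set", "local.tee")


def inline_calls(caller, callee_name, callee):
    """Inline every `call callee_name` in caller.

    Functions are dicts with 'params', 'locals' (count, params included)
    and 'body' (list of instruction tuples).
    """
    base = caller["locals"]  # first fresh local index in the caller
    remapped = []
    for op, *args in callee["body"]:
        if op in LOCAL_OPS:
            remapped.append((op, args[0] + base))   # shift callee locals
        elif op == "return":
            remapped.append(("br", "$inlined"))     # return becomes a branch out
        else:
            remapped.append((op, *args))
    # Arguments sit on the stack with the last one on top, so pop in reverse.
    prologue = [("local.set", base + i) for i in reversed(range(callee["params"]))]
    body = []
    for instr in caller["body"]:
        if instr == ("call", callee_name):
            body += prologue + [("block", "$inlined")] + remapped + [("end",)]
        else:
            body.append(instr)
    return {"params": caller["params"], "locals": base + callee["locals"], "body": body}
```

Wrapping the inlined body in a block keeps early returns in the callee from escaping the caller.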

### 2. Inline start_task → callback dispatch
Inside `[async-lift]`, `start_task` is called, which in turn invokes the callback. The callback reference is stored in the indirect call table (element segment) at a constant index. Loom should:
- Trace the `call_indirect` to the element segment
- Determine the target is a constant function reference
- Replace `call_indirect` with a direct `call`
- Then inline the direct call

**Prerequisite:** indirect call devirtualization via element segment analysis.
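The four steps above can be sketched as a peephole rewrite over a toy instruction list. This is an illustration only: the tuple IR and the `elem_segment` dict are hypothetical, and the sketch assumes the table is never mutated at runtime and that the signature check performed by `call_indirect` would succeed.

```python
def devirtualize(body, elem_segment):
    """Replace `i32.const k; call_indirect` with `call f` when k maps to f.

    elem_segment: dict from table index to function name (assumed immutable).
    """
    out = []
    for instr in body:
        if (instr[0] == "call_indirect"
                and out
                and out[-1][0] == "i32.const"
                and out[-1][1] in elem_segment):
            # The table index is the topmost operand, so the constant pushed
            # immediately before the call_indirect selects the target.
            index = out.pop()[1]
            out.append(("call", elem_segment[index]))
        else:
            out.append(instr)
    return out
```

Calls whose index is not a known constant are left untouched, so the rewrite stays conservative.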

### 3. Inline callback → actual computation
The `[callback]` function dispatches to the real computation based on the event code. For the STARTED event (first call), it runs the computation and calls `task.return`. Loom should inline through this dispatch.

After passes 1-3, the adapter body contains the actual computation inline. The async wrapper is eliminated.

### 4. Eliminate dead callback loop
After inlining, the adapter's loop structure is dead code — the inlined [async-lift] always returns EXIT for sync-completing functions. Loom should:
- Determine the EXIT branch is always taken (constant propagation through the inlined code)
- Eliminate the loop and WAIT/YIELD branches
- Result: straight-line code

**Prerequisite:** constant propagation + dead branch elimination.
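A minimal sketch of the constant propagation and dead branch elimination described above, on a toy structured IR. The expression encoding, the `known` map and the `EXIT = 0` value are assumptions made for illustration; loom's real IR will differ.

```python
EXIT = 0  # assumed callback-code value


def fold(expr, known):
    """Return an int if expr is constant under `known`, else None."""
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):          # a variable name
        return known.get(expr)
    op, lhs, rhs = expr
    a, b = fold(lhs, known), fold(rhs, known)
    if a is None or b is None:
        return None
    return {"and": a & b, "ne": int(a != b), "eq": int(a == b)}[op]


def prune(stmts, known):
    """Drop `if` arms whose condition folds to a constant."""
    out = []
    for stmt in stmts:
        if stmt[0] == "if":
            _, cond, then_body, else_body = stmt
            value = fold(cond, known)
            if value is None:
                out.append(("if", cond, prune(then_body, known), prune(else_body, known)))
            else:
                out.extend(prune(then_body if value else else_body, known))
        else:
            out.append(stmt)
    return out
```

Once inlining proves the packed value is always `EXIT`, the `if code != EXIT` arm folds away and only the straight-line tail remains.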

### 5. Eliminate task.return shim round-trip
The task.return shim writes the result to a global (scalar) or memory (compound). The adapter reads it immediately after. Loom should:
- For globals: eliminate the global.set + global.get pair, keep value on stack
- For memory: eliminate the store + load pair (memory-to-register promotion)

**Prerequisite:** store-load forwarding / copy propagation.
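For the scalar (global) case, the forwarding can be sketched as a peephole over a toy instruction list. The IR is hypothetical, and the sketch assumes the shim global is not read by any other function in the module. It only fires when the adjacent `global.get` is the sole read in the body.

```python
def forward_global(body, shim_global):
    """Remove an adjacent `global.set g; global.get g` pair.

    Safe only under the stated assumption that nothing else reads g.
    """
    setter = ("global.set", shim_global)
    getter = ("global.get", shim_global)
    reads = sum(1 for instr in body if instr == getter)
    out, i = [], 0
    while i < len(body):
        if reads == 1 and body[i:i + 2] == [setter, getter]:
            i += 2                      # the value simply stays on the operand stack
        else:
            out.append(body[i])
            i += 1
    return out
```

The memory (compound) case needs alias information about the result buffer and is not covered by this sketch.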

### 6. Eliminate start_task overhead
After inlining, the `start_task` setup code (waitable-set allocation, context slot writes) is dead — the task completes synchronously and the waitable set is never polled. Loom should eliminate this dead initialization.

**Prerequisite:** dead store elimination + escape analysis (the waitable set handle doesn't escape).

## Expected result after all passes

```
Before (meld output):
  adapter(n) {
    packed = call [async-lift](n)      // → start_task → callback → impl(n) → task.return(result) → EXIT
    code = packed & 0xF
    if code != EXIT { ... loop ... }   // dead branch
    result = global.get $shim_result   // shim wrote this
    return result
  }

After (loom optimized):
  adapter(n) {
    return collatz_impl(n)             // direct call, zero overhead
  }
```
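One way to sanity-check the transformation is to model both adapter shapes and compare their results. The Python below is a behavioural sketch only: `collatz_impl`, the module-level `shim_result` variable and `EXIT = 0` stand in for the real implementation, the shim global and the callback code, and are not meld or loom output.

```python
EXIT = 0          # assumed callback-code value
shim_result = 0   # stands in for the global written by the task.return shim


def collatz_impl(n):
    """Number of Collatz steps needed to reach 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


def async_lift(n):
    """Sync-completing entry: compute, 'task.return' the result, report EXIT."""
    global shim_result
    shim_result = collatz_impl(n)
    return EXIT


def adapter_before(n):
    packed = async_lift(n)
    code = packed & 0xF
    if code != EXIT:
        raise NotImplementedError("callback loop is not modelled here")
    return shim_result


def adapter_after(n):
    return collatz_impl(n)
```

A differential test over a range of inputs, comparing `adapter_before(n)` with `adapter_after(n)`, is a cheap first check before measuring performance.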

## Passes that benefit ALL fused adapters (not just async)

- **Adapter inlining** (#68 item 1.1): sync adapters that copy scalars between memories → eliminate copies
- **Function deduplication** (#68 item 2.2): shared library code across components
- **Dead code elimination** (#68 item 2.1): functions unused after fusion

## Notes for implementation

- Passes 1-3 require an inlining framework with indirect call devirtualization
- Passes 4-6 are standard compiler optimizations (DCE, constant prop, store-load forwarding)
- The async pattern is deterministic — wit-bindgen always generates the same structure
- Testing: compare fused+optimized output against unfused wasmtime FACT adapter performance
- All passes must preserve multi-memory semantics (memory indices are static immediates)
