
Metal: embed() hangs sporadically on Apple M5 Max (~10-30% of calls) #595

@jowitte

Description

Bug

embed() hangs sporadically on Apple M5 Max with Metal backend. The same query succeeds in ~0.7s on one call and hangs indefinitely (99% CPU on one core) on the next. The behavior is not query-dependent — identical inputs produce different outcomes. CPU-only mode (gpu: false) is 100% stable but ~50x slower.

Reproduction

Minimal test: run the same embedding call 10 times sequentially. Expect 1-3 hangs per 10 runs.

import { getLlama } from "node-llama-cpp";

const llama = await getLlama(); // Metal auto-detected
const model = await llama.loadModel({
  modelPath: "nomic-embed-text-v2-moe.Q8_0.gguf"  // 488 MB MoE embedding model
});
const ctx = await model.createEmbeddingContext();

for (let i = 0; i < 10; i++) {
  console.time(`run ${i}`);
  const vec = await ctx.getEmbeddingFor("test query about delegation");
  console.timeEnd(`run ${i}`);  // ~0.7s when it works, never completes when it hangs
}

In practice, observed via qmd CLI which calls embed() for vector search:

# Run 10 sequential vsearch calls, 8s timeout each:
export GGML_METAL_NO_RESIDENCY=1
for i in $(seq 1 10); do
  bun dist/cli/qmd.js vsearch "Delegieren" -n 3 &
  pid=$!; sleep 8
  kill -0 $pid 2>/dev/null && { kill $pid; echo "Run $i: HANG"; } || echo "Run $i: OK"
done

# Typical result: 9 OK, 1 HANG (without GGML_METAL_NO_RESIDENCY: 5-7 OK, 3-5 HANG)

Workaround

GGML_METAL_NO_RESIDENCY=1 reduces hang rate from ~30-50% to ~10% but does not eliminate it. This env var disables Metal residency sets for buffer allocation (related to llama.cpp autorelease/buffer management, see ggml-org/llama.cpp#18568).

What doesn't help

  • Upgrading llama.cpp: Rebuilt with b8783 (from b8390 in 3.18.1) — hang rate actually increased to ~40%
  • GGML_METAL_TENSOR_DISABLE=1: No improvement, possibly worse
  • Both flags combined: Worse than GGML_METAL_NO_RESIDENCY alone
  • AbortSignal: signal parameter on prompt() is not respected during native Metal compute — the call blocks the event loop
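To illustrate the AbortSignal point: because the native call blocks the event loop, in-process timers never fire either, so the only reliable watchdog is to run the embedding in a child process and kill it from outside (which is effectively what the shell loop above does). A minimal sketch; the `runWithTimeout` helper and the worker script name are hypothetical, not part of node-llama-cpp:

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: run a command in a subprocess and kill it if it
// does not exit within `ms` milliseconds. Timers in *this* process keep
// firing even while the child's event loop is blocked by native compute.
function runWithTimeout(
  cmd: string,
  args: string[],
  ms: number
): Promise<"ok" | "hang"> {
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { stdio: "ignore" });
    const timer = setTimeout(() => {
      child.kill("SIGKILL"); // the hung native call never yields, so kill the whole process
      resolve("hang");
    }, ms);
    child.on("exit", () => {
      clearTimeout(timer);
      resolve("ok");
    });
  });
}

// Usage sketch: "embed-worker.js" would contain the repro loop above.
// const result = await runWithTimeout("bun", ["embed-worker.js"], 8_000);
```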

Observations

  • The hang occurs in the native Metal compute pipeline, not in JS
  • Process shows 99-100% CPU on one core during hang
  • No crash, no error, no output — just infinite computation
  • The model (nomic-embed-text-v2-moe) is a Mixture-of-Experts architecture — MoE routing may trigger different Metal kernel paths
  • First call after process start seems slightly more likely to hang (cold start)
  • Concurrent embed() calls guarantee deadlock (separate known issue)
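On the concurrency point, a common mitigation (a generic sketch, not a node-llama-cpp API) is to funnel all embed() calls through a one-at-a-time promise queue, so two calls can never reach the Metal backend concurrently:

```typescript
// Generic one-at-a-time queue (hypothetical helper, not part of
// node-llama-cpp): each task starts only after the previous one settles.
function makeSerializer() {
  let tail: Promise<unknown> = Promise.resolve();
  return function serialize<T>(task: () => Promise<T>): Promise<T> {
    const run = tail.then(task, task); // start after the previous task settles
    tail = run.catch(() => {});        // swallow errors so the chain continues
    return run;
  };
}

// Usage sketch:
// const serialize = makeSerializer();
// const vec = await serialize(() => ctx.getEmbeddingFor("test query"));
```

Note this only prevents the concurrent-call deadlock; it does not help with the sporadic single-call hang described above.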

Environment

  • node-llama-cpp: 3.18.1
  • llama.cpp: b8390 (prebuilt)
  • Hardware: Apple M5 Max, 128 GB RAM
  • OS: macOS 26.4.1 (Tahoe)
  • Runtime: Bun 1.3.12 (also tested with Node 25.9.0 — same behavior)
  • Model: nomic-embed-text-v2-moe Q8_0 (488 MB, 768 dimensions, MoE)
