
Metal: embed() hangs sporadically on Apple M5 Max (~10-30% of calls) #595

@jowitte

Description

Bug

embed() hangs sporadically on Apple M5 Max with Metal backend. The same query succeeds in ~0.7s on one call and hangs indefinitely (99% CPU on one core) on the next. The behavior is not query-dependent — identical inputs produce different outcomes. CPU-only mode (gpu: false) is 100% stable but ~50x slower.

Reproduction

Minimal test: run the same embedding call 10 times sequentially. Expect 1-3 hangs per 10 runs.

import { getLlama } from "node-llama-cpp";

const llama = await getLlama(); // Metal auto-detected
const model = await llama.loadModel({
  modelPath: "nomic-embed-text-v2-moe.Q8_0.gguf"  // 488 MB MoE embedding model
});
const ctx = await model.createEmbeddingContext();

for (let i = 0; i < 10; i++) {
  console.time(`run ${i}`);
  const vec = await ctx.getEmbeddingFor("test query about delegation");
  console.timeEnd(`run ${i}`);  // ~0.7s when it works, never completes when it hangs
}

In practice, observed via qmd CLI which calls embed() for vector search:

# Run 10 sequential vsearch calls, 8s timeout each:
export GGML_METAL_NO_RESIDENCY=1
for i in $(seq 1 10); do
  bun dist/cli/qmd.js vsearch "Delegieren" -n 3 &
  pid=$!; sleep 8
  kill -0 $pid 2>/dev/null && { kill $pid; echo "Run $i: HANG"; } || echo "Run $i: OK"
done

# Typical result: 9 OK, 1 HANG (without GGML_METAL_NO_RESIDENCY: 5-7 OK, 3-5 HANG)

Workaround

GGML_METAL_NO_RESIDENCY=1 reduces hang rate from ~30-50% to ~10% but does not eliminate it. This env var disables Metal residency sets for buffer allocation (related to llama.cpp autorelease/buffer management, see ggml-org/llama.cpp#18568).

What doesn't help

  • Upgrading llama.cpp: Rebuilt with b8783 (from b8390 in 3.18.1) — hang rate actually increased to ~40%
  • GGML_METAL_TENSOR_DISABLE=1: No improvement, possibly worse
  • Both flags combined: Worse than GGML_METAL_NO_RESIDENCY alone
  • AbortSignal: signal parameter on prompt() is not respected during native Metal compute — the call blocks the event loop
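To illustrate the AbortSignal point: because the native call blocks the event loop, in-process timers never fire either, so the only reliable watchdog is to run the embedding in a child process and kill it from outside (which is effectively what the shell loop above does). A minimal sketch; the `runWithTimeout` helper and the worker script name are hypothetical, not part of node-llama-cpp:

```typescript
import { spawn } from "node:child_process";

// Hypothetical helper: run a command in a subprocess and kill it if it
// does not exit within `ms` milliseconds. Timers in *this* process keep
// firing even while the child's event loop is blocked by native compute.
function runWithTimeout(
  cmd: string,
  args: string[],
  ms: number
): Promise<"ok" | "hang"> {
  return new Promise((resolve) => {
    const child = spawn(cmd, args, { stdio: "ignore" });
    const timer = setTimeout(() => {
      child.kill("SIGKILL"); // the hung native call never yields, so kill the whole process
      resolve("hang");
    }, ms);
    child.on("exit", () => {
      clearTimeout(timer);
      resolve("ok");
    });
  });
}

// Usage sketch: "embed-worker.js" would contain the repro loop above.
// const result = await runWithTimeout("bun", ["embed-worker.js"], 8_000);
```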

Observations

  • The hang occurs in the native Metal compute pipeline, not in JS
  • Process shows 99-100% CPU on one core during hang
  • No crash, no error, no output — just infinite computation
  • The model (nomic-embed-text-v2-moe) is a Mixture-of-Experts architecture — MoE routing may trigger different Metal kernel paths
  • First call after process start seems slightly more likely to hang (cold start)
  • Concurrent embed() calls guarantee deadlock (separate known issue)
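On the concurrency point, a common mitigation (a generic sketch, not a node-llama-cpp API) is to funnel all embed() calls through a one-at-a-time promise queue, so two calls can never reach the Metal backend concurrently:

```typescript
// Generic one-at-a-time queue (hypothetical helper, not part of
// node-llama-cpp): each task starts only after the previous one settles.
function makeSerializer() {
  let tail: Promise<unknown> = Promise.resolve();
  return function serialize<T>(task: () => Promise<T>): Promise<T> {
    const run = tail.then(task, task); // start after the previous task settles
    tail = run.catch(() => {});        // swallow errors so the chain continues
    return run;
  };
}

// Usage sketch:
// const serialize = makeSerializer();
// const vec = await serialize(() => ctx.getEmbeddingFor("test query"));
```

Note this only prevents the concurrent-call deadlock; it does not help with the sporadic single-call hang described above.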

Environment

  • node-llama-cpp: 3.18.1
  • llama.cpp: b8390 (prebuilt)
  • Hardware: Apple M5 Max, 128 GB RAM
  • OS: macOS 26.4.1 (Tahoe)
  • Runtime: Bun 1.3.12 (also tested with Node 25.9.0 — same behavior)
  • Model: nomic-embed-text-v2-moe Q8_0 (488 MB, 768 dimensions, MoE)
