fix(webgpu): stabilize qwen streaming and multimodal fallback #4

Open
leehack wants to merge 1 commit into main from fix/webgpu-utf8-streaming
Conversation


leehack commented Mar 8, 2026

This pull request improves llama_webgpu_bridge.js and llama_webgpu_core.cpp in three areas: more robust handling of streamed text output, safer CPU fallback logic for multimodal models, and better normalization of media markers. Together, the changes improve stability, correctness, and compatibility, especially when streaming text from multimodal models.

Text output and streaming stability

  • Added the trimUnstableUtf8Tail function to ensure streamed text does not end with incomplete or unstable UTF-8 sequences, improving the reliability of token emission during generation. (js/llama_webgpu_bridge.js)
  • Refactored the streaming logic to track the stable emitted text and emit only new, stable segments, preventing duplicate or partial token emissions. (js/llama_webgpu_bridge.js)
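The idea behind the UTF-8 tail trimming can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the function name comes from the description above, but the byte-level logic here is an assumption about how such a helper typically works (hold back bytes that begin a multi-byte sequence that is not yet complete).

```javascript
// Sketch: return the prefix of `bytes` that is guaranteed to decode to
// stable UTF-8, dropping a trailing incomplete multi-byte sequence.
function trimUnstableUtf8Tail(bytes) {
  const end = bytes.length;
  // Walk back over up to three continuation bytes (0b10xxxxxx).
  let i = end - 1;
  while (i >= 0 && i >= end - 4 && (bytes[i] & 0xc0) === 0x80) i--;
  if (i < 0) return bytes.subarray(0, 0); // only continuation bytes so far
  // Determine how many bytes the final sequence's lead byte promises.
  const lead = bytes[i];
  let needed = 1; // ASCII by default
  if ((lead & 0xe0) === 0xc0) needed = 2;
  else if ((lead & 0xf0) === 0xe0) needed = 3;
  else if ((lead & 0xf8) === 0xf0) needed = 4;
  // Keep everything if the final sequence is complete; otherwise drop it.
  return end - i >= needed ? bytes.subarray(0, end) : bytes.subarray(0, i);
}
```

A streaming loop would decode only the trimmed prefix each step and carry the held-back bytes into the next chunk, so the UI never sees a replacement character from a half-received code point.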

Multimodal model CPU fallback and option sanitization

  • Introduced _createCpuSafeMultimodalLoadOptions to sanitize model loading options for CPU fallback, limiting context size, thread count, and batch size for safer operation. (js/llama_webgpu_bridge.js)
  • Updated model loading and fallback logic to use the new CPU-safe options and to better detect when a CPU fallback is necessary, warning the user when it happens. (js/llama_webgpu_bridge.js)
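A sanitizer of this kind might look like the sketch below. The option names (`nCtx`, `nThreads`, `nBatch`, `useGpu`) and the caps are illustrative assumptions, not the values used by the PR:

```javascript
// Sketch: clamp load options to conservative values before falling
// back to the CPU, so a large multimodal config cannot exhaust memory.
function createCpuSafeMultimodalLoadOptions(options) {
  const safe = { ...options };
  safe.nCtx = Math.min(safe.nCtx ?? 4096, 4096);    // cap context size
  safe.nThreads = Math.min(safe.nThreads ?? 4, 4);  // cap thread count
  safe.nBatch = Math.min(safe.nBatch ?? 256, 256);  // cap batch size
  safe.useGpu = false;                              // force CPU backend
  return safe;
}
```

Copying the options object first keeps the original (GPU-path) options intact in case the caller retries on WebGPU later.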

Media marker normalization

  • Enhanced media marker normalization to handle additional vision-related markers, improving prompt preprocessing for multimodal models. (src/llama_webgpu_core.cpp)
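The actual normalization lives in C++ (src/llama_webgpu_core.cpp); the JavaScript sketch below only illustrates the shape of such a pass. The specific marker strings (Qwen-style `<|vision_start|>` / `<|image_pad|>` / `<|vision_end|>` spans, plain `<image>` tags, and a canonical `<__media__>` marker) are assumptions for the example:

```javascript
// Sketch: collapse assorted vision markers onto one canonical media
// marker so downstream prompt processing sees a single token shape.
const MEDIA_MARKER = "<__media__>";

function normalizeMediaMarkers(prompt) {
  // A Qwen-style vision span becomes a single canonical marker.
  let out = prompt.replace(
    /<\|vision_start\|>(?:<\|image_pad\|>)*<\|vision_end\|>/g,
    MEDIA_MARKER
  );
  // Plain image tags map to the same marker.
  out = out.replace(/<image>|<img>/g, MEDIA_MARKER);
  return out;
}
```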

Token handling and generation correctness

  • Changed token handling to end generation when a control token is encountered and to avoid emitting control tokens in the output, improving correctness and preventing unwanted artifacts in generated text. (src/llama_webgpu_core.cpp)
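The rule can be summarized in a small sketch. The real change is in C++; here the logic is shown in JavaScript with a hypothetical `vocab` helper (its `isControl` / `toText` methods are assumptions for the illustration):

```javascript
// Sketch: a control token ends generation and is never emitted;
// any other token is detokenized and streamed to the caller.
function handleToken(token, vocab, emit) {
  if (vocab.isControl(token)) {
    return false; // stop generating, emit nothing
  }
  emit(vocab.toText(token));
  return true; // keep generating
}
```

Treating control tokens as hard stops (rather than emitting their text form) is what prevents artifacts such as end-of-generation markers leaking into the visible output.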

