[NPU Feature] Qwen3.5 NPU FLA and fused-operator patches by ys2025-AI · Pull Request #203 · modelscope/twinkle

ys2025-AI · 2026-05-24T14:41:29Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Description

This PR ports the Qwen3.5 FLA and Ascend NPU fused-operator patches.

Motivation

Qwen3.5 introduces hybrid attention layers (linear_attention + full_attention). The linear_attention path relies on chunk_gated_delta_rule from the flash-linear-attention (FLA) library, which contains CUDA-only Triton kernels. On Ascend NPU, these kernels must be redirected to MindSpeed Triton implementations to achieve comparable performance.

Without this patch, Qwen3.5 falls back to the pure PyTorch torch_chunk_gated_delta_rule, resulting in ~33% slower training on NPU.

Main changes

| File | Change |
| --- | --- |
| `twinkle/kernel/chunk_gated_delta_rule.py` | New. MindSpeed Triton wrapper for `chunk_gated_delta_rule`. Re-exports the public API with a signature identical to the FLA library's. |
| `twinkle/kernel/monkey_patch_npu.py` | Extended. Adds `_patch_qwen3_5_fla()`, which: (1) spoofs `transformers.utils.is_flash_linear_attention_available` to bypass CUDA-only checks; (2) replaces the module-level `chunk_gated_delta_rule` with the MindSpeed implementation; (3) traverses instantiated model layers to re-bind the per-instance `chunk_gated_delta_rule` (required because Qwen3.5 caches the function reference at `__init__` time). |
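The three patch steps listed for `_patch_qwen3_5_fla()` can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `patch_fla`, `FakeLinearAttention`, and the two stand-in kernels are hypothetical names, and plain namespaces stand in for the real `transformers` modules so the example runs without an NPU.

```python
import types


# Hypothetical stand-ins so the sketch runs without transformers or an NPU.
def torch_chunk_gated_delta_rule(*args, **kwargs):
    """Stand-in for the pure PyTorch fallback."""
    return "torch-fallback"


def mindspeed_chunk_gated_delta_rule(*args, **kwargs):
    """Stand-in for the MindSpeed Triton wrapper."""
    return "mindspeed-triton"


class FakeLinearAttention:
    """Mimics a layer that caches the kernel reference at __init__ time."""

    def __init__(self, modeling_module):
        self.chunk_gated_delta_rule = modeling_module.chunk_gated_delta_rule

    def forward(self):
        return self.chunk_gated_delta_rule()


def patch_fla(utils_module, modeling_module, layers, replacement):
    """Apply the three steps and return how many layers were re-bound."""
    # (1) Report FLA as available so the CUDA-only check no longer blocks it.
    utils_module.is_flash_linear_attention_available = lambda: True
    # (2) Replace the module-level symbol; layers built later pick this up.
    modeling_module.chunk_gated_delta_rule = replacement
    # (3) Re-bind layers that already cached the old reference.
    patched = 0
    for layer in layers:
        if hasattr(layer, "chunk_gated_delta_rule"):
            layer.chunk_gated_delta_rule = replacement
            patched += 1
    return patched


# Namespaces playing the role of transformers.utils and the modeling module.
utils = types.SimpleNamespace(is_flash_linear_attention_available=lambda: False)
modeling = types.SimpleNamespace(chunk_gated_delta_rule=torch_chunk_gated_delta_rule)

early_layer = FakeLinearAttention(modeling)  # built before the patch
n_patched = patch_fla(
    utils, modeling, [early_layer, object()], mindspeed_chunk_gated_delta_rule
)
late_layer = FakeLinearAttention(modeling)  # built after the patch
```

Without step (3), `early_layer` would keep calling the fallback even after the module-level symbol is replaced, which is why the traversal is needed for models that are already instantiated.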

Environment Variables

All FLA behavior is gated under the existing TWINKLE_NPU_PATCH hierarchy:

# Master switch for all NPU optimizations.
export TWINKLE_NPU_PATCH=1

# Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA).
export TWINKLE_NPU_FUSED_OPS=1

# Enable MoE Grouped MatMul.
export TWINKLE_NPU_MOE_PATCH=1

# Enable Flash Linear Attention for Qwen3.5.
# Default: 1 (enabled). Set to 0 to force torch fallback.
export TWINKLE_NPU_FLA=1
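
A rough sketch of how such a flag hierarchy could be resolved is shown below. Only the variable names and the `TWINKLE_NPU_FLA` default of 1 come from this description; the helper names, the accepted truthy strings, and the defaults for the other three variables are assumptions made for illustration.

```python
import os

# Assumption: which strings count as "enabled" is not specified in the PR.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def _env_flag(name, default, env=None):
    """Read a boolean flag from env, falling back to default when unset."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def resolve_npu_flags(env=None):
    """Resolve sub-flags; each one is only honored when the master switch is on."""
    # Assumed default: master switch off unless explicitly set.
    master = _env_flag("TWINKLE_NPU_PATCH", False, env)
    return {
        # Assumed defaults for fused ops and MoE: off.
        "fused_ops": master and _env_flag("TWINKLE_NPU_FUSED_OPS", False, env),
        "moe": master and _env_flag("TWINKLE_NPU_MOE_PATCH", False, env),
        # Default 1 (enabled), as stated above.
        "fla": master and _env_flag("TWINKLE_NPU_FLA", True, env),
    }


# Example: master switch on, FLA explicitly disabled to force the torch fallback.
flags = resolve_npu_flags({"TWINKLE_NPU_PATCH": "1", "TWINKLE_NPU_FLA": "0"})
```

Resolving the flags once at patch time, rather than inside a forward pass, keeps environment lookups out of hot paths.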

Related: modelscope/ms-swift#9223

Experiment results

Model: Qwen3.5-35B-A3B (40 layers, 30× linear_attention)
Hardware: Atlas 900 A3 (2 x NPU)
Dataset: GSM8K_ZH
Finetuning type: LoRA
Software: CANN 8.5.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.0 + transformers 5.9

| Metric | Baseline (FLA OFF) | FLA ON (MindSpeed) | Delta |
| --- | --- | --- | --- |
| Avg. duration per 10-step interval | 57.7 s | 43.8 s | −24.1% |
| Avg. Loss | — | — | −0.0005 (identical) |
| Avg. Grad Norm | — | — | −0.024 (within noise) |
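
For reference, the duration delta can be reproduced from the two measured values. The snippet below is only arithmetic on the numbers in the table; the variable names are illustrative.

```python
baseline_s = 57.7  # avg. duration per 10-step interval, FLA OFF
fla_s = 43.8       # avg. duration per 10-step interval, FLA ON

# Change in duration relative to the baseline (the "Delta" column).
delta_pct = (fla_s - baseline_s) / baseline_s * 100

# The same gap expressed as how much slower the baseline is than FLA ON.
slowdown_pct = (baseline_s - fla_s) / fla_s * 100

# Throughput ratio of FLA ON over the baseline.
speedup = baseline_s / fla_s

print(f"delta: {delta_pct:.1f}%, baseline slowdown: {slowdown_pct:.1f}%, speedup: {speedup:.2f}x")
```

The two percentages describe the same gap from different reference points: the delta is measured against the baseline duration, while the slowdown is measured against the FLA ON duration.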

gemini-code-assist

Code Review

This pull request significantly enhances NPU optimization support for Transformers, specifically targeting Qwen3.5 and MoE architectures. Key changes include the introduction of a Flash Linear Attention (FLA) implementation using MindSpeed Triton kernels, support for MoE Packed Experts and Sparse Blocks, and dynamic model discovery for automatic patching. Feedback focused on performance optimizations, including removing dead code in the FLA implementation, replacing torch.histc with torch.bincount for expert counting, caching normalized expert weights to avoid redundant operations in the forward pass, and moving environment variable lookups out of hot paths.

gemini-code-assist Bot reviewed May 24, 2026

Comment thread src/twinkle/kernel/chunk_gated_delta_rule.py Outdated

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated

Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated

tastelikefeet approved these changes May 25, 2026

ys2025-AI closed this May 26, 2026

ys2025-AI force-pushed the main branch from f21d290 to 6b2159c Compare May 26, 2026 06:09

ys2025-AI wants to merge 0 commits into modelscope:main from ys2025-AI:main