[NPU Feature] Qwen3.5 NPU FLA and fused-operator patches#204

Merged: tastelikefeet merged 2 commits into modelscope:main from ys2025-AI:main on May 26, 2026

@ys2025-AI
Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Description

This PR ports the Qwen3.5 FLA and Ascend NPU fused-operator patches.

Motivation

Qwen3.5 introduces hybrid attention layers (linear_attention + full_attention). The linear_attention path relies on chunk_gated_delta_rule from the flash-linear-attention (FLA) library, which contains CUDA-only Triton kernels. On Ascend NPU, these kernels must be redirected to MindSpeed Triton implementations to achieve comparable performance.

Without this patch, Qwen3.5 falls back to the pure PyTorch torch_chunk_gated_delta_rule, resulting in ~33% slower training on NPU.

Main changes

| File | Change |
| --- | --- |
| twinkle/kernel/chunk_gated_delta_rule.py | New. MindSpeed Triton wrapper for chunk_gated_delta_rule. Re-exports the public API with a signature identical to the FLA library's. |
| twinkle/kernel/monkey_patch_npu.py | Extended. Adds _patch_qwen3_5_fla(), which: (1) spoofs transformers.utils.is_flash_linear_attention_available to bypass CUDA-only checks; (2) replaces the module-level chunk_gated_delta_rule with the MindSpeed implementation; (3) traverses instantiated model layers to re-bind the per-instance chunk_gated_delta_rule (required because Qwen3.5 caches the function reference at __init__ time). |
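The three patch steps listed for _patch_qwen3_5_fla() can be illustrated with a minimal, self-contained sketch. Everything here is a stand-in: the `modeling` module, `FakeLinearAttention`, `patch_fla`, and both kernel functions are hypothetical names, not the code in this PR or the real transformers layout. The point is only to show why step (3) is needed when a layer captures the function reference in its constructor.

```python
import types

# Hypothetical stand-in for the upstream modeling module (not the real transformers layout).
modeling = types.ModuleType("fake_qwen3_5_modeling")


def torch_chunk_gated_delta_rule(*args, **kwargs):
    """Stand-in for the pure PyTorch fallback."""
    return "torch-fallback"


def npu_chunk_gated_delta_rule(*args, **kwargs):
    """Stand-in for the MindSpeed Triton implementation."""
    return "mindspeed-triton"


modeling.is_flash_linear_attention_available = lambda: False
modeling.chunk_gated_delta_rule = torch_chunk_gated_delta_rule


class FakeLinearAttention:
    """Hypothetical layer that caches the kernel reference at construction time."""

    def __init__(self):
        self.chunk_gated_delta_rule = modeling.chunk_gated_delta_rule


def patch_fla(module, layers, replacement):
    """Apply the three steps; returns the number of re-bound layer instances."""
    # (1) Report the FLA backend as available so the fast path is selected.
    module.is_flash_linear_attention_available = lambda: True
    # (2) Replace the module-level function; affects layers built afterwards.
    module.chunk_gated_delta_rule = replacement
    # (3) Re-bind layers that already captured the old reference.
    patched = 0
    for layer in layers:
        if hasattr(layer, "chunk_gated_delta_rule"):
            layer.chunk_gated_delta_rule = replacement
            patched += 1
    return patched


early = FakeLinearAttention()  # built before patching, holds the fallback
count = patch_fla(modeling, [early], npu_chunk_gated_delta_rule)
late = FakeLinearAttention()   # built after patching, picks up step (2)
```

Without step (3), `early` would keep calling the fallback even though the module attribute had been replaced, because its instance attribute still points at the old function object.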

Environment Variables

All FLA behavior is gated under the existing TWINKLE_NPU_PATCH hierarchy:

# Master switch for all NPU optimizations.
export TWINKLE_NPU_PATCH=1

# Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA).
export TWINKLE_NPU_FUSED_OPS=1

# Enable MoE Grouped MatMul.
export TWINKLE_NPU_MOE_PATCH=1

# Enable Flash Linear Attention for Qwen3.5.
# Default: 1 (enabled). Set to 0 to force torch fallback.
export TWINKLE_NPU_FLA=1
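A rough sketch of how such a gated hierarchy could be evaluated in Python follows. The helper names `env_flag` and `fla_enabled` are hypothetical, the set of accepted truthy strings is an assumption, and so is treating an unset TWINKLE_NPU_PATCH as disabled; only the default of 1 for TWINKLE_NPU_FLA comes from the description above.

```python
import os

# Assumption: which strings count as "enabled" is not specified in the PR.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default, env=None):
    """Read a boolean flag from a mapping (os.environ when env is None)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


def fla_enabled(env=None):
    """FLA is active only if the master switch is on and FLA is not disabled."""
    # Assumption: the master switch is treated as off when unset.
    if not env_flag("TWINKLE_NPU_PATCH", False, env):
        return False
    # From the PR description: TWINKLE_NPU_FLA defaults to 1 (enabled).
    return env_flag("TWINKLE_NPU_FLA", True, env)
```

Passing a plain dict as `env` keeps the logic testable without touching the process environment, for example `fla_enabled({"TWINKLE_NPU_PATCH": "1", "TWINKLE_NPU_FLA": "0"})`.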

Related: modelscope/ms-swift#9223

Experiment results

  • Model: Qwen3.5-35B-A3B (40 layers, 30× linear_attention)
  • Hardware: Atlas 900 A3 (2 x NPU)
  • Dataset: GSM8K_ZH
  • Finetuning type: LoRA
  • Software: CANN 8.5.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.0 + transformers 5.9

| Metric | Baseline (FLA OFF) | FLA ON (MindSpeed) | Delta |
| --- | --- | --- | --- |
| Avg. duration per 10-step interval | 57.7 s | 43.8 s | −24.1% |
| Avg. Loss | | | −0.0005 (identical) |
| Avg. Grad Norm | | | −0.024 (within noise) |
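The duration delta can be recomputed from the two reported timings. The short calculation below uses only the 57.7 s and 43.8 s figures from the results; it also expresses the same gap from the other direction (how much slower the fallback is relative to FLA ON), which seems to be the framing behind the "~33% slower" figure in the Motivation section and works out to roughly 32% for this run.

```python
baseline_s = 57.7  # avg. duration per 10-step interval, FLA OFF
fla_on_s = 43.8    # avg. duration per 10-step interval, FLA ON

# Change relative to the baseline (the "Delta" column).
delta_pct = (fla_on_s - baseline_s) / baseline_s * 100

# Same gap measured from the FLA ON side: how much slower the fallback is.
slowdown_pct = (baseline_s / fla_on_s - 1) * 100

print(f"delta vs. baseline: {delta_pct:.1f}%")
print(f"fallback slowdown vs. FLA ON: {slowdown_pct:.1f}%")
```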

- Add kernelize_model integration to ep_fsdp2_lora and fsdp2 examples
- Support model parameter in apply_npu_patch for FLA instance patching
- Implement NPU-accelerated packed MoE experts with weight caching
- Add Qwen3.5 SparseMoeBlock forward with dual Transformers version support
- Support partial RoPE and gated RMSNorm with FP32 mode option
- Add MindSpeed Triton FLA backend integration for Qwen3.5
- Add environment variable controls for patch toggles
- Add dynamic model discovery for unknown model families
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces Ascend NPU optimizations and monkey patches for Qwen3.5 and Qwen3.5-MoE models, including Flash Linear Attention (FLA) via MindSpeed Triton kernels, packed MoE experts, and dynamic model discovery. The review feedback highlights two critical issues in the patching logic: caching expert weights when gradients are required can break the PyTorch autograd graph, and calling model.named_modules() directly on a TransformersModel wrapper will raise an AttributeError during training.

Comment thread src/twinkle/kernel/monkey_patch_npu.py
Comment thread src/twinkle/kernel/monkey_patch_npu.py
- Skip weight cache when requires_grad=True to preserve autograd graph
- Resolve underlying PyTorch model from TransformersModel wrapper in FLA patch
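The two follow-up fixes can be sketched without any framework dependency. This is an illustration under stated assumptions, not the code merged in this PR: `resolve_torch_model`, `get_packed_weight`, and the `Fake*` classes are hypothetical, and the wrapper attribute name `model` is a guess at how the wrapper exposes the underlying module.

```python
class FakeParam:
    """Hypothetical stand-in for a tensor parameter."""

    def __init__(self, value, requires_grad):
        self.value = value
        self.requires_grad = requires_grad


class FakeTorchModel:
    """Hypothetical stand-in for an object exposing named_modules()."""

    def named_modules(self):
        return [("", self)]


class FakeWrapper:
    """Hypothetical wrapper that does not expose named_modules() itself."""

    def __init__(self, model):
        self.model = model


def resolve_torch_model(obj, attr="model", max_depth=8):
    """Follow `attr` until an object with a callable named_modules is found."""
    for _ in range(max_depth):
        if callable(getattr(obj, "named_modules", None)):
            return obj
        inner = getattr(obj, attr, None)
        if inner is None:
            break
        obj = inner
    raise TypeError("no object exposing named_modules() was found")


def get_packed_weight(params, cache, pack):
    """Return packed weights, bypassing the cache for trainable parameters."""
    if any(getattr(p, "requires_grad", False) for p in params):
        # Rebuild on every call so the result is derived from the live
        # parameters rather than from a stale cached copy.
        return pack(params)
    key = tuple(id(p) for p in params)
    if key not in cache:
        cache[key] = pack(params)
    return cache[key]
```

With real tensors, a cached packed copy would be reused across steps and would no longer be connected to the current autograd graph, which is why the cache is only consulted when no parameter requires gradients.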
@tastelikefeet tastelikefeet merged commit 03b86a1 into modelscope:main May 26, 2026
1 of 3 checks passed