[NPU Feature] Qwen3.5 NPU FLA and fused-operator patches #203

Closed
ys2025-AI wants to merge 0 commits into modelscope:main from ys2025-AI:main

ys2025-AI (Collaborator) commented May 24, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Description

This PR ports the Qwen3.5 FLA and Ascend NPU fused-operator patches.

Motivation

Qwen3.5 introduces hybrid attention layers (linear_attention + full_attention). The linear_attention path relies on chunk_gated_delta_rule from the flash-linear-attention (FLA) library, which contains CUDA-only Triton kernels. On Ascend NPU, these kernels must be redirected to MindSpeed Triton implementations to achieve comparable performance.

Without this patch, Qwen3.5 falls back to the pure PyTorch torch_chunk_gated_delta_rule, resulting in ~33% slower training on NPU.

Main changes

| File | Change |
| --- | --- |
| twinkle/kernel/chunk_gated_delta_rule.py | New. MindSpeed Triton wrapper for chunk_gated_delta_rule. Re-exports the public API with identical signature to the FLA library. |
| twinkle/kernel/monkey_patch_npu.py | Extended. Adds _patch_qwen3_5_fla() which: (1) spoofs transformers.utils.is_flash_linear_attention_available to bypass CUDA-only checks; (2) replaces module-level chunk_gated_delta_rule with the MindSpeed implementation; (3) traverses instantiated model layers to re-bind per-instance chunk_gated_delta_rule (required because Qwen3.5 caches the function reference at __init__ time). |
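
The three patch steps can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: `modeling`, `LinearAttentionLayer`, `patch_fla`, and the two dummy kernels are not the real transformers, FLA, or MindSpeed objects, and the actual `_patch_qwen3_5_fla()` may be structured differently. The point is only to show why step (3) is needed when a layer caches the function reference at construction time.

```python
import types


# Hypothetical stand-ins for the CUDA-only kernel and its NPU replacement.
def cuda_chunk_gated_delta_rule(x):
    return f"cuda:{x}"


def npu_chunk_gated_delta_rule(x):
    return f"npu:{x}"


# Hypothetical stand-in for a modeling module with a module-level kernel
# reference and an availability check.
modeling = types.SimpleNamespace(
    chunk_gated_delta_rule=cuda_chunk_gated_delta_rule,
    is_flash_linear_attention_available=lambda: False,
)


class LinearAttentionLayer:
    def __init__(self):
        # The function reference is cached here, so later changes to the
        # module-level name do not reach already-built instances.
        self.chunk_gated_delta_rule = modeling.chunk_gated_delta_rule

    def forward(self, x):
        return self.chunk_gated_delta_rule(x)


def patch_fla(module, layers):
    # (1) Make the availability check report True.
    module.is_flash_linear_attention_available = lambda: True
    # (2) Replace the module-level function.
    module.chunk_gated_delta_rule = npu_chunk_gated_delta_rule
    # (3) Re-bind the cached reference on every layer passed in.
    rebound = 0
    for layer in layers:
        if hasattr(layer, "chunk_gated_delta_rule"):
            layer.chunk_gated_delta_rule = npu_chunk_gated_delta_rule
            rebound += 1
    return rebound


early_layer = LinearAttentionLayer()      # built before the patch, then re-bound
unpatched_layer = LinearAttentionLayer()  # built before the patch, never re-bound
rebound_count = patch_fla(modeling, [early_layer])
late_layer = LinearAttentionLayer()       # built after the patch
```

In this sketch `unpatched_layer` keeps calling the original function even though the module-level name was replaced, which is the situation step (3) is meant to avoid.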

Environment Variables

All FLA behavior is gated under the existing TWINKLE_NPU_PATCH hierarchy:

# Master switch for all NPU optimizations.
export TWINKLE_NPU_PATCH=1

# Enable fused operators (RMSNorm, RoPE, SwiGLU, SDPA).
export TWINKLE_NPU_FUSED_OPS=1

# Enable MoE Grouped MatMul.
export TWINKLE_NPU_MOE_PATCH=1

# Enable Flash Linear Attention for Qwen3.5.
# Default: 1 (enabled). Set to 0 to force torch fallback.
export TWINKLE_NPU_FLA=1
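
One way such a hierarchy could be resolved in Python is sketched below. This is an illustration under stated assumptions, not the code in this PR: `NpuFlags` and `resolve_npu_flags` are hypothetical names, the rule that sub-flags only take effect when the master switch is on is inferred from the "master switch" comment, and the defaults for every variable except TWINKLE_NPU_FLA (documented above as 1) are assumed to be 0.

```python
import os
from dataclasses import dataclass


def _flag(env, name, default):
    # A flag counts as enabled only when its value is exactly "1".
    return env.get(name, default).strip() == "1"


@dataclass(frozen=True)
class NpuFlags:
    fused_ops: bool
    moe: bool
    fla: bool


def resolve_npu_flags(env=os.environ):
    # Assumed default: master switch off when unset.
    master = _flag(env, "TWINKLE_NPU_PATCH", "0")
    return NpuFlags(
        # Assumed defaults: off when unset.
        fused_ops=master and _flag(env, "TWINKLE_NPU_FUSED_OPS", "0"),
        moe=master and _flag(env, "TWINKLE_NPU_MOE_PATCH", "0"),
        # Default 1 as stated in the description above.
        fla=master and _flag(env, "TWINKLE_NPU_FLA", "1"),
    )


# Resolve once (for example at import time) and reuse the result, rather
# than reading the environment inside a forward pass.
FLAGS = resolve_npu_flags({"TWINKLE_NPU_PATCH": "1"})
```

Resolving the flags once into an immutable object keeps environment lookups out of hot paths.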

Related: modelscope/ms-swift#9223

Experiment results

  • Model: Qwen3.5-35B-A3B (40 layers, 30× linear_attention)
  • Hardware: Atlas 900 A3 (2 x NPU)
  • Dataset: GSM8K_ZH
  • Finetuning type: LoRA
  • Software: CANN 8.5.0 + torch/torch_npu 2.9.0 + MindSpeed 0.12.1 + triton-ascend 3.2.0 + transformers 5.9

| Metric | Baseline (FLA OFF) | FLA ON (MindSpeed) | Delta |
| --- | --- | --- | --- |
| Avg. duration per 10-step interval | 57.7 s | 43.8 s | −24.1% |
| Avg. Loss | | | −0.0005 (identical) |
| Avg. Grad Norm | | | −0.024 (within noise) |
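
The duration delta can be checked from the two reported timings. The short calculation below also expresses the same gap relative to the FLA ON run, which is presumably the basis for the "~33% slower" figure in the Motivation section (from these two numbers it comes out slightly lower). Variable names are illustrative.

```python
baseline_s = 57.7  # avg. duration per 10-step interval, FLA OFF
fla_on_s = 43.8    # avg. duration per 10-step interval, FLA ON

# Change in duration relative to the baseline (the table's Delta column).
delta_pct = (fla_on_s - baseline_s) / baseline_s * 100

# How much slower the baseline is relative to the FLA ON run.
baseline_slowdown_pct = (baseline_s / fla_on_s - 1) * 100

print(f"{delta_pct:.1f}%")              # -24.1%
print(f"{baseline_slowdown_pct:.1f}%")  # 31.7%
```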

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request significantly enhances NPU optimization support for Transformers, specifically targeting Qwen3.5 and MoE architectures. Key changes include the introduction of a Flash Linear Attention (FLA) implementation using MindSpeed Triton kernels, support for MoE Packed Experts and Sparse Blocks, and dynamic model discovery for automatic patching. Feedback focused on performance optimizations, including removing dead code in the FLA implementation, replacing torch.histc with torch.bincount for expert counting, caching normalized expert weights to avoid redundant operations in the forward pass, and moving environment variable lookups out of hot paths.

Comment thread src/twinkle/kernel/chunk_gated_delta_rule.py Outdated
Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated
Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated
Comment thread src/twinkle/kernel/monkey_patch_npu.py Outdated