[bugfix]: Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training and fix create_causal_mask patch when cache_positions removed in transformers >5.3.0 #202

Draft
meichangsu1 wants to merge 6 commits into modelscope:main from meichangsu1:padding_free_bufix_ljl

Collaborator

@meichangsu1 meichangsu1 commented May 22, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

This PR fixes padding-free and sequence-parallel compatibility issues with newer Transformers versions.

Main changes:
With Transformers > 5.3.0, cache_position is removed (in this PR), so we:

  • Support forward methods whose kwargs are incompatible across Transformers versions.
  • Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training.
  • Skip the Twinkle GDN padding-free patch when the Transformers version natively supports Qwen3.5 cu_seq_lens_q (in this PR).
  • Adapt sequence-parallel causal mask patching for both old and new Transformers mask APIs.

Experiment results

End-to-end test results are as follows.
Sequence parallel (sp):

| Model | Transformers | Attention implementation | Result |
| --- | --- | --- | --- |
| Qwen3-0.6B | 5.3.0 | flash_attention_2 | pass |
| Qwen3-0.6B | 5.3.0 | sdpa | pass |
| Qwen3-0.6B | 5.8.0 | flash_attention_2 | pass |
| Qwen3-0.6B | 5.8.0 | sdpa | pass |
| Qwen3.5-4B | 5.3.0 | flash_attention_2 | pass |
| Qwen3.5-4B | 5.3.0 | sdpa | pass |

padding_free:

| Model | Transformers | Attention implementation | Result |
| --- | --- | --- | --- |
| Qwen3.5-4B | 5.3.0 | flash_attention_2 | pass |
| Qwen3.5-4B | 5.9.0+ | flash_attention_2 | pass |

qq_30035749 added 3 commits May 22, 2026 19:16
Add `_call_with_supported_kwargs` utility to filter out unsupported keyword arguments when calling forward methods, preventing errors from incompatible function signatures. This fixes issues where `origin_forward` methods may not accept all passed kwargs.
…handling

- Add `_call_with_supported_kwargs` and `_call_create_causal_mask` helpers to filter unsupported kwargs
- Rename `cache_position` parameter to `q_length` in flash_attention_mask and sdpa_mask for clarity
- Fix device detection in sdpa_mask when `q_length` is not a tensor
- Ensure compatibility with models that don't accept `cache_position` in causal mask functions
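The kwargs filtering described in these commits can be illustrated with a minimal sketch. This is not the PR's `_call_with_supported_kwargs`; the name `call_with_supported_kwargs` and the two stand-in forward functions below are hypothetical, and the sketch only assumes that the helper decides what to pass by inspecting the callee's signature.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call fn, dropping keyword arguments its signature does not accept.

    Hypothetical illustration, not the helper added in this PR.
    """
    params = inspect.signature(fn).parameters
    # A callee that declares **kwargs accepts any keyword, so forward everything.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(*args, **kwargs)
    # Only parameters that can be passed by keyword are eligible.
    accepted = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    supported = {k: v for k, v in kwargs.items() if k in accepted}
    return fn(*args, **supported)


# Stand-ins for forward methods from two library versions (assumed shapes).
def old_forward(hidden_states, cache_position=None):
    return ("old", hidden_states, cache_position)


def new_forward(hidden_states, q_length=None):
    return ("new", hidden_states, q_length)
```

With this sketch, the same call site can pass both `cache_position` and `q_length`, and each stand-in receives only the argument it declares. Silently dropping kwargs can hide typos, so a real helper might log what it discards.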
…ill path

In sequence parallel training, when newer Transformers versions pass q_length/q_offset instead of cache_position, the causal mask creation may still see the local shard length. This change restores the global query length for the no-cache prefill path while keeping cache/sliding paths with their upstream offsets.

Also refactor GDN padding-free detection to use transformers version check instead of source inspection, supporting transformers >= 5.9.0.
@meichangsu1 meichangsu1 marked this pull request as draft May 22, 2026 11:23
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces compatibility helpers and updates masking and forward logic to support varying transformers library versions, specifically handling changes in argument signatures such as cache_position. The review feedback primarily focuses on optimizing performance by caching inspect.signature results and moving signature checks out of hot-path function definitions to avoid significant overhead during the model's forward pass.

Comment thread src/twinkle/model/transformers/strategy/sequence_parallel/__init__.py Outdated
Comment on lines +181 to +186
    if 'cache_position' in inspect.signature(masking_utils.origin_create_causal_mask).parameters:
        cache_position_or_past_key_values = torch.arange(
            0,
            input_embeds.shape[1],
            device=input_embeds.device,
        )

high

Similar to the optimization suggested for sdpa_mask, the signature of masking_utils.origin_create_causal_mask should be inspected once outside the create_causal_mask function to avoid overhead in the forward pass.
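One way to follow this suggestion is to inspect the original function once, when the patch is installed, and let the patched function close over the result. The sketch below is an assumption about how that could look, not the code in this PR: `build_patched_mask_fn`, `old_origin`, and `new_origin` are hypothetical names, and `list(range(...))` stands in for `torch.arange` so the example runs without PyTorch.

```python
import inspect
from typing import Any, Callable


def build_patched_mask_fn(origin_fn: Callable[..., Any]) -> Callable[[int], Any]:
    """Return a wrapper around origin_fn whose signature check runs only once."""
    # Inspected a single time, at patch time, rather than on every forward pass.
    takes_cache_position = "cache_position" in inspect.signature(origin_fn).parameters

    def patched(seq_len: int) -> Any:
        if takes_cache_position:
            # Older API (assumed): pass explicit positions 0..seq_len-1.
            return origin_fn(cache_position=list(range(seq_len)))
        # Newer API (assumed): pass the query length instead.
        return origin_fn(q_length=seq_len)

    return patched


# Stand-ins for the original mask function in two library versions.
def old_origin(cache_position=None):
    return cache_position


def new_origin(q_length=None):
    return q_length
```

The per-call cost then reduces to a boolean check. If the original function can be swapped out after patching, caching keyed on the function object (for example with `functools.lru_cache`) may be the safer design.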

Comment thread src/twinkle/patch/gdn_padding_free.py Outdated
qq_30035749 added 2 commits May 24, 2026 15:32
The version check for transformers >= 5.9.0 was removed because it is no longer needed. The GDN padding-free patch should always be applied regardless of the transformers version, as the native support check is handled elsewhere or the patch is required for all versions.
Add version check to only apply the chunk_gated_delta_rule cu_seqlens patch when using transformers versions below 5.9.0, preventing compatibility issues with newer releases.
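The version gate described in this commit can be sketched as follows. The function names are hypothetical and the parser is deliberately simplified to the standard library: it compares only the leading numeric release, so a pre-release such as `5.9.0.dev0` is treated as 5.9.0. Real code would more likely compare versions with `packaging.version.parse`, which orders pre-releases correctly.

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, int, int]:
    """Extract the leading numeric release, padded to three components."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    # Pad so that "5.9" compares equal to "5.9.0"; ignore components past the third.
    major, minor, patch = (parts + [0, 0, 0])[:3]
    return (major, minor, patch)


def should_apply_gdn_patch(
    transformers_version: str,
    native_support_from: Tuple[int, int, int] = (5, 9, 0),
) -> bool:
    """True when the installed version is below the assumed native-support version."""
    return parse_release(transformers_version) < native_support_from
```

A caller would pass the installed version string, for example the value of `transformers.__version__`, and apply the patch only when the function returns `True`.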
@meichangsu1 meichangsu1 changed the title from "Padding free bufix ljl" to "[bugfix]: Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training and fix create_causal_mask patch when cache_positions removed in transformers >5.3.0" May 25, 2026