[bugfix]: Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training and fix create_causal_mask patch when cache_positions removed in transformers >5.3.0 by meichangsu1 · Pull Request #202 · modelscope/twinkle

meichangsu1 · 2026-05-22T11:23:31Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

This PR fixes padding-free and sequence-parallel compatibility issues with newer Transformers versions.

Main changes:
In Transformers > 5.3.0, `cache_position` is removed (in this PR), so we:

Support forward methods whose kwargs are incompatible across Transformers versions.
Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training.
Skip the Twinkle GDN padding-free patch when the installed Transformers version natively supports Qwen3.5 `cu_seq_lens_q` (in this PR).
Adapt sequence-parallel causal mask patching for both old and new Transformers mask APIs.

Experiment results

End-to-end test results are as follows.

Sequence parallel (sp):

| Model | Transformers | Attention implementation | Result |
| --- | --- | --- | --- |
| Qwen3-0.6B | 5.3.0 | flash_attention_2 | pass |
| Qwen3-0.6B | 5.3.0 | sdpa | pass |
| Qwen3-0.6B | 5.8.0 | flash_attention_2 | pass |
| Qwen3-0.6B | 5.8.0 | sdpa | pass |
| Qwen3.5-4B | 5.3.0 | flash_attention_2 | pass |
| Qwen3.5-4B | 5.3.0 | sdpa | pass |

padding_free:

| Model | Transformers | Attention implementation | Result |
| --- | --- | --- | --- |
| Qwen3.5-4B | 5.3.0 | flash_attention_2 | pass |
| Qwen3.5-4B | 5.9.0+ | flash_attention_2 | pass |

Add `_call_with_supported_kwargs` utility to filter out unsupported keyword arguments when calling forward methods, preventing errors from incompatible function signatures. This fixes issues where `origin_forward` methods may not accept all passed kwargs.
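The kwarg-filtering idea described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual implementation: the function name `call_with_supported_kwargs` and the two stand-in forward functions are hypothetical, and the only assumption is that the target's signature can be read with `inspect.signature`.

```python
import inspect
from typing import Any, Callable


def call_with_supported_kwargs(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call ``func`` with only the keyword arguments its signature accepts."""
    params = inspect.signature(func).parameters
    # A **kwargs catch-all means every keyword is acceptable as-is.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)
    accepted = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    # Silently drop keywords the target does not declare.
    return func(*args, **{k: v for k, v in kwargs.items() if k in accepted})


# Hypothetical stand-ins for an older and a newer forward signature.
def old_forward(hidden_states, attention_mask=None, cache_position=None):
    return ("old", cache_position)


def new_forward(hidden_states, attention_mask=None):
    return ("new", attention_mask)
```

With this shape, the same call site can pass `cache_position` unconditionally: `old_forward` receives it, while `new_forward` is called without it instead of raising a `TypeError`.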

…handling

- Add `_call_with_supported_kwargs` and `_call_create_causal_mask` helpers to filter unsupported kwargs
- Rename `cache_position` parameter to `q_length` in flash_attention_mask and sdpa_mask for clarity
- Fix device detection in sdpa_mask when `q_length` is not a tensor
- Ensure compatibility with models that don't accept `cache_position` in causal mask functions

…ill path

In sequence parallel training, when newer Transformers versions pass `q_length`/`q_offset` instead of `cache_position`, the causal mask creation may still see the local shard length. This change restores the global query length for the no-cache prefill path while keeping the cache and sliding paths on their upstream offsets.

Also refactor GDN padding-free detection to use a Transformers version check instead of source inspection, supporting transformers >= 5.9.0.

gemini-code-assist

Code Review

This pull request introduces compatibility helpers and updates masking and forward logic to support varying transformers library versions, specifically handling changes in argument signatures such as cache_position. The review feedback primarily focuses on optimizing performance by caching inspect.signature results and moving signature checks out of hot-path function definitions to avoid significant overhead during the model's forward pass.

gemini-code-assist · 2026-05-22T11:26:55Z

+                if 'cache_position' in inspect.signature(masking_utils.origin_create_causal_mask).parameters:
+                    cache_position_or_past_key_values = torch.arange(
+                        0,
+                        input_embeds.shape[1],
+                        device=input_embeds.device,
+                    )


Similar to the optimization suggested for sdpa_mask, the signature of masking_utils.origin_create_causal_mask should be inspected once outside the create_causal_mask function to avoid overhead in the forward pass.
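One way to follow this suggestion is to inspect the signature when the patch is installed and capture the result in a closure. The sketch below is only illustrative: `make_patched_create_causal_mask`, `old_api`, and `new_api` are hypothetical names, and a plain `list(range(...))` stands in for the `torch.arange` call in the diff so the example runs without PyTorch.

```python
import inspect


def make_patched_create_causal_mask(origin_create_causal_mask):
    """Wrap ``origin_create_causal_mask`` and inspect its signature only once."""
    # Evaluated once, when the patch is installed, not on every forward pass.
    accepts_cache_position = (
        "cache_position" in inspect.signature(origin_create_causal_mask).parameters
    )

    def create_causal_mask(seq_len, **kwargs):
        if accepts_cache_position:
            # Older API: pass explicit positions covering the full sequence.
            kwargs["cache_position"] = list(range(seq_len))
        else:
            # Newer API: the argument no longer exists, so never forward it.
            kwargs.pop("cache_position", None)
        return origin_create_causal_mask(seq_len=seq_len, **kwargs)

    return create_causal_mask


# Hypothetical stand-ins for the old and new upstream signatures.
def old_api(seq_len, cache_position=None):
    return cache_position


def new_api(seq_len):
    return seq_len
```

The per-call cost then reduces to reading a captured boolean, which is the overhead reduction the review asks for.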

The version check for transformers >= 5.9.0 was removed because it is no longer needed. The GDN padding-free patch should always be applied regardless of the transformers version, as the native support check is handled elsewhere or the patch is required for all versions.

Add version check to only apply the chunk_gated_delta_rule cu_seqlens patch when using transformers versions below 5.9.0, preventing compatibility issues with newer releases.
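A version gate of this kind might look like the sketch below. The helper names `parse_release` and `needs_gdn_patch` are hypothetical and the PR's actual check may differ. The parser is deliberately simplified: it keeps only the leading numeric components, so a pre-release such as `5.9.0.dev0` is treated as `5.9.0`, which is not PEP 440 ordering.

```python
import re


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, e.g. '5.9.0.dev0' -> (5, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    # Pad short versions such as '5.9' so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def needs_gdn_patch(transformers_version: str, native_since=(5, 9, 0)) -> bool:
    """True when the installed version predates native cu_seqlens support."""
    return parse_release(transformers_version) < native_since
```

In practice the argument would be `transformers.__version__`. A production check would more likely rely on `packaging.version.Version` for full PEP 440 handling; the standard-library parser here just keeps the sketch dependency-free.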

qq_30035749 added 3 commits May 22, 2026 19:16

meichangsu1 marked this pull request as draft May 22, 2026 11:23

gemini-code-assist Bot reviewed May 22, 2026


qq_30035749 added 2 commits May 24, 2026 15:32

fix: conditionally patch chunk_gated_delta_rule for transformers < 5.9.0

a0443e5


meichangsu1 changed the title ~~Padding free bufix ljl~~ [bugfix]: Add compatibility handling for Qwen3.5 GatedDeltaNet padding-free training and fix create_causal_mask patch when cache_positions removed in transformers >5.3.0 May 25, 2026

fix gemini code review

1822cdd

meichangsu1 wants to merge 6 commits into modelscope:main from meichangsu1:padding_free_bufix_ljl
