
[tx] Implement Qwen 3.5 model architecture#1228

Merged
pcmoritz merged 38 commits into NovaSky-AI:main from pcmoritz:tx-qwen-3.5
Mar 3, 2026
Conversation


@pcmoritz pcmoritz commented Feb 26, 2026

This PR implements the Qwen 3.5 model architecture, supporting a mix of linear attention and full attention layers. To keep things simple, the layers are not stacked yet, and MoE support is not included in this PR either.
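As a minimal illustration (not code from the PR), the mixed-layer scheme can be thought of as assigning each decoder layer a type based on an interval, so that every N-th layer uses full attention and the rest use linear attention. The function name and default interval of 4 below are assumptions for illustration:

```python
def make_layer_types(num_hidden_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Sketch: every `full_attention_interval`-th layer is full attention,
    all other layers are linear attention."""
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_hidden_layers)
    ]

# With 8 layers and interval 4, layers 4 and 8 (0-indexed: 3 and 7) are full attention.
print(make_layer_types(8))
```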

Here are some examples you can run on 8xH100:

Qwen/Qwen3.5-27B Model

uv run --extra gpu --extra tinker -m skyrl.tinker.api \
    --base-model Qwen/Qwen3.5-27B \
    --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 1, "tensor_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

and then

export TINKER_API_KEY="tml-dummy"
uv run --with wandb --with tinker sl_loop.py \
    base_url=http://localhost:8000 \
    model_name=Qwen/Qwen3.5-27B lora_rank=1 max_length=128 train_on_what=LAST_ASSISTANT_MESSAGE

Qwen/Qwen3.5-4B Model

uv run --extra gpu --extra tinker -m skyrl.tinker.api \
    --base-model Qwen/Qwen3.5-4B \
    --backend-config '{"max_lora_adapters": 2, "max_lora_rank": 1, "tensor_parallel_size": 8, "train_micro_batch_size": 1, "shard_attention_heads": false}'

and then

export TINKER_API_KEY="tml-dummy"
uv run --with wandb --with tinker sl_loop.py \
    base_url=http://localhost:8000 \
    model_name=Qwen/Qwen3.5-4B lora_rank=1 train_on_what=LAST_ASSISTANT_MESSAGE max_length=512

RL example

uv run --extra gpu --extra tinker -m skyrl.tinker.api \
    --base-model Qwen/Qwen3.5-2B \
    --backend-config '{"max_lora_adapters": 3, "max_lora_rank": 1, "tensor_parallel_size": 8, "train_micro_batch_size": 1, "sample_max_num_sequences": 64, "shard_attention_heads": false}' > out.log
export TINKER_API_KEY="tml-dummy"
uv run --with wandb --with tinker rl_loop.py \
    base_url=http://localhost:8000 \
    model_name="Qwen/Qwen3.5-2B" \
    lora_rank=1 max_tokens=1024


@pcmoritz pcmoritz added the tx label Mar 1, 2026
@pcmoritz pcmoritz changed the title [WIP] [tx] Implement Qwen 3.5 model architecture [tx] Implement Qwen 3.5 model architecture Mar 2, 2026

pcmoritz commented Mar 2, 2026

/gemini review


pcmoritz commented Mar 2, 2026

/gemini review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Qwen 3.5 model architecture, which supports a mix of full attention and linear attention layers. The implementation is comprehensive, including changes to the model configuration, KV caching for the new layer types, and extensive tests comparing against the reference Hugging Face implementation. My review focuses on improving the robustness of the configuration handling and fixing a potential bug in model initialization. Overall, this is a solid contribution.

Comment on lines +550 to +555

if layer_types is None:
    interval = getattr(config, "full_attention_interval", 4)
    layer_types = [
        "linear_attention" if (i + 1) % interval else "full_attention" for i in range(config.num_hidden_layers)
    ]
config.layer_types = layer_types

Severity: medium

Modifying the config object in-place by setting config.layer_types can lead to unexpected side effects if the same config object is used elsewhere. A cleaner approach is to avoid mutating the config. You could compute layer_types and store it as a member of this class, then pass the specific layer_type to each Qwen3_5DecoderLayer during its initialization. This would require a small change to Qwen3_5DecoderLayer.__init__ to accept layer_type as an argument.
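The reviewer's suggestion can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the class bodies are reduced to the relevant wiring, and the constructor signatures are assumptions.

```python
from types import SimpleNamespace


class Qwen3_5DecoderLayer:
    """Sketch: each layer receives its own layer_type instead of
    reading a mutated attribute off the shared config."""

    def __init__(self, config, layer_type: str):
        self.layer_type = layer_type  # "linear_attention" or "full_attention"


class Qwen3_5Model:
    def __init__(self, config):
        layer_types = getattr(config, "layer_types", None)
        if layer_types is None:
            interval = getattr(config, "full_attention_interval", 4)
            layer_types = [
                "linear_attention" if (i + 1) % interval else "full_attention"
                for i in range(config.num_hidden_layers)
            ]
        # Store on the model rather than writing back to config,
        # so the config object is never mutated in-place.
        self.layer_types = layer_types
        self.layers = [Qwen3_5DecoderLayer(config, t) for t in layer_types]


config = SimpleNamespace(num_hidden_layers=4)
model = Qwen3_5Model(config)
print([layer.layer_type for layer in model.layers])
```

The key design point is that `config` remains read-only: any other component holding a reference to the same config object sees no surprise `layer_types` attribute appear.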

@pcmoritz pcmoritz merged commit adcb93e into NovaSky-AI:main Mar 3, 2026
5 of 6 checks passed