[tx] Implement Qwen 3.5 model architecture #1228
Conversation
/gemini review
Code Review
This pull request introduces the Qwen 3.5 model architecture, which supports a mix of full attention and linear attention layers. The implementation is comprehensive, including changes to the model configuration, KV caching for the new layer types, and extensive tests comparing against the reference Hugging Face implementation. My review focuses on improving the robustness of the configuration handling and fixing a potential bug in model initialization. Overall, this is a solid contribution.
```python
if layer_types is None:
    interval = getattr(config, "full_attention_interval", 4)
    layer_types = [
        "linear_attention" if (i + 1) % interval else "full_attention" for i in range(config.num_hidden_layers)
    ]
config.layer_types = layer_types
```
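For reference, the interval logic above places a full-attention layer at every `full_attention_interval`-th position and linear attention everywhere else. A minimal standalone sketch of that pattern (the helper name `build_layer_types` is not from the PR):

```python
def build_layer_types(num_hidden_layers: int, full_attention_interval: int = 4) -> list[str]:
    # (i + 1) % interval is falsy exactly on every interval-th layer,
    # so those layers get full attention and the rest get linear attention.
    return [
        "linear_attention" if (i + 1) % full_attention_interval else "full_attention"
        for i in range(num_hidden_layers)
    ]

print(build_layer_types(8))
# → ['linear_attention', 'linear_attention', 'linear_attention', 'full_attention',
#    'linear_attention', 'linear_attention', 'linear_attention', 'full_attention']
```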
Modifying the config object in-place by setting config.layer_types can lead to unexpected side effects if the same config object is used elsewhere. A cleaner approach is to avoid mutating the config. You could compute layer_types and store it as a member of this class, then pass the specific layer_type to each Qwen3_5DecoderLayer during its initialization. This would require a small change to Qwen3_5DecoderLayer.__init__ to accept layer_type as an argument.
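A minimal sketch of the suggested refactor, assuming the class names from the PR but otherwise hypothetical structure: `layer_types` is computed once and kept on the model, and each decoder layer receives its own `layer_type` explicitly instead of reading a mutated config.

```python
# Hypothetical sketch of the reviewer's suggestion; class internals are
# assumptions, only the names Qwen3_5DecoderLayer / layer_types come from the PR.

class Qwen3_5DecoderLayer:
    def __init__(self, config, layer_type: str):
        # The layer's attention variant is passed in explicitly,
        # so the layer never needs to consult config.layer_types.
        self.layer_type = layer_type


class Qwen3_5Model:
    def __init__(self, config):
        layer_types = getattr(config, "layer_types", None)
        if layer_types is None:
            interval = getattr(config, "full_attention_interval", 4)
            layer_types = [
                "linear_attention" if (i + 1) % interval else "full_attention"
                for i in range(config.num_hidden_layers)
            ]
        # Stored on the model instance; config is left untouched.
        self.layer_types = layer_types
        self.layers = [
            Qwen3_5DecoderLayer(config, layer_type) for layer_type in layer_types
        ]
```

With this shape, reusing the same config object to build a second model cannot pick up state left behind by the first.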
This PR implements the Qwen 3.5 model architecture, supporting a mix of linear and full attention layers. To keep it simple, the layers are not stacked yet, and MoE support is not included in this PR.
Here are some examples you can run on 8xH100:
- Qwen/Qwen3.5-27B model
- Qwen/Qwen3.5-4B model
- RL example