[WIP][Fix] GLM 5 set apply_rotary_pos_emb to is_neox_style=False && remove F.relu() #45017
JaredforReal wants to merge 8 commits into huggingface:main
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Pull request overview
This PR updates the GLM-MoE-DSA (GLM-5) rotary position embedding application to support interleaved (GPT‑J style) RoPE and removes a ReLU nonlinearity from the DSA indexer’s score computation.
Changes:
- Extend `apply_rotary_pos_emb` with an `is_neox_style` switch and update GLM-MoE-DSA call sites to use interleaved RoPE (`is_neox_style=False`).
- Remove `F.relu()` from the DSA indexer scoring path.
- Adjust how the DSA index mask is combined with the attention/causal mask.
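The difference between the two RoPE layouts named in the first change can be sketched in plain Python. This is a minimal illustration, not the PR's `apply_rotary_pos_emb`: the helper names `rope_angles`, `rotate_pair`, `rope_neox`, and `rope_interleaved` are hypothetical, and the default base of 10000 is an assumption. The only point is which elements get paired for rotation.

```python
import math


def rope_angles(position, dim, base=10000.0):
    # One rotation angle per pair of channels (dim // 2 pairs in total).
    return [position * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rotate_pair(a, b, angle):
    # Standard 2D rotation of the pair (a, b) by `angle`.
    c, s = math.cos(angle), math.sin(angle)
    return a * c - b * s, a * s + b * c


def rope_neox(x, position, base=10000.0):
    # NeoX / split-half layout: channel i is paired with channel i + dim/2.
    half = len(x) // 2
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), base)):
        out[i], out[i + half] = rotate_pair(x[i], x[i + half], angle)
    return out


def rope_interleaved(x, position, base=10000.0):
    # GPT-J / interleaved layout: channel 2i is paired with channel 2i + 1.
    out = list(x)
    for i, angle in enumerate(rope_angles(position, len(x), base)):
        out[2 * i], out[2 * i + 1] = rotate_pair(x[2 * i], x[2 * i + 1], angle)
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    print("neox       :", rope_neox(x, position=3))
    print("interleaved:", rope_interleaved(x, position=3))
```

The two layouts apply the same angles but to different channel pairs, so they generally produce different outputs for the same input. Weights trained with one layout therefore need that same layout at inference time.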
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/transformers/models/glm_moe_dsa/modular_glm_moe_dsa.py` | Implements interleaved-vs-NeoX RoPE logic, updates RoPE call sites, removes ReLU, and modifies mask combination logic. |
| `src/transformers/models/glm_moe_dsa/modeling_glm_moe_dsa.py` | Regenerated modeling file reflecting the same RoPE/scoring/mask-combination changes from the modular source. |
ArthurZucker left a comment:
Ty! We can probably add what you shared in the snippet as a super slow test.
@ArthurZucker Yeah, the test is really super slow!!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Okay, waiting for your 🟢 to merge. Can you add in integration tests please? 🤗
[For maintainers] Suggested jobs to run (before merge) run-slow: glm_moe_dsa |
What does this PR do?
- Get the RoPE operation right
  - Before: NeoX split-half style
  - After: GPT-J/interleaved style (`interleaved=True`, same as `is_neox_style=False`), the right one
- Get rid of `F.relu`
  - Reason:
    - `F.relu` works with `act_quant` and `rotate_activation` for BF16 to make `index` more accurate with numbers quantized to FP8
    - Without `act_quant` and `rotate_activation` involved, adding `F.relu` to `index.score` would make the output unreasonable (see below)

Fixes PPL test
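As a rough intuition for why a ReLU on indexer scores can change which tokens are selected, the following sketch compares top-k selection with and without clamping. It is a generic illustration only: `index_scores` and `top_k` are hypothetical helpers using a plain dot-product score, not the PR's indexer, and the sketch does not reproduce the PPL regression mentioned above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def index_scores(query, keys, use_relu):
    # Hypothetical single-head score: one dot product per key.
    scores = [dot(query, k) for k in keys]
    if use_relu:
        # ReLU clamps every negative score to zero.
        scores = [max(0.0, s) for s in scores]
    return scores


def top_k(scores, k):
    # Indices of the k largest scores; ties go to the lower index.
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


if __name__ == "__main__":
    query = [1.0, -1.0]
    keys = [[-3.0, 0.0], [-1.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
    raw = index_scores(query, keys, use_relu=False)
    clamped = index_scores(query, keys, use_relu=True)
    print("raw     :", raw, "-> top-2", top_k(raw, 2))
    print("clamped :", clamped, "-> top-2", top_k(clamped, 2))
```

After clamping, all negative scores become equal, so their relative order no longer influences the selection and the tie-break decides instead.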
Test Example
Before:
After: