[PyTorch] Zero-initialize learnable softmax_offset in DotProductAttention #2694

Open
fjosw wants to merge 1 commit into NVIDIA:main from fjosw:fix/softmax-offset-zero-init-v2

Conversation

fjosw (Contributor) commented Feb 20, 2026

Description

The PyTorch implementation of DotProductAttention initializes the learnable softmax_offset parameter with torch.empty(), which leaves it containing uninitialized memory. Unlike all other TransformerEngineBaseModule subclasses (Linear, LayerNormLinear, LayerNormMLP, GroupedLinear), DotProductAttention does not call self.reset_parameters() in its __init__, so the deferred initialization system that would normally overwrite the torch.empty() contents is never invoked.

The JAX implementation explicitly uses nn.initializers.zeros for this parameter. This fix aligns the PyTorch behavior by using torch.zeros() directly. In Megatron-LM this is not a problem because the parameter is initialized explicitly, but when DotProductAttention is used in isolation the uninitialized values can cause problems.
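As a rough sketch of the change (illustrative only; the parameter shape and the surrounding code are not taken from dot_product_attention.py):

```python
import torch

num_attention_heads = 16  # illustrative value, not the module's actual configuration

# Before: torch.empty() only allocates storage, so the parameter starts with
# whatever bytes happen to be in that memory.
softmax_offset = torch.nn.Parameter(torch.empty(1, num_attention_heads, 1, 1))

# After this PR: torch.zeros() gives a deterministic zero start, matching the
# JAX implementation's nn.initializers.zeros.
softmax_offset = torch.nn.Parameter(torch.zeros(1, num_attention_heads, 1, 1))
```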

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Change torch.empty() to torch.zeros() when creating the learnable softmax_offset parameter in DotProductAttention, ensuring it is zero-initialized rather than containing uninitialized memory.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

[PyTorch] Zero-initialize learnable softmax_offset in DotProductAttention

DotProductAttention used torch.empty() for the learnable softmax_offset
parameter. Unlike all other TransformerEngineBaseModule subclasses,
DotProductAttention does not call reset_parameters() in __init__, so the
deferred initialization that would normally overwrite the empty tensor is
never invoked, leaving the parameter with uninitialized memory.

The JAX implementation explicitly uses nn.initializers.zeros for this
parameter. This aligns the PyTorch behavior by using torch.zeros().

Signed-off-by: Fabian Joswig <fjosw@users.noreply.github.com>
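
For context, the deferred-initialization pattern the commit message refers to looks roughly like this; this is a schematic sketch, not the actual TransformerEngineBaseModule code:

```python
import torch

class DeferredInitModule(torch.nn.Module):
    """Schematic of the pattern other TE modules follow; illustrative only."""

    def __init__(self, features: int):
        super().__init__()
        # torch.empty() allocates storage without initializing its contents.
        self.weight = torch.nn.Parameter(torch.empty(features, features))
        # Other TransformerEngineBaseModule subclasses call this in __init__;
        # DotProductAttention did not, so the torch.empty() contents survived.
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Deferred initialization overwrites the uninitialized storage.
        torch.nn.init.zeros_(self.weight)
```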
greptile-apps bot (Contributor) commented Feb 20, 2026

Greptile Summary

This PR fixes a parameter initialization bug in DotProductAttention where the learnable softmax_offset parameter was initialized with torch.empty(), leaving it with uninitialized memory values. The fix changes this to torch.zeros() to ensure proper zero-initialization.

Key Points:

  • The JAX implementation explicitly uses nn.initializers.zeros for this parameter (line 165 in transformer_engine/jax/flax/transformer.py)
  • The "off-by-one" softmax type already uses torch.zeros() in the same file (line 436-438)
  • Unlike other TransformerEngineBaseModule subclasses, DotProductAttention does not call self.reset_parameters() in __init__, so the deferred initialization system never overwrites the uninitialized values
  • This bug would only manifest when using DotProductAttention in isolation, without Megatron-LM's explicit initialization (see the sketch after this list)
  • Existing tests cover the learnable softmax type and should verify the fix works correctly

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Single-line bug fix that corrects improper initialization, aligns with JAX implementation and the "off-by-one" case in the same file, has existing test coverage, and follows established patterns
  • No files require special attention

Important Files Changed

Filename: transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py
Overview: Changed torch.empty() to torch.zeros() for the learnable softmax_offset parameter initialization, ensuring zero-initialization instead of uninitialized memory

Last reviewed commit: 4aab56c

greptile-apps bot left a comment

1 file reviewed, no comments
