add SP support for flash_varlen_hub backend #13479

zhtmike wants to merge 18 commits into huggingface:main

Conversation
code snippet to show it works
Hi @sayakpaul, the PR is ready for review. Please take a look once you have time.
sayakpaul left a comment
Thanks a lot for the PR! I left some comments, LMK what you think.
Should it be propagated to FA3, too, perhaps in a different PR?
```diff
  try:
      from flash_attn import flash_attn_func, flash_attn_varlen_func
-     from flash_attn.flash_attn_interface import _wrapped_flash_attn_backward, _wrapped_flash_attn_forward
+     from flash_attn.flash_attn_interface import (
```
WDYT of constraining the changes only to FLASH_HUB?
This way, people won't have to build the flash attention wheel locally.
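For context, switching a loaded pipeline to the Hub variant is a one-liner (a sketch, assuming the `kernels` package is installed and `pipe` is an already-loaded diffusers pipeline):

```python
# "flash_hub" fetches a prebuilt flash-attn kernel from the Hub via `kernels`,
# so users don't need to compile the flash-attn wheel locally.
pipe.transformer.set_attention_backend("flash_hub")
```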
We will be deprecating the non-Hub variants for `FLASH` and `FLASH_3` soonish anyway.
OK let me move to the hub version of flash attention 2 then
Done. Moved to FLASH_HUB.
```python
@dataclass
class _VarlenPackedInputs:
```
Does it apply to all varlen attention kernels, though? Or does it come to fruition only during CP?
We do have VARLEN implementations of a few backends already:
Any reason to use dataclasses for this? Won't it be better to apply the transformations inline for readability?
> We do have VARLEN implementations of a few backends already:

Yes, but it seems all the VARLEN implementations only work in the non-CP case. So it is fine to apply the transformations inline for non-CP.
> Any reason to use dataclasses for this? Won't it be better to apply the transformations inline for readability?

Let me drop the dataclasses, it is just my personal preference.
```python
return packed_out.view(self.batch_size, self.seq_len_q, *packed_out.shape[1:])
```

```python
def _pack_qkv(
```
Why do we need this function if we decided to do the transformations in-line?
```python
if attn_mask is not None:
    if return_lse:
        raise NotImplementedError("`return_lse=True` with `attn_mask` is not yet supported for flash-attn 2.")
```
That means Ring isn't supported yet, right?
Yes. Not supported yet.
```python
import torch
import torch.nn.functional as F

from diffusers.models.attention_dispatch import (
```
Added test coverage. The xfail for `_flash_3_hub` will be dropped once flash_3 with attention_mask is added.
Result:

```text
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-native-ulysses_degree] PASSED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-native-ring_degree] SKIPPED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-flash_hub-ulysses_degree] PASSED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-flash_hub-ring_degree] XFAIL
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-_flash_3_hub-ulysses_degree] XFAIL
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[True-_flash_3_hub-ring_degree] XFAIL
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-native-ulysses_degree] PASSED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-native-ring_degree] SKIPPED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-flash_hub-ulysses_degree] PASSED
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-flash_hub-ring_degree] XFAIL
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-_flash_3_hub-ulysses_degree] XFAIL
tests/models/transformers/test_models_transformer_qwenimage.py::TestQwenImageTransformerContextParallelAttnBackends::test_context_parallel_attn_backend_inference[False-_flash_3_hub-ring_degree] XFAIL
```
Agree. I will implement FA3 as well, in a different PR.

@claude could you do a review?
Claude finished @sayakpaul's task in 3m 25s. View job

Review of PR #13479: Add attention mask input support for flash backend

Overall: The approach is correct (using …). I have a few comments:
1. Missing None-check on …
sayakpaul left a comment
IIUC we have to rely on varlen if an attention mask is specified. If that's the case, we should rather specify …
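Something like the following minimal sketch of that dispatch idea, with `_flash_attention` / `_flash_varlen_attention` as hypothetical stand-ins for the real backend functions:

```python
from typing import Callable, Optional
import torch

# Hypothetical stand-ins for the actual dense and varlen flash kernels.
_flash_attention: Callable = lambda q, k, v, **kw: q
_flash_varlen_attention: Callable = lambda q, k, v, **kw: q

def _dispatch_flash(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor,
                    attn_mask: Optional[torch.Tensor] = None, **kwargs):
    # The dense flash kernel cannot consume an attention mask, so any masked
    # call must be routed to the variable-length kernel.
    if attn_mask is not None:
        return _flash_varlen_attention(query, key, value, attn_mask=attn_mask, **kwargs)
    return _flash_attention(query, key, value, **kwargs)
```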
```python
wrapped_forward_fn: Callable | None = None
wrapped_backward_fn: Callable | None = None
# Some backends (e.g. flash attention) have separate kernels for variable-length inputs
varlen_function_attr: str | None = None
```
Instead of introducing a new attribute for varlen, I think we should do something similar to:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Agree. Let me implement CP with the varlen kernel instead. It will look cleaner.
Hi @sayakpaul, I have reworked the CP with the varlen kernel. Now `QwenImagePipeline` supports CP with … I have tested with …, and the previous … Can you take a look? Thanks!
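For reference, a rough sketch of what running CP with this backend could look like (API names such as `ContextParallelConfig` and `enable_parallelism` follow diffusers' context-parallel docs and are assumptions here, since the exact snippet was lost above):

```python
# Assumed usage; launch with e.g. `torchrun --nproc_per_node=2 demo.py`.
# Distributed init and per-rank device handling are omitted for brevity.
import torch
from diffusers import ContextParallelConfig, QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
pipe.transformer.set_attention_backend("flash_varlen_hub")
pipe.transformer.enable_parallelism(config=ContextParallelConfig(ulysses_degree=2))
image = pipe("a cat holding a sign", num_inference_steps=20).images[0]
```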
sayakpaul left a comment
Thanks for the refactor. I don't really understand some of the big changes to the existing codebase. So, please provide reasoning behind them.
```python
def _padded_to_unpad(tensor: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
    """gather valid tokens from a padded `(batch, seq, ...)` tensor into a packed `(nnz, ...)` tensor."""
    return tensor.reshape(-1, *tensor.shape[2:])[indices]
```
This is just a one-liner utility. Let's use it directly at the call sites.
```diff
- (_, seqlens_k), (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k) = (
-     _prepare_for_flash_attn_or_sage_varlen(
-         batch_size, seq_len_q, seq_len_kv, attn_mask=attn_mask, device=query.device
+ if _parallel_config is not None:
```
Let's follow this pattern:
```python
if attn_mask is not None:
    attn_mask_2d = _normalize_attn_mask(attn_mask, batch_size, seq_len_kv)
    (_, _), (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k) = (
        _prepare_for_flash_attn_or_sage_varlen_with_mask(batch_size, seq_len_q, attn_mask_2d, query.device)
    )
    indices_k = attn_mask_2d.flatten().nonzero(as_tuple=False).flatten()
    key_packed = _padded_to_unpad(key, indices_k)
    value_packed = _padded_to_unpad(value, indices_k)
else:
    (_, _), (cu_seqlens_q, cu_seqlens_k), (max_seqlen_q, max_seqlen_k) = (
        _prepare_for_flash_attn_or_sage_varlen_without_mask(batch_size, seq_len_q, seq_len_kv, query.device)
    )
    key_packed = key.flatten(0, 1)
    value_packed = value.flatten(0, 1)
```
For example, assume batch_size=2, seq_len_kv=4. There are two branches.

If there is a mask like

```text
batch 0: [T, T, T, F]  ← 3 real tokens
batch 1: [T, T, F, F]  ← 2 real tokens
```

the masked branch:

- normalizes the 4D mask to a 2D mask `[batch, seq_kv]` if necessary
- computes `cu_seqlens_k = [0, 3, 5]` (cumulative token counts: 0 → 3 → 5)
- finds `indices_k = [0, 1, 2, 4, 5]` (the flat indices of the True positions)
- gathers only those rows → `key_packed` with shape (5, heads, dim)

If there is no mask, the other branch:

- computes `cu_seqlens_k = [0, 4, 8]`
- finds `indices_k = [0, 1, 2, 3, 4, 5, 6, 7]`
- computes `key_packed` with shape (8, heads, dim)

Then both feed into the varlen kernel.

It is a vectorized way of handling key/value packing compared with the old for-loop method. And it can also handle a QwenImage-like mask of [T, T, T, F, F, T], which the old way cannot. (A runnable sketch of the masked branch follows below.)
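A minimal runnable sketch of the masked branch with the numbers above (shapes and the inlined gather are illustrative, not the PR's exact code):

```python
import torch
import torch.nn.functional as F

batch_size, seq_len_kv, heads, dim = 2, 4, 8, 64
key = torch.randn(batch_size, seq_len_kv, heads, dim)
attn_mask_2d = torch.tensor([[True, True, True, False],   # batch 0: 3 real tokens
                             [True, True, False, False]])  # batch 1: 2 real tokens

# cumulative token counts per batch entry -> cu_seqlens_k = [0, 3, 5]
seqlens_k = attn_mask_2d.sum(dim=1, dtype=torch.int32)
cu_seqlens_k = F.pad(torch.cumsum(seqlens_k, dim=0, dtype=torch.int32), (1, 0))

# flat indices of the True positions -> indices_k = [0, 1, 2, 4, 5]
indices_k = attn_mask_2d.flatten().nonzero(as_tuple=False).flatten()

# gather only the valid rows -> key_packed has shape (nnz, heads, dim) == (5, 8, 64)
key_packed = key.reshape(-1, heads, dim)[indices_k]

assert cu_seqlens_k.tolist() == [0, 3, 5]
assert indices_k.tolist() == [0, 1, 2, 4, 5]
assert key_packed.shape == (5, heads, dim)
```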
```python
marks=pytest.mark.skipif(not is_kernels_available(), reason="`kernels` is not available."),
),
pytest.param(
    "flash_varlen_hub",
```
Should varlen tests get their own testing mixin class?
I think the varlen kernel can handle all the cases supported by the non-varlen kernel. Personally, I prefer to put them together.
| """Context Parallel inference x attention backends tests for QwenImage Transformer""" | ||
|
|
||
| # flash_hub and _flash_3_hub do not support attn_mask | ||
| unsupported_attn_backends = ["flash_hub", "_flash_3_hub"] |
Any non-varlen attention backend would fail, no? If so, I would rather do something like:

```python
if "varlen" not in attention_backend:
    pytest.skip(...)
```
Like FluxPipeline, it can also support varlen kernels after this change.

I'm not sure what the most suitable place is to put this.
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>


What does this PR do?
This PR adds support for attention mask input when using the attention backend with `set_attention_backend("flash")`. With this change, `QwenImagePipeline` can run with the flash backend w/ or w/o Ulysses SP.

For FlashAttention 2, it is not feasible to use `_wrapped_flash_attn_forward` directly when a mask is applied. To maintain compatibility with the current interface, we introduce an additional branch for FlashAttention to handle attention masks.

I haven't tested with ring attention, so it is left as unimplemented.
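A rough usage sketch matching the description above (the checkpoint and prompt are placeholders, not taken from this PR):

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to("cuda")
# After this PR, the flash backend also accepts inputs with an attention mask.
pipe.transformer.set_attention_backend("flash")
image = pipe("a photo of a cat", num_inference_steps=20).images[0]
```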
Fixes # (issue)
Before submitting

- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@sayakpaul
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.