[bug]: Slow generation on hardware without native bf16 support for some models #9153

@kappacommit

Is there an existing issue for this problem?

  • I have searched the existing issues

Install method

Invoke's Launcher

Operating system

Windows

GPU vendor

Nvidia (CUDA)

GPU model

No response

GPU VRAM

No response

Version number

6.13 rc2

Browser

No response

System Information

No response

What happened

Slow generation on hardware without native bf16 acceleration

Issue

TorchDevice.choose_bfloat16_safe_dtype in invokeai/backend/util/devices.py decides whether to use bf16 by allocating a bf16 tensor on the target device and catching TypeError:

try:
    torch.tensor([1.0], dtype=torch.bfloat16, device=device)
    return torch.bfloat16
except TypeError:
    return torch.float16 if device.type == "cuda" else torch.float32

This detects whether the device can allocate bf16, not whether it can accelerate bf16 ops. PyTorch allocates bf16 on a wide range of hardware that has no bf16 tensor-core path, so operations silently fall back to unaccelerated kernels.

Affected hardware: NVIDIA Turing, AMD pre-RDNA3, and Apple Silicon (MPS).

Consequence

The helper is the dtype gate for Anima, Z-Image (Base/Turbo/Edit), Qwen Image, Flux2 Klein, and the Anima/Z-Image/Flux2 text encoders. On affected hardware these models load weights in bf16 and run every forward pass in unaccelerated bf16. Matmuls bypass tensor cores; typical slowdown is 2–4× on Turing and up to ~10× on MPS.

A naive global fallback to fp16 doesn't work because not every model supports fp16 inference. I verified this locally on an RTX 4070 by forcing choose_bfloat16_safe_dtype to return torch.float16 unconditionally and regenerating against bf16 references with identical seeds:

  • Anima: clean output, visually consistent with bf16 across multiple test prompts.
  • Z-Image Turbo: pure black output — the model overflows in fp16 (likely in the attention softmax) and produces NaN-driven garbage.

So fp16 is safe for some models and unsafe for others. The fix has to be per-model.

Fix

1. Replace the allocation probe with a real hardware capability check

if device.type == "cuda" and torch.cuda.is_available():
    props = torch.cuda.get_device_properties(device)
    if hasattr(props, "gcnArchName") and any(
        a in props.gcnArchName for a in AMD_NO_BF16_ARCHS
    ):
        has_hw_bf16 = False
    else:
        has_hw_bf16 = props.major >= 8  # Ampere+
elif device.type == "mps":
    has_hw_bf16 = False  # No shipped Apple Silicon has hardware bf16
else:
    has_hw_bf16 = False

torch.cuda.is_bf16_supported() is unsuitable here because older PyTorch versions returned True on Turing: they counted software emulation as support, and the including_emulation=False option was added only later. Compute capability marks the actual hardware boundary.
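A minimal sketch of the check in step 1, written as a pure function so the decision can be exercised without a GPU. The name has_hw_bf16, its parameters, and the two entries in AMD_NO_BF16_ARCHS are assumptions for illustration; the real arch list still has to be compiled.

```python
from typing import Optional

# Hypothetical, incomplete placeholder: gfx1010 (RDNA1) and gfx1030 (RDNA2)
# stand in for whatever AMD_NO_BF16_ARCHS ends up containing.
AMD_NO_BF16_ARCHS = ("gfx1010", "gfx1030")


def has_hw_bf16(
    device_type: str,
    cc_major: int = 0,
    gcn_arch_name: Optional[str] = None,
) -> bool:
    """Return True only if the device has a hardware-accelerated bf16 path.

    device_type   -- device.type, e.g. "cuda", "mps", "cpu"
    cc_major      -- props.major from torch.cuda.get_device_properties
    gcn_arch_name -- props.gcnArchName when the attribute exists, else None
    """
    if device_type != "cuda":
        # MPS and CPU are treated as having no accelerated bf16.
        return False
    if gcn_arch_name and any(a in gcn_arch_name for a in AMD_NO_BF16_ARCHS):
        # AMD architecture known to lack bf16 acceleration.
        return False
    # Compute capability 8.0 and above (Ampere+).
    return cc_major >= 8
```

In devices.py the arguments would come from the device itself, roughly has_hw_bf16(device.type, props.major, getattr(props, "gcnArchName", None)) with props = torch.cuda.get_device_properties(device). Keeping the decision separate from the torch query makes it easy to unit-test every hardware class on a CI machine with no GPU.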

2. Make the fallback per-model

Each model loader declares its acceptable inference dtypes; the helper returns the first one the hardware can accelerate.

# anima
SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]

# flux2_klein
SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]

# z_image
SUPPORTED_DTYPES = [torch.bfloat16]  # fp16 unsupported; fp32 omitted, see note below

# qwen_image
SUPPORTED_DTYPES = [torch.bfloat16]  # fp16 unsupported; fp32 omitted, see note below
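A rough sketch of how the helper could walk a loader's list. The name choose_inference_dtype and the accelerated argument are hypothetical, and falling back to the first entry when nothing in the list is accelerated is my assumption about the intended behavior. Strings stand in for torch dtypes so the example runs without torch; the function itself works with any hashable values.

```python
from typing import Collection, Sequence, TypeVar

T = TypeVar("T")


def choose_inference_dtype(supported: Sequence[T], accelerated: Collection[T]) -> T:
    """Return the first entry of `supported` that is in `accelerated`.

    If no entry is accelerated, fall back to the model's first (preferred)
    entry rather than failing. That fallback is an assumption of this sketch.
    """
    if not supported:
        raise ValueError("a model must declare at least one dtype")
    for dtype in supported:
        if dtype in accelerated:
            return dtype
    return supported[0]


# Illustrative capability sets; fp16 and fp32 are assumed usable on both.
TURING = {"float16", "float32"}
AMPERE = {"bfloat16", "float16", "float32"}

# A model that accepts fp16 moves to fp16 on Turing and stays on bf16 on Ampere.
anima_like = ["bfloat16", "float16", "float32"]
on_turing = choose_inference_dtype(anima_like, TURING)
on_ampere = choose_inference_dtype(anima_like, AMPERE)

# A model that lists only bf16 keeps bf16 everywhere.
bf16_only = choose_inference_dtype(["bfloat16"], TURING)
```

Passing the accelerated set in, instead of querying the device inside the helper, keeps the list-walking logic independent of how hardware detection is done.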

Resulting behavior

On affected hardware (Turing, pre-RDNA3, Apple Silicon) the changes are:

Model         Before               After
Anima         bf16 unaccelerated   fp16 accelerated
Flux2 Klein   bf16 unaccelerated   fp16 accelerated
Z-Image       bf16 unaccelerated   bf16 unchanged
Qwen Image    bf16 unaccelerated   bf16 unchanged

Ampere+ behavior is unchanged for all models. See the note below on why Z-Image and Qwen Image stay on bf16 rather than falling through to fp32.

Note on Z-Image / Qwen Image: these models exclude fp16, so the natural next entry is fp32. But fp32 doubles the weight footprint, which on Turing-class cards (8–11 GB) and most Apple Silicon configurations forces CPU offloading and ends up slower than unaccelerated bf16 that fits in VRAM. The cleanest first iteration is to keep bf16 as the only entry for these models, accepting current behavior on affected hardware.
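To put numbers on the footprint argument, here is a small back-of-the-envelope calculation. The 4-billion parameter count is a hypothetical round figure, not the size of any model named above, and the estimate covers weights only (no activations, VAE, or text encoder).

```python
# Bytes per element for the dtypes discussed in this issue.
BYTES_PER_ELEMENT = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_footprint_gib(num_params: int, dtype: str) -> float:
    """Approximate weight memory in GiB: parameters x bytes per element."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


# Hypothetical model size, chosen only to illustrate the ratio.
HYPOTHETICAL_PARAMS = 4_000_000_000

bf16_gib = weight_footprint_gib(HYPOTHETICAL_PARAMS, "bfloat16")  # about 7.45 GiB
fp32_gib = weight_footprint_gib(HYPOTHETICAL_PARAMS, "float32")   # about 14.90 GiB
```

For a model of that size, the bf16 weights are in the range of an 8–11 GB card while the fp32 weights are not, which is the situation where offloading starts to dominate generation time.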

Scope

  • invokeai/backend/util/devices.py — replace allocation probe with capability check
  • Affected model loaders and invocations (anima.py, z_image.py, qwen_image.py, Anima/Z-Image/Flux2 text encoders) — declare SUPPORTED_DTYPES and call the updated helper
  • No changes to model architectures, schedulers, or VAE paths

What you expected to happen

Invoke should automatically choose the fastest precision that each model supports on the user's hardware.

How to reproduce the problem

No response

Additional context

No response

Discord username

No response
