Is there an existing issue for this problem?
Install method
Invoke's Launcher
Operating system
Windows
GPU vendor
Nvidia (CUDA)
GPU model
No response
GPU VRAM
No response
Version number
6.13 rc2
Browser
No response
System Information
No response
What happened
Slow generation on hardware without native bf16 acceleration
Issue
TorchDevice.choose_bfloat16_safe_dtype in invokeai/backend/util/devices.py decides whether to use bf16 by allocating a bf16 tensor on the target device and catching TypeError:

    try:
        torch.tensor([1.0], dtype=torch.bfloat16, device=device)
        return torch.bfloat16
    except TypeError:
        return torch.float16 if device.type == "cuda" else torch.float32

This detects whether the device can allocate bf16, not whether it can accelerate bf16 ops. PyTorch allocates bf16 on a wide range of hardware that has no bf16 tensor-core path, so operations silently fall back to unaccelerated kernels.
Affected hardware: NVIDIA Turing, pre-RDNA3 AMD GPUs, and Apple Silicon (MPS).
Consequence
The helper is the dtype gate for Anima, Z-Image (Base/Turbo/Edit), Qwen Image, Flux2 Klein, and the Anima/Z-Image/Flux2 text encoders. On affected hardware these models load weights in bf16 and run every forward pass in unaccelerated bf16. Matmuls bypass the tensor cores; the typical slowdown is 2–4× on Turing and up to ~10× on MPS.
A naive global fallback to fp16 doesn't work because not every model supports fp16 inference. I verified this locally on an RTX 4070 by forcing choose_bfloat16_safe_dtype to return torch.float16 unconditionally and regenerating against bf16 references with identical seeds:
- Anima: clean output, visually consistent with bf16 across multiple test prompts.
- Z-Image Turbo: pure black output — the model overflows in fp16 (likely in the attention softmax) and produces NaN-driven garbage.
So fp16 is safe for some models and unsafe for others. The fix has to be per-model.
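The range difference behind the fp16 failure can be illustrated without a GPU. This is a minimal sketch using only the standard library, not code from Invoke: `fits_in_fp16` and `round_trip_bf16` are hypothetical helper names, and the bf16 emulation simply truncates a float32 to its top 16 bits. It shows that a value just above 65504 cannot be stored in fp16 while bf16 keeps it finite at reduced precision. It does not establish where Z-Image Turbo actually overflows.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_fp16(x: float) -> bool:
    """Return True if x can be packed as a finite fp16 value."""
    try:
        struct.pack("<e", x)  # "e" is the IEEE 754 binary16 format
        return True
    except OverflowError:
        return False


def round_trip_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding.

    This truncates instead of rounding, which is enough to show the range.
    """
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]


if __name__ == "__main__":
    activation = 70000.0  # hypothetical intermediate value, slightly above FP16_MAX
    print("fits in fp16:", fits_in_fp16(activation))
    print("bf16 round trip:", round_trip_bf16(activation))
```

bf16 shares float32's 8-bit exponent, so large intermediates survive, whereas fp16's 5-bit exponent does not cover them.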
Fix
1. Replace the allocation probe with a real hardware capability check

    # AMD_NO_BF16_ARCHS: gcnArchName substrings for AMD GPUs without bf16 hardware (to be defined)
    if device.type == "cuda" and torch.cuda.is_available():
        props = torch.cuda.get_device_properties(device)
        if hasattr(props, "gcnArchName") and any(
            a in props.gcnArchName for a in AMD_NO_BF16_ARCHS
        ):
            has_hw_bf16 = False
        else:
            has_hw_bf16 = props.major >= 8  # Ampere+
    elif device.type == "mps":
        has_hw_bf16 = False  # No shipped Apple Silicon has hardware bf16
    else:
        has_hw_bf16 = False

torch.cuda.is_bf16_supported() is unsuitable here because older PyTorch versions returned True on Turing (it accepted software emulation; including_emulation=False was added later). Compute capability is the actual hardware line.
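To make the decision logic easy to unit-test without a GPU, the check above could be factored into a pure function that takes the already-queried device facts. This is a sketch under assumptions: `has_hw_bf16`, its parameter names, and the `amd_no_bf16_archs` argument are hypothetical, and the architecture string used in the usage lines (`gfx1030`, an RDNA2 part) is only an illustrative pre-RDNA3 entry, not a proposed final list.

```python
from typing import Optional, Sequence


def has_hw_bf16(
    device_type: str,
    cc_major: Optional[int] = None,
    gcn_arch: Optional[str] = None,
    amd_no_bf16_archs: Sequence[str] = (),
) -> bool:
    """Decide whether the device has a hardware bf16 path.

    device_type: "cuda", "mps", "cpu", ...
    cc_major: major compute capability reported for the device, if any.
    gcn_arch: AMD architecture string (ROCm builds), if any.
    amd_no_bf16_archs: substrings identifying AMD archs without bf16 hardware.
    """
    if device_type != "cuda":
        # Mirrors the proposal: MPS and every other backend count as "no".
        return False
    if gcn_arch is not None and any(a in gcn_arch for a in amd_no_bf16_archs):
        return False
    # Ampere and newer report a major compute capability of 8 or higher.
    return cc_major is not None and cc_major >= 8


if __name__ == "__main__":
    print(has_hw_bf16("cuda", cc_major=7))  # Turing-class device
    print(has_hw_bf16("cuda", cc_major=8))  # Ampere-class device
    print(has_hw_bf16("cuda", cc_major=10, gcn_arch="gfx1030",
                      amd_no_bf16_archs=("gfx1030",)))
    print(has_hw_bf16("mps"))
```

Keeping the PyTorch queries in the caller and the policy in a pure function means the table of expected results per architecture can be tested on any CI machine.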
2. Make the fallback per-model
Each model loader declares its acceptable inference dtypes; the helper returns the first one the hardware can accelerate.

    # anima
    SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]
    # flux2_klein
    SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]
    # z_image
    SUPPORTED_DTYPES = [torch.bfloat16, torch.float32]  # fp16 unsupported
    # qwen_image
    SUPPORTED_DTYPES = [torch.bfloat16, torch.float32]  # fp16 unsupported

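One possible shape for the updated helper is sketched below. It is not the existing Invoke API: `DType` is a stand-in for `torch.dtype` so the example runs without PyTorch, and `is_accelerated` and `choose_supported_dtype` are hypothetical names. The sketch assumes fp16 counts as accelerated on CUDA and MPS, that fp32 is always acceptable, and that the first entry is returned when nothing in the list is accelerated.

```python
from enum import Enum
from typing import Sequence


class DType(Enum):
    """Stand-in for torch.dtype so the sketch runs without PyTorch."""
    BF16 = "bfloat16"
    FP16 = "float16"
    FP32 = "float32"


def is_accelerated(dtype: DType, device_type: str, has_hw_bf16: bool) -> bool:
    """Return True if the device is assumed to have a fast path for dtype."""
    if dtype is DType.BF16:
        return has_hw_bf16
    if dtype is DType.FP16:
        # Assumption: fp16 has a fast path on CUDA and MPS, but not on CPU.
        return device_type in ("cuda", "mps")
    return True  # Assumption: fp32 is always an acceptable last resort.


def choose_supported_dtype(
    supported: Sequence[DType], device_type: str, has_hw_bf16: bool
) -> DType:
    """Pick the first declared dtype the hardware can accelerate."""
    if not supported:
        raise ValueError("a model must declare at least one supported dtype")
    for dtype in supported:
        if is_accelerated(dtype, device_type, has_hw_bf16):
            return dtype
    # Nothing is accelerated: keep the model's preferred (first) dtype.
    return supported[0]


if __name__ == "__main__":
    anima = [DType.BF16, DType.FP16, DType.FP32]
    print(choose_supported_dtype(anima, "cuda", has_hw_bf16=False))
    print(choose_supported_dtype([DType.BF16], "cuda", has_hw_bf16=False))
```

In this sketch a `[bf16, fp32]` list selects fp32 on hardware without bf16 acceleration, while a list containing only bf16 stays on bf16, so the declared list is what decides between those two outcomes for Z-Image and Qwen Image.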
Resulting behavior
On affected hardware (Turing, pre-RDNA3, Apple Silicon) the changes are:
| Model | Before | After |
| --- | --- | --- |
| Anima | bf16 unaccelerated | fp16 accelerated |
| Flux2 Klein | bf16 unaccelerated | fp16 accelerated |
| Z-Image | bf16 unaccelerated | bf16 unchanged |
| Qwen Image | bf16 unaccelerated | bf16 unchanged |
Ampere+ behavior is unchanged for all models. See the note below on why Z-Image and Qwen Image stay on bf16 rather than falling through to fp32.
Note on Z-Image / Qwen Image: these models exclude fp16, so the natural next entry is fp32. But fp32 doubles the weight footprint, which on Turing-class cards (8–11 GB) and most Apple Silicon configurations forces CPU offloading and ends up slower than unaccelerated bf16 that fits in VRAM. The cleanest first iteration is to keep bf16 as the only entry for these models, accepting current behavior on affected hardware.
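The footprint argument is simple arithmetic: weight memory is parameter count times bytes per element. The sketch below works it through for a hypothetical 4-billion-parameter model. The parameter count is an assumption chosen for illustration, not the size of any model named here, and the figure covers weights only (no activations, text encoder, or VAE).

```python
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_footprint_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    params = 4_000_000_000  # hypothetical model size
    for dtype in ("bf16", "fp32"):
        print(f"{dtype}: {weight_footprint_gib(params, dtype):.2f} GiB")
```

For this hypothetical size the bf16 weights come to about 7.45 GiB and the fp32 weights to about 14.9 GiB, so only the bf16 copy would fit on an 11 GB card.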
Scope
- invokeai/backend/util/devices.py: replace allocation probe with capability check
- Affected model loaders and invocations (anima.py, z_image.py, qwen_image.py, Anima/Z-Image/Flux2 text encoders): declare SUPPORTED_DTYPES and call the updated helper
- No changes to model architectures, schedulers, or VAE paths
What you expected to happen
Invoke should automatically choose the fastest precision that a given model supports on the user's hardware.
How to reproduce the problem
No response
Additional context
No response
Discord username
No response