[bug]: Slow generation on hardware without native bf16 support for some models

### Is there an existing issue for this problem?

- [x] I have searched the existing issues

### Install method

Invoke's Launcher

### Operating system

Windows

### GPU vendor

Nvidia (CUDA)

### GPU model

_No response_

### GPU VRAM

_No response_

### Version number

6.13 rc2

### Browser

_No response_

### System Information

_No response_

### What happened

# Slow generation on hardware without native bf16 acceleration

## Issue

`TorchDevice.choose_bfloat16_safe_dtype` in `invokeai/backend/util/devices.py` decides whether to use bf16 by allocating a bf16 tensor on the target device and catching `TypeError`:

```python
try:
    torch.tensor([1.0], dtype=torch.bfloat16, device=device)
    return torch.bfloat16
except TypeError:
    return torch.float16 if device.type == "cuda" else torch.float32
```

This detects whether the device can **allocate** bf16, not whether it can **accelerate** bf16 ops. PyTorch allocates bf16 on a wide range of hardware that has no bf16 tensor-core path, so operations silently fall back to unaccelerated kernels.

Affected hardware:

- **NVIDIA Turing** (RTX 20-series, T4, Quadro RTX, Titan RTX). bf16 tensor cores arrived with Ampere (compute capability 8.0); Turing is 7.5.
- **AMD pre-RDNA3** (gfx1030 / RDNA2 and older).
- **All Apple Silicon (M1, M2, M3, M4).** No shipped Apple chip has hardware bf16. MPS allocates bf16 tensors fine but routes operations through unoptimized kernels — documented slowdowns of up to ~10× vs fp16 (e.g. [pytorch/pytorch#141864](https://github.com/pytorch/pytorch/issues/141864), [Qwen Image Edit went 80s→10min on M1 Max after a default-dtype change to bf16](https://lilting.ch/en/articles/comfyui-qwen-mps-bf16-slowdown)).

## Consequence

The helper is the dtype gate for Anima, Z-Image (Base/Turbo/Edit), Qwen Image, Flux2 Klein, and the Anima/Z-Image/Flux2 text encoders. On affected hardware these models load weights in bf16 and run every forward pass in unaccelerated bf16. On Turing the matmuls bypass the tensor cores; typical slowdown is 2–4× on Turing and up to ~10× on MPS.

A naive global fallback to fp16 doesn't work because not every model supports fp16 inference. I verified this locally on an RTX 4070 by forcing `choose_bfloat16_safe_dtype` to return `torch.float16` unconditionally and regenerating against bf16 references with identical seeds:

- **Anima:** clean output, visually consistent with bf16 across multiple test prompts.
- **Z-Image Turbo:** pure black output — the model overflows in fp16 (likely in the attention softmax) and produces NaN-driven garbage.

So fp16 is safe for some models and unsafe for others. The fix has to be per-model.

## Fix

### 1. Replace the allocation probe with a real hardware capability check

```python
if device.type == "cuda" and torch.cuda.is_available():
    props = torch.cuda.get_device_properties(device)
    if hasattr(props, "gcnArchName") and any(
        a in props.gcnArchName for a in AMD_NO_BF16_ARCHS
    ):
        has_hw_bf16 = False
    else:
        has_hw_bf16 = props.major >= 8  # Ampere+
elif device.type == "mps":
    has_hw_bf16 = False  # No shipped Apple Silicon has hardware bf16
else:
    has_hw_bf16 = False
```

`torch.cuda.is_bf16_supported()` is unsuitable here because older PyTorch versions returned `True` on Turing (it accepted software emulation; `including_emulation=False` was added later). Compute capability is the actual hardware line.
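To make the check easy to unit-test without a GPU, the decision could be split from the probing. The sketch below is only an illustration of that idea: `has_hardware_bf16` is a hypothetical name, and the contents of `AMD_NO_BF16_ARCHS` are a placeholder, not a vetted list. It takes plain values instead of a `torch.device`.

```python
from typing import Optional

# Hypothetical, incomplete deny-list for illustration only; the real set of
# AMD targets without accelerated bf16 would need to be confirmed.
AMD_NO_BF16_ARCHS = ("gfx1030",)


def has_hardware_bf16(
    device_type: str,
    cc_major: Optional[int] = None,
    gcn_arch_name: Optional[str] = None,
) -> bool:
    """Return True only when the device is expected to accelerate bf16 matmuls."""
    if device_type != "cuda" or cc_major is None:
        # MPS, CPU, and anything unknown are treated as having no
        # accelerated bf16 path, mirroring the proposal above.
        return False
    if gcn_arch_name and any(a in gcn_arch_name for a in AMD_NO_BF16_ARCHS):
        return False  # AMD target on the deny-list
    return cc_major >= 8  # Ampere (compute capability 8.0) and newer


print(has_hardware_bf16("cuda", cc_major=7))  # Turing -> False
print(has_hardware_bf16("cuda", cc_major=8))  # Ampere -> True
print(has_hardware_bf16("cuda", cc_major=10, gcn_arch_name="gfx1030"))  # False
print(has_hardware_bf16("mps"))  # False
```

In the real helper the inputs would come from `torch.cuda.get_device_properties(device)`: `props.major` for `cc_major` and `getattr(props, "gcnArchName", None)` for `gcn_arch_name`.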

### 2. Make the fallback per-model

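Each model loader declares its acceptable inference dtypes in order of preference; the helper returns the first one the hardware can accelerate. If none of the declared dtypes is accelerated, it falls back to the first entry so the model still runs.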

```python
# anima
SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]

# flux2_klein
SUPPORTED_DTYPES = [torch.bfloat16, torch.float16, torch.float32]

# z_image
SUPPORTED_DTYPES = [torch.bfloat16]  # fp16 unsupported; fp32 left out for now (see note below)

# qwen_image
SUPPORTED_DTYPES = [torch.bfloat16]  # fp16 unsupported; fp32 left out for now (see note below)
```
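A rough sketch of how the selection could work is below. It is not the actual helper: `choose_inference_dtype` is a hypothetical name, dtypes are plain strings standing in for `torch.dtype` values so the snippet runs without PyTorch, and it assumes fp16 is accelerated on CUDA and MPS devices but not on CPU. The Z-Image list is bf16-only here so that the result matches the table under "Resulting behavior".

```python
from typing import Sequence


def _is_accelerated(dtype: str, device_type: str, has_hw_bf16: bool) -> bool:
    """Assumed acceleration rules for this sketch (not measured)."""
    if dtype == "bfloat16":
        return has_hw_bf16
    if dtype == "float16":
        return device_type in ("cuda", "mps")  # assumption: fp16 is fast on GPUs
    return True  # float32 is treated as always usable


def choose_inference_dtype(
    supported: Sequence[str], device_type: str, has_hw_bf16: bool
) -> str:
    """Return the first supported dtype the device accelerates.

    Falls back to the model's first (preferred) dtype when nothing qualifies,
    so a bf16-only model keeps running in unaccelerated bf16.
    """
    for dtype in supported:
        if _is_accelerated(dtype, device_type, has_hw_bf16):
            return dtype
    return supported[0]


ANIMA = ["bfloat16", "float16", "float32"]
Z_IMAGE = ["bfloat16"]

print(choose_inference_dtype(ANIMA, "cuda", has_hw_bf16=False))    # float16
print(choose_inference_dtype(ANIMA, "cuda", has_hw_bf16=True))     # bfloat16
print(choose_inference_dtype(Z_IMAGE, "cuda", has_hw_bf16=False))  # bfloat16
```

Keeping the preference order in the model loader means the helper needs no per-model knowledge; adding fp32 to a model later is a one-line change to its list.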

### Resulting behavior

On affected hardware (Turing, pre-RDNA3, Apple Silicon) the changes are:

| Model | Before | After |
|---|---|---|
| Anima | bf16 unaccelerated | fp16 accelerated |
| Flux2 Klein | bf16 unaccelerated | fp16 accelerated |
| Z-Image | bf16 unaccelerated | bf16 unchanged |
| Qwen Image | bf16 unaccelerated | bf16 unchanged |

Ampere+ behavior is unchanged for all models. See the note below on why Z-Image and Qwen Image stay on bf16 rather than falling through to fp32.

**Note on Z-Image / Qwen Image:** these models exclude fp16, so the natural next entry is fp32. But fp32 doubles the weight footprint relative to bf16, which on consumer Turing cards (8–11 GB) and most Apple Silicon configurations forces CPU offloading and ends up slower than unaccelerated bf16 that fits in VRAM. The cleanest first iteration is to keep bf16 as the only entry for these models and accept the current behavior on affected hardware.
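For a sense of scale, the footprint arithmetic can be sketched as follows. The parameter count is a made-up round number for illustration, not the size of Z-Image or Qwen Image, and the figures cover weights only (no activations, text encoder, or VAE).

```python
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_footprint_gib(num_params: int, dtype: str) -> float:
    """Size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


HYPOTHETICAL_PARAMS = 4_000_000_000  # illustrative only

for dtype in ("bfloat16", "float32"):
    print(f"{dtype}: {weight_footprint_gib(HYPOTHETICAL_PARAMS, dtype):.1f} GiB")
# bfloat16: 7.5 GiB
# float32: 14.9 GiB
```

Under these assumptions the bf16 weights just fit on an 8–11 GB card, while the fp32 copy does not and would have to be partially offloaded.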

### Scope

- `invokeai/backend/util/devices.py` — replace allocation probe with capability check
- Affected model loaders and invocations (`anima.py`, `z_image.py`, `qwen_image.py`, Anima/Z-Image/Flux2 text encoders) — declare `SUPPORTED_DTYPES` and call the updated helper
- No changes to model architectures, schedulers, or VAE paths


### What you expected to happen

Invoke should automatically choose the fastest precision that each model supports on the user's hardware.

### How to reproduce the problem

_No response_

### Additional context

_No response_

### Discord username

_No response_

[bug]: Slow generation on hardware without native bf16 support for some models #9153
