errors: define a coherent strategy for retryable 409 Conflict errors #779

## Background

The discussion in #667, #771, and #772 has established partial handling for 409 Conflict errors. The current state (as of the merge of #772) is:
- 409 Conflict is generally **non-retryable** (terminal)
- **Exception**: Neutron quota-exceeded 409s are retryable (`isNeutronQuotaError`), because quota can free up without spec changes and Neutron is the only OpenStack service that uses 409 for quota (others use 413 or 403)
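As a rough illustration of the current state, the rule can be sketched as below. The `httpError` type, its fields, and the `OverQuota` substring match are assumptions made for this example; the real `IsRetryable` and `isNeutronQuotaError` may inspect the error quite differently.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// httpError is a hypothetical stand-in for the error returned by the
// OpenStack client; the field names are assumptions for this sketch.
type httpError struct {
	Service string // e.g. "neutron", "octavia"
	Status  int
	Body    string
}

// isNeutronQuotaError reports whether e looks like a Neutron quota 409.
// Matching on "OverQuota" in the body is an assumption, not the real check.
func isNeutronQuotaError(e httpError) bool {
	return e.Service == "neutron" &&
		e.Status == http.StatusConflict &&
		strings.Contains(e.Body, "OverQuota")
}

// isRetryable mirrors the rule described above: a 409 is terminal
// unless it is a Neutron quota error.
func isRetryable(e httpError) bool {
	if e.Status == http.StatusConflict {
		return isNeutronQuotaError(e)
	}
	// Other status codes are out of scope here; this sketch simply
	// treats them as retryable.
	return true
}

func main() {
	errs := []httpError{
		{Service: "neutron", Status: 409, Body: `{"NeutronError": {"type": "OverQuota"}}`},
		{Service: "octavia", Status: 409, Body: "resource is immutable"},
	}
	for _, e := range errs {
		fmt.Printf("%s %d retryable=%v\n", e.Service, e.Status, isRetryable(e))
	}
}
```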

## Open question 1: transitional-state 409s

During the review of #771, a closely related scenario was raised but explicitly deferred to a follow-up: OpenStack services return 409 when an operation is attempted on a resource in a transitional state (e.g. deleting a LoadBalancer whose provisioning status is still `PENDING_CREATE`). These errors are retryable simply by waiting: the conflict resolves itself without any spec change. Unlike quota errors, transitional-state 409s are not specific to one service (Octavia, Neutron, Nova, etc. all have transitional states).

## Open question 2: whitelist vs blacklist for retryable 409s

The Neutron quota carve-out uses a **whitelist** approach: 409 is terminal by default, with specific known-retryable cases carved out. But the same logic used to justify retrying quota errors applies much more broadly:

- Quota exceeded → resolved by freeing quota externally (no spec change)
- Duplicate name → resolved by deleting the conflicting resource externally (no spec change)
- Resource in transitional state → resolved by waiting (no spec change)

If "can be resolved without a spec change" is the criterion for retryability, then most 409s arguably qualify. This raises the question of whether a **blacklist** approach is more appropriate: treat 409 as retryable by default, and mark only specific known-terminal cases as non-retryable.

It is worth asking: are there 409 scenarios that are *only* solvable by a spec change? If not, the whitelist approach may be the wrong default.
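The practical difference between the two approaches shows up for 409s that nobody has classified yet. The following sketch makes that concrete; `conflictKind` and both policy functions are hypothetical, and the empty terminal set reflects the open question above rather than an established fact.

```go
package main

import "fmt"

// conflictKind is a hypothetical classification of a 409 response.
type conflictKind int

const (
	unknownConflict conflictKind = iota
	quotaExceeded
	duplicateName
	transitionalState
)

var kindNames = []string{"unknown", "quota exceeded", "duplicate name", "transitional state"}

// whitelistRetryable: terminal by default, with known-retryable kinds
// carved out (only quota today).
func whitelistRetryable(k conflictKind) bool {
	return k == quotaExceeded
}

// blacklistRetryable: retryable by default, except for kinds that have
// been identified as fixable only by a spec change.
func blacklistRetryable(k conflictKind, terminal map[conflictKind]bool) bool {
	return !terminal[k]
}

func main() {
	// Assumption: no terminal-only kinds have been identified yet.
	terminal := map[conflictKind]bool{}
	for k := unknownConflict; k <= transitionalState; k++ {
		fmt.Printf("%-20s whitelist=%-5v blacklist=%v\n",
			kindNames[k], whitelistRetryable(k), blacklistRetryable(k, terminal))
	}
}
```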

## Options on the table

1. **Whitelist (current direction)**: 409 is terminal by default; carve out specific retryable patterns by inspecting the error body. Pro: a 409 is never retried unless it is known to be transient. Con: ongoing maintenance burden; easy to miss cases; risks being inconsistent across services.

2. **Blacklist**: 409 is retryable by default; mark only specific known-terminal patterns as non-retryable. Pro: consistent with how most 409s behave in practice. Con: requires identifying which 409s truly are only fixable via spec change.

3. **All 409s retryable**: Treat every Conflict as transient and retry with exponential backoff. Pro: simple; needs no error-body inspection and covers every transient case. Con: if any 409 can only be fixed by a spec change, it would be retried indefinitely.

## References
- #667 — original bug report; broader discussion of retryability criterion and exponential backoff for all 409s
- #771 — rewrote `IsRetryable`; deferred transitional-state 409s
- #772 — added Neutron OverQuota carve-out; review thread on precedent risk of case-by-case whitelisting
