errors: define a coherent strategy for retryable 409 Conflict errors #779

@eshulman2

Background

The discussion in #667, #771, and #772 has established partial handling for 409 Conflict errors. The current state (as of the merge of #772) is:

  • 409 Conflict is generally non-retryable (terminal)
  • Exception: Neutron quota-exceeded 409s are retryable (isNeutronQuotaError), because quota can free up without spec changes and Neutron is the only OpenStack service that uses 409 for quota (others use 413 or 403)
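The current state described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the `apiError` type, the `looksLikeNeutronQuotaError` helper (a stand-in for `isNeutronQuotaError`, whose real signature and matching logic are not shown in this issue), and the substring match on "quota" are all assumptions.

```go
package main

import (
	"net/http"
	"strings"
)

// apiError is a hypothetical, simplified view of an OpenStack API error.
type apiError struct {
	Service    string // assumed label such as "neutron" or "octavia"
	StatusCode int
	Body       string
}

// looksLikeNeutronQuotaError stands in for the isNeutronQuotaError check.
// Matching on the word "quota" in the body is an assumption for this sketch.
func looksLikeNeutronQuotaError(e apiError) bool {
	return e.Service == "neutron" &&
		e.StatusCode == http.StatusConflict &&
		strings.Contains(strings.ToLower(e.Body), "quota")
}

// isTerminal409 reports whether an error is a 409 that should be treated as
// terminal under the whitelist approach: every 409 is terminal unless it
// matches the Neutron quota carve-out. Non-409 errors are out of scope here
// and always yield false.
func isTerminal409(e apiError) bool {
	if e.StatusCode != http.StatusConflict {
		return false
	}
	return !looksLikeNeutronQuotaError(e)
}
```

With this shape, every new retryable case needs its own predicate next to the quota check, which is the maintenance cost the whitelist option carries.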

Open question 1: transitional-state 409s

During the review of #771, a closely related scenario was raised but explicitly deferred to a follow-up: OpenStack services return 409 when an operation is attempted on a resource in a transitional state (e.g. deleting a LoadBalancer that is still provisioning, which Octavia reports as a PENDING_* provisioning status). These errors are retryable simply by waiting: the conflict resolves itself without any spec change. Unlike quota errors, transitional-state 409s are not specific to one service (Octavia, Neutron, Nova, etc. all have transitional states).
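One possible way to recognise a transitional-state conflict, sketched under assumptions: when a 409 arrives, look at the resource's last observed status and retry after a delay if that status is transitional. The `decision` type, the `onConflict` function, and the fixed 10-second delay are hypothetical. The `PENDING_` prefix matches Octavia's provisioning statuses; other services name their transitional states differently and would need their own checks.

```go
package main

import (
	"strings"
	"time"
)

// decision is a hypothetical result type: whether to retry, and after how long.
type decision struct {
	Retry bool
	After time.Duration
}

// onConflict decides what to do with a 409 based on the resource's last
// observed provisioning status. A PENDING_* status (Octavia's convention)
// means the resource is mid-operation, so waiting should resolve the conflict.
// Any other status falls back to the current default of treating 409 as terminal.
func onConflict(provisioningStatus string) decision {
	if strings.HasPrefix(provisioningStatus, "PENDING_") {
		return decision{Retry: true, After: 10 * time.Second} // delay is an assumed value
	}
	return decision{Retry: false}
}
```

This keeps the decision tied to observed resource state rather than to the wording of the error body, but it costs an extra read of the resource and a per-service notion of "transitional".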

Open question 2: whitelist vs blacklist for retryable 409s

The Neutron quota carve-out uses a whitelist approach: 409 is terminal by default, with specific known-retryable cases carved out. But the same logic used to justify retrying quota errors applies much more broadly:

  • Quota exceeded → resolved by freeing quota externally (no spec change)
  • Duplicate name → resolved by deleting the conflicting resource externally (no spec change)
  • Resource in transitional state → resolved by waiting (no spec change)

If "can be resolved without a spec change" is the criterion for retryability, then most 409s arguably qualify. This raises the question of whether a blacklist approach is more appropriate: treat 409 as retryable by default, and mark only specific known-terminal cases as non-retryable.

It is worth asking: are there 409 scenarios that are only solvable by a spec change? If not, the whitelist approach may be the wrong default.

Options on the table

  1. Whitelist (current direction): 409 is terminal by default; carve out specific retryable patterns by inspecting the error body. Con: ongoing maintenance burden; easy to miss cases; risks being inconsistent across services.

  2. Blacklist: 409 is retryable by default; mark only specific known-terminal patterns as non-retryable. Pro: consistent with how most 409s behave in practice. Con: requires identifying which 409s truly are only fixable via spec change.

  3. All 409s retryable: treat every Conflict as transient and retry with exponential backoff. Pro: simple and handles all cases. Con: if any 409 can only be fixed by a spec change, the controller would retry it indefinitely.
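The three options differ only in the default and in which list of patterns is consulted. The sketch below puts them side by side, together with a capped exponential backoff. All names are hypothetical, the body patterns are placeholders, and the terminal list in particular is invented for illustration, since whether any truly terminal 409 exists is the open question of this issue.

```go
package main

import (
	"strings"
	"time"
)

// policy selects one of the three options discussed above.
type policy int

const (
	whitelist    policy = iota // terminal by default, known-retryable carved out
	blacklist                  // retryable by default, known-terminal carved out
	allRetryable               // every 409 is retried
)

// Placeholder patterns, matched case-insensitively against the error body.
var (
	knownRetryable = []string{"quota"}
	knownTerminal  = []string{"placeholder-terminal-pattern"} // hypothetical
)

func matchesAny(body string, patterns []string) bool {
	lower := strings.ToLower(body)
	for _, p := range patterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

// retryable409 reports whether a 409 with the given body is retried under p.
func retryable409(p policy, body string) bool {
	switch p {
	case whitelist:
		return matchesAny(body, knownRetryable)
	case blacklist:
		return !matchesAny(body, knownTerminal)
	default:
		return true
	}
}

// backoff returns base doubled once per attempt, capped at max, so that a
// conflict that never resolves is retried at a bounded rate.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}
```

A cap on the delay bounds the cost of options 2 and 3 when a conflict never clears, but it does not surface the problem to the user; that would still need a separate status or condition.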
