When should you stop adding retries and start redesigning a failing system? #993

iaversao7-sketch · 2026-02-08T12:31:57Z

In distributed systems, retries are often the first solution for transient failures.

At what point do retries stop being a reasonable mitigation and become a sign that the system needs a deeper redesign?
Are there practical indicators you look for before deciding to revisit the architecture instead of tuning retry logic?

Answered by DaviBonetto

Feb 8, 2026

Retries stop being helpful when they hide the root cause instead of buying time to fix it.

A few practical indicators:

Retries significantly increase tail latency or amplify load during partial outages.

Failures become correlated rather than isolated, causing retry storms.

Requests succeed only because of how many times they are retried, not because the underlying system is healthy.
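To make the load-amplification point concrete, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and the first function assumes each attempt fails independently with the same probability, which is exactly the assumption that breaks down once failures become correlated.

```python
def expected_attempts(failure_prob: float, max_attempts: int) -> float:
    """Expected attempts per request when each attempt fails independently
    with probability failure_prob and the caller gives up after max_attempts."""
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 is only made if the first k attempts all failed.
    return sum(failure_prob ** k for k in range(max_attempts))


def worst_case_amplification(attempts_per_layer: list[int]) -> int:
    """Upper bound on calls hitting the lowest service when every layer
    retries on its own and every attempt fails."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total
```

With a 50% failure rate and up to 3 attempts, `expected_attempts(0.5, 3)` gives 1.75, so the struggling dependency sees 75% more traffic at the moment it can least absorb it. Three stacked layers that each try 3 times can turn one user request into as many as 27 calls at the bottom.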

When retries shift from handling rare transient issues to being required for normal operation, that’s usually a signal that the system’s failure modes aren’t well understood.
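One way to notice that shift is to track how many successful requests needed at least one retry. The sketch below is only an illustration: `RetryStats` and its fields are hypothetical names, not part of any library, and where to set an alerting threshold is a judgment call for your own system.

```python
from dataclasses import dataclass


@dataclass
class RetryStats:
    """Counts outcomes so first-try successes and retried successes can be compared."""
    requests: int = 0
    first_try_successes: int = 0
    retried_successes: int = 0

    def record(self, attempts: int, succeeded: bool) -> None:
        """Record one finished request and how many attempts it took."""
        self.requests += 1
        if succeeded and attempts == 1:
            self.first_try_successes += 1
        elif succeeded:
            self.retried_successes += 1

    @property
    def retry_dependence(self) -> float:
        """Fraction of successful requests that only succeeded after a retry."""
        successes = self.first_try_successes + self.retried_successes
        return self.retried_successes / successes if successes else 0.0
```

If `retry_dependence` stays near zero, retries are covering rare transient faults. If it keeps climbing while the overall success rate looks fine, the retries are likely masking a problem rather than absorbing noise.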

At that point, redesigning for better isolation, backpressure, or clearer failure boundaries tends to yield more long-term value than further tuning retry parameters.
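As one possible illustration of backpressure on the client side, a retry budget caps retries as a fraction of regular traffic, so a struggling dependency is not hit with unbounded extra load. This is a minimal sketch under assumed names and defaults (`RetryBudget`, `requests_per_retry`, `max_banked_retries` are all hypothetical), not a prescription for any particular system.

```python
class RetryBudget:
    """Allows roughly one retry per `requests_per_retry` regular requests."""

    def __init__(self, requests_per_retry: int = 10, max_banked_retries: int = 5) -> None:
        if requests_per_retry < 1 or max_banked_retries < 1:
            raise ValueError("both parameters must be at least 1")
        self.cost = requests_per_retry
        # Cap the balance so a long quiet period cannot fund a retry storm.
        self.cap = requests_per_retry * max_banked_retries
        self.credits = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request; earns one credit."""
        self.credits = min(self.cap, self.credits + 1)

    def try_acquire_retry(self) -> bool:
        """Return True and spend credits if a retry is allowed right now."""
        if self.credits >= self.cost:
            self.credits -= self.cost
            return True
        return False
```

When `try_acquire_retry()` returns False, the caller fails fast instead of retrying, which surfaces the failure at a clear boundary rather than hiding it behind more attempts.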
