When should you stop adding retries and start redesigning a failing system? #993
-
|
In distributed systems, retries are often the first solution for transient failures. At what point do retries stop being a reasonable mitigation and become a sign that the system needs a deeper redesign? |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
|
Retries stop being helpful when they hide the root cause instead of buying time to fix it. A few practical indicators: Retries significantly increase tail latency or amplify load during partial outages. Failures become correlated rather than isolated, causing retry storms. Success depends on retry count rather than system health. When retries shift from handling rare transient issues to being required for normal operation, that’s usually a signal that the system’s failure modes aren’t well understood. At that point, redesigning for better isolation, backpressure, or clearer failure boundaries tends to yield more long-term value than further tuning retry parameters. |
Beta Was this translation helpful? Give feedback.
Retries stop being helpful when they hide the root cause instead of buying time to fix it.
A few practical indicators:
Retries significantly increase tail latency or amplify load during partial outages.
Failures become correlated rather than isolated, causing retry storms.
Success depends on retry count rather than system health.
When retries shift from handling rare transient issues to being required for normal operation, that’s usually a signal that the system’s failure modes aren’t well understood.
At that point, redesigning for better isolation, backpressure, or clearer failure boundaries tends to yield more long-term value than further tuning retry parameters.