fix: return node evaluation errors to trigger reconcile retries#229
fix: return node evaluation errors to trigger reconcile retries#229Nesar976 wants to merge 3 commits intokubernetes-sigs:mainfrom
Conversation
Signed-off-by: Nesar976 <kavrinesar@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Nesar976 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
✅ Deploy Preview for node-readiness-controller canceled.
|
|
Welcome @Nesar976! |
|
Hi @Nesar976. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
sujalshah-bit
left a comment
There was a problem hiding this comment.
Thanks for the PR @Nesar976 .
I think returning aggregated errors here improves convergence semantics because previously transient failures (patch conflicts, API timeouts, status update failures, etc.) could get logged and then effectively lost without another reconcile trigger.
One concern though is that we're currently aggregating all errors:
errs = append(errs, err)
...
return errors.Join(errs...)Since Reconcile() returns that error directly, controller-runtime will requeue the node reconcile for any failure, including potentially non-retryable/business-logic errors.
For example:
- transient API/conflict errors → retry makes sense
- invalid rule configuration / unsupported evaluation state → retry likely won't help
With the current approach, a permanently failing rule could keep causing unnecessary reconcile retries for the node.
Would it make sense to distinguish retryable/system errors from expected rule evaluation failures and only return the retryable ones while still recording all failures in status/events/metrics?
Also, sign CLA :)
Co-authored-by: Sujal Shah <73663475+sujalshah-bit@users.noreply.github.com>
Co-authored-by: Sujal Shah <73663475+sujalshah-bit@users.noreply.github.com>
|
/ok-to-test |
|
looks like this PR is duplicate to #222 |
This change ensures that node evaluation failures are propagated back to the reconcile loop instead of being silently ignored.
Previously, transient failures during rule evaluation or taint patching were only logged internally, preventing controller-runtime from requeueing the reconcile request.
This PR:
Related Issue
Fixes #220
Type of Change
/kind bug