OCPBUGS-77773: fix backend server health check for DCM enabled #746
jcmoraisjr wants to merge 1 commit into openshift:master
Conversation
Without DCM, the router configures single-replica backends without a health check. This saves a bit of CPU and network I/O, and with a single replica the check adds little value: if that one replica fails its health check, the service is disrupted either way. This works without DCM because whenever a scale-out happens, haproxy is reloaded with a new configuration that enables the health check on all replicas, including the first one.

This does not work with DCM because the code simply adds the new replica to an empty slot, ignoring the status of the other ones. In the end the new replica has the health check enabled while the first one still has it disabled.

Options to change this behavior include adding API calls to enable the check on the first replica, or always enabling it by default when DCM is enabled. The difficulty with the former is the amount of change needed in the codebase: identifying the first replica without a health check, plus del+add API calls we don't use (yet!). The latter seems the natural path since it applies only when DCM is enabled, and it changes behavior only for single-replica deployments, which should be the exception. Moreover, the benefit of having the health check enabled in a simple way far outweighs the small increase in CPU and network usage.

That said, there are two distinct changes in this PR:

* Removing the active-endpoint criteria, so the health check is now always enabled when DCM is enabled;
* Adding the `inter` keyword to the server-template, so a dynamically added server uses the same health check interval as the other replicas.

https://issues.redhat.com/browse/OCPBUGS-77773
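A minimal sketch of the kind of rendered backend the second change targets. The backend name, address, slot count, and interval below are illustrative, not taken from the actual router templates; the point is that `inter` on the `server-template` line also applies to servers DCM fills in at runtime:

```haproxy
backend be_http:demo:hello-openshift
  # With this PR, "check" is emitted even for a single-replica backend
  # when DCM is enabled, so the first slot behaves like later ones.
  # server-template pre-allocates slots 1-4; "inter 5s" makes any server
  # added dynamically into an empty slot reuse the same check interval.
  server-template pod 1-4 172.30.0.1:8080 check inter 5s weight 1
```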
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773, which is valid. The bug has been moved to the POST state. 3 validations were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/test e2e-aws-serial-2of2
@jcmoraisjr: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/close

Closing on behalf of #747
@jcmoraisjr: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773. The bug has been updated to no longer refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.