OCPBUGS-77773: fix backend server health check if DCM is enabled #747
jcmoraisjr wants to merge 1 commit into openshift:master
Conversation
@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773, which is valid. 3 validations were run on this bug.
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 9f0feb6 to 1a6e021
Actionable comments posted: 1
Inline comments:
In `@pkg/router/template/configmanager/haproxy/manager.go`:
- Around lines 513-526: the current branch that forces a reload when `len(newEndpoints) > len(oldEndpoints)` incorrectly triggers on 0→1 recoveries. Change the condition so it only considers additions when there was at least one previous endpoint (e.g., require `len(oldEndpoints) > 0`) before iterating servers and applying the staticCount/isDynamicBackendServer check for backendName. This limits the single-endpoint health-check reload logic to true single→multi transitions rather than 0→1 recoveries.
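The review's point can be sketched as a standalone predicate. This is a simplified, hypothetical illustration; the real check in `manager.go` operates on endpoint slices and also inspects static vs. dynamic backend servers:

```go
package main

import "fmt"

// needsReloadForHealthCheck reports whether a scale-out must fall back to a
// full HAProxy reload so that the previously single (health-check-disabled)
// server gets its health check re-enabled. Requiring oldCount > 0 excludes
// 0->1 recoveries, where no previously configured server exists whose health
// check could be stale. Hypothetical helper, not the router's real signature.
func needsReloadForHealthCheck(oldCount, newCount int) bool {
	return oldCount > 0 && oldCount == 1 && newCount > oldCount
}

func main() {
	fmt.Println(needsReloadForHealthCheck(1, 2)) // 1->2 scale-out: reload
	fmt.Println(needsReloadForHealthCheck(0, 1)) // 0->1 recovery: no reload
	fmt.Println(needsReloadForHealthCheck(2, 3)) // all servers already checked
}
```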
Files selected for processing (2): images/router/haproxy/conf/haproxy-config.template, pkg/router/template/configmanager/haproxy/manager.go
/assign @Thealisyed
Thealisyed left a comment:
Does the current integration test coverage exercise the 1 → 2 scale-out path with DCM? Is it worth adding that test coverage anyway?

Hi @Thealisyed, this is a good point. I still have openshift/origin#30741 pending on some PRs that need to merge first; I'll add a specific test for this scenario as well and cc you from there.
Tested with 4.22.0-0-2026-03-13-061337-test-ci-ln-2qjtwd2-latest: when the replicas were scaled up from 1 to 2, `check inter 5000ms` was added to the server pod entries in haproxy.config as expected, and it was also added to the server-template `_dynamic-pod`.
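For reference, the regenerated backend after the 1 → 2 scale-out should look roughly like this. Server names, addresses, and the template size are illustrative; the 5000ms interval is the router's default health check interval:

```
backend be_http:demo:st1
  server pod:st1-abc:servicest1:10.128.0.10:8080 10.128.0.10:8080 check inter 5000ms
  server pod:st1-def:servicest1:10.129.0.11:8080 10.129.0.11:8080 check inter 5000ms
  server-template _dynamic-pod- 1-5 172.4.0.4:8765 check inter 5000ms disabled
```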
/verified by @ShudiLi

@ShudiLi: This PR has been marked as verified.
Thealisyed left a comment:
/lgtm
Thanks! Left the approval tag for Grant :)
```diff
 {{- with $size := $dynamicConfigManager.ServerTemplateSize $cfgIdx }}
   dynamic-cookie-key {{ $cfg.RoutingKeyName }}
-  server-template {{ $name }}- 1-{{ $size }} 172.4.0.4:8765 check disabled
+  server-template {{ $name }}- 1-{{ $size }} 172.4.0.4:8765 check inter {{ $health_check_interval }} disabled
```
Did you consider adding a unit test for this? I know unit testing for DCM is not comprehensive yet, but it seems like it would fit in TestConfigTemplate in pkg/router/router_test.go:
```go
"server-template with default health check interval": {
	mustCreateWithConfig{
		mustCreateEndpointSlices: []mustCreateEndpointSlice{
			{
				name:        "servicest2",
				serviceName: "servicest2",
				addresses:   []string{"1.1.1.1"},
			},
		},
		mustCreateRoute: mustCreateRoute{
			name:              "st2",
			host:              "st2example.com",
			targetServiceName: "servicest2",
			weight:            1,
			time:              start,
		},
		mustMatchConfig: mustMatchConfig{
			section:     "backend",
			sectionName: insecureBackendName(h.namespace, "st2"),
			attribute:   "server-template",
			value:       "inter 5000ms", // Default value
		},
	},
},
"server-template with custom health check interval": {
	mustCreateWithConfig{
		mustCreateEndpointSlices: []mustCreateEndpointSlice{
			{
				name:        "servicest1",
				serviceName: "servicest1",
				addresses:   []string{"1.1.1.1"}, // Single endpoint initially
			},
		},
		mustCreateRoute: mustCreateRoute{
			name:              "st1",
			host:              "st1example.com",
			targetServiceName: "servicest1",
			weight:            1,
			time:              start,
			annotations: map[string]string{
				"router.openshift.io/haproxy.health.check.interval": "15s",
			},
		},
		mustMatchConfig: mustMatchConfig{
			section:     "backend",
			sectionName: insecureBackendName(h.namespace, "st1"),
			attribute:   "server-template",
			value:       "inter 15s",
		},
	},
},
```
Then support for ServerTemplate would need to be added to matchConfig:
```go
case []haproxyconfparsertypes.ServerTemplate:
	if m.fullMatch {
		for _, a := range data {
			params := ""
			for _, p := range a.Params {
				params += " " + p.String()
			}
			fullValue := a.Prefix + " " + a.NumOrRange + " " + a.Fqdn + params
			if fullValue == m.value {
				contains = true
				break
			}
		}
	} else {
		for _, a := range data {
			for _, b := range a.Params {
				contains = contains || b.String() == m.value
			}
		}
	}
```
If you have unit testing planned in future work - feel free to ignore.
Your suggestion makes sense. Maybe things will change in a way that makes the test completely outdated, but until that happens the functionality is better covered.
Also, this was an opportunity to better understand how the router tests the template, and it was a good experience: I took your first snippet, understood its intent, ran it, it failed, I debugged and fixed it, and only then realized I had read only half of your message. Btw I've just implemented the "not fullMatch" branch in my change. Learning every day 😅
Generally LGTM - I reviewed the problem and the fix, and it seems like a simple fix until we get something more elegant in place. I'll approve once @jcmoraisjr gets a chance to respond to my comment about unit testing.
Without DCM, the router configures single-replica backends without a health check. This saves a bit of CPU and network IO, since health-checking a single replica adds little value: a single failing replica will disrupt the service whether or not it is health checked. This works without DCM because whenever a scale-out happens, haproxy is reloaded with a new configuration enabling health checks on all the replicas, including the first one.

This is not working with DCM because the code simply adds the new replica in an empty slot, ignoring the status of the other ones. In the end the new replica has its health check enabled while the first one continues with its health check disabled.

The approach used here is to skip DCM when detecting this scenario. This has the advantage of not changing any current behavior; on the other hand, a DCM scenario is being disabled for now. This is going to be revisited after 4.22 via https://issues.redhat.com/browse/NE-2496

That said, there are two distinct changes happening in this PR:

* Skipping the dynamic change when scaling out from a single replica
* Adding the `inter` keyword to the server-template, so the dynamically added server uses the same health check interval as the other replicas

https://issues.redhat.com/browse/OCPBUGS-77773
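The second change above is a one-keyword template edit; its effect can be sketched with Go's text/template, which is what the router's config template uses. Field names and values here are illustrative, not the router's real template variables:

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderServerTemplate mimics the changed haproxy-config.template line: the
// server-template now carries the same "inter" interval as the static server
// lines, so a slot that DCM fills later inherits the backend's health check
// interval instead of HAProxy's default.
func renderServerTemplate(name string, size int, interval string) string {
	const line = "server-template {{.Name}}- 1-{{.Size}} 172.4.0.4:8765 check inter {{.Interval}} disabled"
	t := template.Must(template.New("srv").Parse(line))
	var b strings.Builder
	_ = t.Execute(&b, map[string]any{"Name": name, "Size": size, "Interval": interval})
	return b.String()
}

func main() {
	// Renders the disabled template slot with the backend's own interval.
	fmt.Println(renderServerTemplate("_dynamic-pod", 5, "5000ms"))
}
```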
Force-pushed from 1a6e021 to bb8f133
@jcmoraisjr: all tests passed!
Thanks for the unit tests @jcmoraisjr! I think this is a step forward in getting DCM stable/usable. @Thealisyed feel free to LGTM again, but it was just unit tests that were added. /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gcs278.