OCPBUGS-77773: fix backend server health check if DCM is enabled#747

Open
jcmoraisjr wants to merge 1 commit into openshift:master from jcmoraisjr:OCPBUGS-77773-single-replica-reload

Conversation

@jcmoraisjr
Member

@jcmoraisjr jcmoraisjr commented Mar 4, 2026

Without DCM, the router configures single-replica backends without a health check. This saves a bit of CPU and network I/O: with only one replica, a health check cannot prevent disruption, since a sole failing replica will disrupt the service either way. This works without DCM because whenever a scale-out happens, haproxy is reloaded with a new configuration that enables the health check on all replicas, including the first one.

This does not work with DCM because the code simply adds the new replica to an empty slot, ignoring the status of the other ones. As a result, the new replica has the health check enabled while the first one still has it disabled.

The approach used here is to skip DCM when this scenario is detected. This has the advantage of not changing any current behavior; on the other hand, a DCM scenario is being disabled for now. This will be revisited after 4.22 via https://issues.redhat.com/browse/NE-2496

That said, there are two distinct changes in this PR:

  • Skipping the dynamic change when scaling out from a single replica
  • Adding the inter keyword to the server-template line, so the dynamically added server uses the same health check interval as the other replicas.

https://issues.redhat.com/browse/OCPBUGS-77773

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Mar 4, 2026
@openshift-ci-robot
Contributor

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 4, 2026
@jcmoraisjr jcmoraisjr force-pushed the OCPBUGS-77773-single-replica-reload branch from 9f0feb6 to 1a6e021 on March 9, 2026 20:09
@coderabbitai

coderabbitai bot commented Mar 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7f1a9a7c-2967-4f4c-838e-ebdf958d3cc0

📥 Commits

Reviewing files that changed from the base of the PR and between 1a6e021 and bb8f133.

📒 Files selected for processing (3)
  • images/router/haproxy/conf/haproxy-config.template
  • pkg/router/router_test.go
  • pkg/router/template/configmanager/haproxy/manager.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • images/router/haproxy/conf/haproxy-config.template

Walkthrough

Computes and reuses a per-backend health_check_interval in the HAProxy template; adds a guard in the endpoint manager preventing dynamic updates when adding endpoints to a backend with exactly one static server; extends tests to validate server-template health-check inter output and matching logic.

Changes

  • HAProxy Configuration Template (images/router/haproxy/conf/haproxy-config.template): Introduces a single per-backend health_check_interval variable (computed from annotation/env with clipHAProxyTimeoutValue) and replaces repeated inline expressions with the variable for all check inter directives (regular backends, passthrough, and server-template outputs).
  • Endpoint Manager Validation (pkg/router/template/configmanager/haproxy/manager.go): Adds an early guard in ReplaceRouteEndpoints: when adding endpoints (newEndpoints longer than oldEndpoints) and the backend currently has exactly one non-dynamic server, the function returns an error to require a reload rather than applying dynamic updates.
  • Tests / Matching Logic (pkg/router/router_test.go): Adds test cases asserting default and annotated server-template health-check inter values and extends matchConfig to compare parsed []haproxyconfparsertypes.ServerTemplate entries (full-string match and parameter matching).
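Going by that summary, the template-side change presumably reduces to something like the sketch below. The variable name and clipHAProxyTimeoutValue come from the summary, the annotation key from the test snippet later in this thread, and the server-template line from the review diff; the lookup expression, $serverName, and $address are assumptions, not the actual template text:

```
{{- /* illustrative sketch, not the real haproxy-config.template */ -}}
{{- $health_check_interval := clipHAProxyTimeoutValue (index $cfg.Annotations "router.openshift.io/haproxy.health.check.interval") }}
  server {{ $serverName }} {{ $address }} weight 1 check inter {{ $health_check_interval }}
  server-template {{ $name }}- 1-{{ $size }} 172.4.0.4:8765 check inter {{ $health_check_interval }} disabled
```

The point of hoisting the interval into one variable is that the static server lines and the dynamic server-template line can no longer drift apart.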

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly references the specific issue (OCPBUGS-77773) and accurately describes the main purpose: fixing backend server health check behavior when DCM is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Stable And Deterministic Test Names: ✅ Passed. The pull request does not contain any Ginkgo test files or modifications to existing tests; changes are limited to configuration template and manager logic files.
  • Test Structure And Quality: ✅ Passed. This pull request does not contain any Ginkgo test code changes; modifications are limited to a HAProxy configuration template and a Go manager source file, neither of which are test files.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/router/template/configmanager/haproxy/manager.go`:
- Around line 513-526: The current branch that forces a reload when
len(newEndpoints) > len(oldEndpoints) incorrectly triggers on 0→1 recoveries;
change the condition so it only considers additions when there was at least one
previous endpoint (e.g., require len(oldEndpoints) > 0) before iterating servers
and applying the staticCount/isDynamicBackendServer check for backendName; this
limits the single-endpoint health-check reload logic to true single→multi
transitions rather than 0→1 recoveries.
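The narrowed condition the bot describes can be sketched as a standalone function. Everything below is illustrative: Endpoint, backendServer, and needsReload are simplified stand-ins for the real types and logic in ReplaceRouteEndpoints, not the actual code:

```go
package main

import "fmt"

// Endpoint and backendServer are simplified stand-ins for the real
// types in pkg/router/template/configmanager/haproxy; names and shapes
// here are illustrative only.
type Endpoint struct {
	IP string
}

type backendServer struct {
	Name      string
	IsDynamic bool
}

// needsReload sketches the narrowed guard: force a reload only when
// endpoints are being added, the backend had at least one endpoint
// before (so 0->1 recoveries stay on the dynamic path), and exactly
// one non-dynamic server is configured, i.e. the single replica whose
// health check is still disabled.
func needsReload(oldEndpoints, newEndpoints []Endpoint, servers []backendServer) bool {
	if len(newEndpoints) <= len(oldEndpoints) || len(oldEndpoints) == 0 {
		return false
	}
	staticCount := 0
	for _, s := range servers {
		if !s.IsDynamic {
			staticCount++
		}
	}
	return staticCount == 1
}

func main() {
	servers := []backendServer{
		{Name: "pod:app-1", IsDynamic: false},
		{Name: "_dynamic-pod-1", IsDynamic: true},
	}
	// 1 -> 2 scale-out with a single static server: reload required.
	fmt.Println(needsReload(
		[]Endpoint{{IP: "10.0.0.1"}},
		[]Endpoint{{IP: "10.0.0.1"}, {IP: "10.0.0.2"}},
		servers))
	// 0 -> 1 recovery: stays on the dynamic path.
	fmt.Println(needsReload(
		nil,
		[]Endpoint{{IP: "10.0.0.1"}},
		servers))
}
```

This prints true then false, matching the review's intent: only true single-to-multi transitions skip the dynamic path.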

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ba5be08-7415-484f-a7bf-d842c835c717

📥 Commits

Reviewing files that changed from the base of the PR and between b3414b2 and 1a6e021.

📒 Files selected for processing (2)
  • images/router/haproxy/conf/haproxy-config.template
  • pkg/router/template/configmanager/haproxy/manager.go

@alebedev87
Contributor

/assign @Thealisyed
/assign @gcs278


@Thealisyed Thealisyed left a comment


Does the current integration test coverage exercise the 1→2 scale-out path with DCM? Is it worth adding that test coverage anyway?

@jcmoraisjr
Member Author

Hi @Thealisyed, this is a good point. I still have openshift/origin#30741 pending because some prerequisite PRs have not merged yet; I'll add a specific test for this scenario as well and cc you from there.

@ShudiLi
Member

ShudiLi commented Mar 13, 2026

Tested with 4.22.0-0-2026-03-13-061337-test-ci-ln-2qjtwd2-latest: when the replicas were scaled up from 1 to 2, check inter 5000ms was added to the server pod lines in haproxy.config as expected, and it was also added to the server-template _dynamic-pod line.

1. With 1 replica:
 cookie b94bb237dc742029fe83e6d395082b86 insert indirect nocache httponly dynamic
  server pod:appach-server-59c457d9d4-sf4zq:unsec-apach:unsec-apach:10.129.2.18:8080 10.129.2.18:8080 cookie fcd6b1516c6d1198f367e4a9286c9c51 weight 1
  dynamic-cookie-key b94bb237dc742029fe83e6d395082b86
  server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter 5000ms disabled

2. With 2 replicas:
  cookie b94bb237dc742029fe83e6d395082b86 insert indirect nocache httponly dynamic
  server pod:appach-server-59c457d9d4-p84sz:unsec-apach:unsec-apach:10.128.2.10:8080 10.128.2.10:8080 cookie 64fb594ed721f91956579b9b337ae61f weight 1 check inter 5000ms
  server pod:appach-server-59c457d9d4-sf4zq:unsec-apach:unsec-apach:10.129.2.18:8080 10.129.2.18:8080 cookie fcd6b1516c6d1198f367e4a9286c9c51 weight 1 check inter 5000ms
  dynamic-cookie-key b94bb237dc742029fe83e6d395082b86
  server-template _dynamic-pod- 1-1 172.4.0.4:8765 check inter 5000ms disabled

@ShudiLi
Member

ShudiLi commented Mar 13, 2026

/verified by @ShudiLi

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 13, 2026
@openshift-ci-robot
Contributor

@ShudiLi: This PR has been marked as verified by @ShudiLi.

Details

In response to this:

/verified by @ShudiLi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@Thealisyed Thealisyed left a comment


/lgtm
Thanks! Left the approval tag for Grant :)

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2026
{{- with $size := $dynamicConfigManager.ServerTemplateSize $cfgIdx }}
dynamic-cookie-key {{ $cfg.RoutingKeyName }}
server-template {{ $name }}- 1-{{ $size }} 172.4.0.4:8765 check disabled
server-template {{ $name }}- 1-{{ $size }} 172.4.0.4:8765 check inter {{ $health_check_interval }} disabled
Contributor


Did you consider adding a unit test for this? I know unit testing for DCM is not comprehensive yet, but it seems like this would fit in TestConfigTemplate in pkg/router/router_test.go:

		"server-template with default health check interval": {
			mustCreateWithConfig{
				mustCreateEndpointSlices: []mustCreateEndpointSlice{
					{
						name:        "servicest2",
						serviceName: "servicest2",
						addresses:   []string{"1.1.1.1"},
					},
				},
				mustCreateRoute: mustCreateRoute{
					name:              "st2",
					host:              "st2example.com",
					targetServiceName: "servicest2",
					weight:            1,
					time:              start,
				},
				mustMatchConfig: mustMatchConfig{
					section:     "backend",
					sectionName: insecureBackendName(h.namespace, "st2"),
					attribute:   "server-template",
					value:       "inter 5000ms", // Default value
				},
			},
		},
    "server-template with custom health check interval": {                                                                                                                                      
        mustCreateWithConfig{                                                                                                                                                                 
                mustCreateEndpointSlices: []mustCreateEndpointSlice{                                                                                                                          
                        {                                                                                                                                                                     
                                name:        "servicest1",                                                                                                                                    
                                serviceName: "servicest1",                                                                                                                                    
                                addresses:   []string{"1.1.1.1"},  // Single endpoint initially                                                                                               
                        },                                                                                                                                                                    
                },                                                                                                                                                                            
                mustCreateRoute: mustCreateRoute{                                                                                                                                             
                        name:              "st1",                                                                                                                                             
                        host:              "st1example.com",                                                                                                                                  
                        targetServiceName: "servicest1",                                                                                                                                      
                        weight:            1,                                                                                                                                                 
                        time:              start,                                                                                                                                             
                        annotations: map[string]string{                                                                                                                                       
                                "router.openshift.io/haproxy.health.check.interval": "15s",                                                                                                   
                        },                                                                                                                                                                    
                },                                                                                                                                                                            
                mustMatchConfig: mustMatchConfig{                                                                                                                                             
                        section:     "backend",                                                                                                                                               
                        sectionName: insecureBackendName(h.namespace, "st1"),                                                                                                                 
                        attribute:   "server-template",                                                                                                                                       
                        value:       "inter 15s",                                                                                                                                             
                },                                                                                                                                                                            
        },                                                                                                                                                                                    
  },

Then support for ServerTemplate would need to be added to matchConfig:

	case []haproxyconfparsertypes.ServerTemplate:
		if m.fullMatch {
			for _, a := range data {
				params := ""
				for _, p := range a.Params {
					params += " " + p.String()
				}
				fullValue := a.Prefix + " " + a.NumOrRange + " " + a.Fqdn + params
				if fullValue == m.value {
					contains = true
					break
				}
			}
		} else {
			for _, a := range data {
				for _, b := range a.Params {
					contains = contains || b.String() == m.value
				}
			}
		}

If you have unit testing planned in future work - feel free to ignore.

Member Author


Your suggestion makes sense. Maybe things change in a way that makes the test completely outdated, but until that happens the functionality is better covered.

Also, this was an opportunity to better understand how the router tests the template, and it was a good experience: I just used your first snippet, understood its intent, ran it, it failed, I debugged and fixed it, and only then realized I had read only half of your message. Btw I've just implemented the "not fullMatch" branch in my change. Learning every day 😅

@gcs278
Contributor

gcs278 commented Mar 24, 2026

Generally LGTM. I reviewed the problem and the fix, and it seems like a simple fix until we get something more elegant in place.

I'll approve once @jcmoraisjr gets a chance to respond to my comment about unit testing.

Without DCM, the router configures single-replica backends without a
health check. This saves a bit of cpu and network io: with only one
replica, a health check cannot prevent disruption, since a sole failing
replica will disrupt the service either way. This works without DCM
because whenever a scale-out happens, haproxy is reloaded with a new
configuration that enables the health check on all replicas, including
the first one.

This does not work with DCM because the code simply adds the new
replica to an empty slot, ignoring the status of the other ones. As a
result, the new replica has the health check enabled while the first
one still has it disabled.

The approach used here is to skip DCM when this scenario is detected.
This has the advantage of not changing any current behavior; on the
other hand, a DCM scenario is being disabled for now. This will be
revisited after 4.22 via https://issues.redhat.com/browse/NE-2496

That said, there are two distinct changes in this PR:

* Skipping the dynamic change when scaling out from a single replica
* Adding the `inter` keyword in the server-template, so the dynamically
  added server will use the same health check interval as the other
  replicas.

https://issues.redhat.com/browse/OCPBUGS-77773
@jcmoraisjr jcmoraisjr force-pushed the OCPBUGS-77773-single-replica-reload branch from 1a6e021 to bb8f133 on March 24, 2026 17:26
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Mar 24, 2026
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 24, 2026
@openshift-ci-robot
Contributor

@jcmoraisjr: This pull request references Jira Issue OCPBUGS-77773, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

Details

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci bot commented Mar 24, 2026

@jcmoraisjr: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gcs278
Contributor

gcs278 commented Mar 24, 2026

Thanks for the unit tests @jcmoraisjr! I think this is a step forward in getting DCM stable/usable.

@Thealisyed feel free to LGTM again, but it was just unit tests that were added.

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 24, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of the changed files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2026