OCPBUGS-77839: Fix node degrades due to file and OS update failures #5744

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from isabella-janssen:ocpbugs-77839
Mar 7, 2026
@isabella-janssen
Member

@isabella-janssen isabella-janssen commented Mar 5, 2026

Closes: OCPBUGS-77839

- What I did
This updates the err variable assignment so it is not inaccurately reset on file and OS update failures.

- How to verify it
Manually:
In a tech preview cluster, MCPs should properly degrade when an upgrade fails on a file or OS update.

With payloads:
The "Should properly report MCN conditions on node degrade" test should pass when running the MCO's tech preview disruptive suite.

- Description for the changelog
OCPBUGS-77839: Fix node degrades due to file and OS update failures
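
The failure mode described under "What I did" can be illustrated with a minimal sketch. This is not the MCO code: `errWriteFailed`, `reportStatus`, `updateBuggy`, and `updateFixed` are hypothetical names chosen for illustration, and the sketch assumes the reporting call succeeds (returns nil) while the original update has failed.

```go
package main

import (
	"errors"
	"fmt"
)

// errWriteFailed is a hypothetical stand-in for the original file or OS update failure.
var errWriteFailed = errors.New("failed to write file")

// reportStatus is a hypothetical stand-in for a status-reporting call.
// In this sketch the report itself succeeds, so it returns nil.
func reportStatus(cause error) error {
	_ = cause
	return nil
}

// updateBuggy reuses err for the reporting call, so a successful report
// overwrites the original failure with nil and the caller sees success.
func updateBuggy() error {
	err := errWriteFailed
	if err != nil {
		err = reportStatus(err)
		return err
	}
	return nil
}

// updateFixed stores the reporting result in its own variable, logs it if
// needed, and returns the original failure unchanged.
func updateFixed() error {
	err := errWriteFailed
	if err != nil {
		if reportErr := reportStatus(err); reportErr != nil {
			fmt.Printf("error reporting status: %v\n", reportErr)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println("buggy:", updateBuggy())
	fmt.Println("fixed:", updateFixed())
}
```

In the buggy variant the caller receives nil, which is the kind of lost failure that would keep a node from degrading.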

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2026
@isabella-janssen
Member Author

/payload-job periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2 periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-1of2
  • periodic-ci-openshift-machine-config-operator-release-4.21-periodics-e2e-aws-mco-disruptive-techpreview-2of2

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1f3b6c80-18a3-11f1-84c1-ae92b5479e4b-0

@isabella-janssen
Member Author

/payload-job periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-1of3 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3 periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-3of3

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-1of3
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3
  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-3of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0175f40-18a3-11f1-8d17-3e721c3f0d52-0

@isabella-janssen isabella-janssen changed the title (WIP) OCPBUGS-77839: Fix node degrades due to file and OS update failures Mar 5, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 5, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-77839, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Closes: OCPBUGS-77839

- What I did
This updates the err variable assignment so it is not inaccurately reset on file and OS update failures.

- How to verify it
The "Should properly report MCN conditions on node degrade" test should pass when running locally against a tech preview cluster and in the MCO's tech preview disruptive suite.

- Description for the changelog
OCPBUGS-77839: Fix node degrades due to file and OS update failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 5, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-77839, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

@isabella-janssen
Member Author

/payload-with-prs periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3 openshift/origin#30841

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@isabella-janssen
Member Author

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3 openshift/origin#30841

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f4504140-18ba-11f1-8b71-b09bebb82348-0

@isabella-janssen
Member Author

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3 openshift/origin#30841

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7cc698d0-18c5-11f1-8a6e-27135da56f0c-0

@isabella-janssen
Member Author

/payload-job-with-prs periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3 openshift/origin#30841

@openshift-ci
Contributor

openshift-ci bot commented Mar 5, 2026

@isabella-janssen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-machine-config-operator-release-4.22-periodics-e2e-aws-mco-disruptive-techpreview-2of3

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5cb34500-18d1-11f1-8d75-08a9c4281b29-0

@isabella-janssen isabella-janssen marked this pull request as ready for review March 5, 2026 20:35
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 5, 2026
@isabella-janssen
Member Author

/verified by payloads

See #5744 (comment)

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 6, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This PR has been marked as verified by payloads.


@isabella-janssen
Member Author

/retest-required

  // When ImageModeStatusReporting is enabled, update the `MachineConfigNodeUpdateFiles` condition to report the experienced error
  if imageModeStatusReportingEnabled {
-     err = upgrademonitor.GenerateAndApplyMachineConfigNodes(
+     mcnErr := upgrademonitor.GenerateAndApplyMachineConfigNodes(
Contributor


This introduces a change I'm not sure we want. Do we really want to ignore errors from GenerateAndApplyMachineConfigNodes? If so, this is fine. But if we want the call to updateFiles to fail and report to the caller without losing the original error, I'd say it's better to do this:

if mcnErr != nil {
  err = errors.Join(err, fmt.Errorf("error making MCN for failed file apply: %w", mcnErr))
}
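
As a minimal, self-contained sketch of the `errors.Join` approach suggested above (requires Go 1.20 or later): both the original failure and the reporting failure stay discoverable with `errors.Is`. The names `errFileApply`, `errReport`, `joinFailures`, and `containsBoth` are hypothetical and do not come from the MCO code base.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for the two failures.
var (
	errFileApply = errors.New("file apply failed")
	errReport    = errors.New("status report failed")
)

// joinFailures returns the original error, joined with the reporting
// error when there is one, so that neither is lost.
func joinFailures(original, reportErr error) error {
	if reportErr != nil {
		return errors.Join(original, fmt.Errorf("error reporting failed file apply: %w", reportErr))
	}
	return original
}

// containsBoth reports whether err matches both sentinel errors.
func containsBoth(err error) bool {
	return errors.Is(err, errFileApply) && errors.Is(err, errReport)
}

func main() {
	joined := joinFailures(errFileApply, errReport)
	fmt.Println(containsBoth(joined))
	fmt.Println(joined)
}
```

The trade-off is that a reporting failure becomes part of the returned error, whereas logging it keeps the returned error identical to the original failure.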

Member Author


Unless I'm misunderstanding your concern, this change should better match the existing MCN failure case. Here's a similar existing situation:

if reconcilableError != nil {
	Nerr := upgrademonitor.GenerateAndApplyMachineConfigNodes(
		&upgrademonitor.Condition{State: mcfgv1.MachineConfigNodeUpdatePrepared, Reason: string(mcfgv1.MachineConfigNodeUpdatePrepared), Message: fmt.Sprintf("Update Failed compatibility validation. MachineConfigs %v and %v are not compatible. Err: %s", oldConfigName, newConfigName, reconcilableError.Error())},
		nil,
		metav1.ConditionUnknown,
		metav1.ConditionFalse,
		dn.node,
		dn.mcfgClient,
		dn.fgHandler,
		pool,
	)
	if Nerr != nil {
		klog.Errorf("Error making MCN for Preparing update failed: %v", err)
	}
	wrappedErr := fmt.Errorf("can't reconcile config %s with %s: %w", oldConfigName, newConfigName, reconcilableError)
	if dn.nodeWriter != nil {
		dn.nodeWriter.Eventf(corev1.EventTypeWarning, "FailedToReconcile", wrappedErr.Error())
	}
	return &unreconcilableErr{wrappedErr}
}

The example logs the MCN apply failure (without treating it as fatal) and then returns only the reconcilableError that triggered the MCN update. The new change follows the same pattern.
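
The log-and-return pattern described here can be sketched in isolation as follows. This is a simplified illustration, not the daemon's code: the `unreconcilable` type, `reportCondition`, `reconcile`, and `isUnreconcilable` are hypothetical, and the reporter is assumed to always fail so that the logging path is exercised.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// unreconcilable is a hypothetical wrapper type marking an error as one
// that cannot be reconciled.
type unreconcilable struct{ err error }

func (e *unreconcilable) Error() string { return e.err.Error() }
func (e *unreconcilable) Unwrap() error { return e.err }

// errIncompatible is a hypothetical stand-in for the triggering failure.
var errIncompatible = errors.New("configs are not compatible")

// reportCondition is a hypothetical best-effort reporter. It always fails
// in this sketch so that the logging branch runs.
func reportCondition(msg string) error {
	return fmt.Errorf("cannot report %q", msg)
}

// reconcile logs a reporting failure without treating it as fatal, then
// returns only the wrapped error that triggered the report.
func reconcile(oldName, newName string) error {
	cause := errIncompatible
	if reportErr := reportCondition(cause.Error()); reportErr != nil {
		log.Printf("error reporting failed update: %v", reportErr)
	}
	wrapped := fmt.Errorf("can't reconcile config %s with %s: %w", oldName, newName, cause)
	return &unreconcilable{wrapped}
}

// isUnreconcilable reports whether err is an *unreconcilable that still
// wraps the original triggering error.
func isUnreconcilable(err error) bool {
	var u *unreconcilable
	return errors.As(err, &u) && errors.Is(err, errIncompatible)
}

func main() {
	err := reconcile("rendered-a", "rendered-b")
	fmt.Println(isUnreconcilable(err))
	fmt.Println(err)
}
```

Because the reporting error is only logged, callers that inspect the returned error see the triggering failure and nothing else.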

Contributor


Ah, OK, so we are already just logging those errors. I'm fine with it in that case.

@pablintino
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen, pablintino

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [isabella-janssen,pablintino]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@isabella-janssen
Member Author

/test unit

@openshift-ci
Contributor

openshift-ci bot commented Mar 6, 2026

@isabella-janssen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-gcp-op-ocl-part2 | 88c2def | link | false | /test e2e-gcp-op-ocl-part2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@isabella-janssen
Member Author

/retest-required

@openshift-merge-bot openshift-merge-bot bot merged commit d5dfdcf into openshift:main Mar 7, 2026
16 of 17 checks passed
@openshift-ci-robot
Contributor

@isabella-janssen: Jira Issue OCPBUGS-77839: Some pull requests linked via external trackers have merged:

The following pull request, linked via external tracker, has not merged:

All associated pull requests must be merged or unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-77839 has not been moved to the MODIFIED state.

This PR is marked as verified. If the remaining PRs listed above are marked as verified before merging, the issue will automatically be moved to VERIFIED after all of the changes from the PRs are available in an accepted nightly payload.


@isabella-janssen isabella-janssen deleted the ocpbugs-77839 branch March 8, 2026 16:01
@openshift-merge-robot
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-03-10-100251
