OCPBUGS-65645: Verify extension packages are installed post node reboot#6010

Open
isabella-janssen wants to merge 1 commit into
openshift:mainfrom
isabella-janssen:ocpbugs-65645-claude

Conversation

@isabella-janssen
Member

@isabella-janssen isabella-janssen commented May 6, 2026

Closes: OCPBUGS-65645

- What I did
Note that this PR was created with assistance from Claude.

This adds a post-reboot verification step that checks that extension packages are actually present in the RPM database before a node is marked as updated. It covers cases where rpm-ostree reports success but the packages are not installed, which made debugging difficult in OCPBUGS-63576, where extension installs appeared to succeed but had not. The implementation follows the pattern that the existing extensions e2e test uses to verify installed packages:

installedPackages = helpers.ExecCmdOnNode(t, cs, infraNode, "chroot", "/rootfs", "rpm", "-q", "pacemaker", "pcs", "fence-agents-all", "libreswan", "usbguard", "kernel-devel", "kernel-headers", "kata-containers", "krb5-workstation", "libkadm5", "sysstat")
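
For readers unfamiliar with the e2e pattern quoted above, the following is a minimal sketch of how the combined output of a multi-package `rpm -q` call could be scanned for missing packages. It assumes rpm's English message format (`package <name> is not installed`); the function name `missingFromRPMQuery` is hypothetical and does not come from the PR or the e2e test.

```go
package main

import "strings"

// missingFromRPMQuery scans the output of `rpm -q pkg1 pkg2 ...` and returns
// the names rpm reported as not installed. Hypothetical helper: it relies on
// rpm's English "package <name> is not installed" message.
func missingFromRPMQuery(output string) []string {
	const prefix, suffix = "package ", " is not installed"
	var missing []string
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, prefix) && strings.HasSuffix(line, suffix) {
			name := strings.TrimSuffix(strings.TrimPrefix(line, prefix), suffix)
			missing = append(missing, name)
		}
	}
	return missing
}
```

Parsing text output is sensitive to locale, so checking the exit code per package, as the PR's log output suggests it does, is likely the more robust choice.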

- How to verify it

  1. Launch a cluster with this fix included.
  2. Apply an MC to install an extension.
  3. Force the install to fail because a package is missing. I mocked this using #6016 (DNM - for testing OCPBUGS-65645).
  4. Wait for the MCP to degrade.
$ oc describe mcp infra
Name:         infra
...
Status:
  Conditions:
...
    Last Transition Time:  2026-05-12T18:07:49Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2026-05-12T18:07:49Z
    Message:               All nodes are updating to MachineConfig rendered-infra-ed47406e1f118becaf627d34a14ca10d
    Reason:                
    Status:                True
    Type:                  Updating
    Last Transition Time:  2026-05-12T18:11:17Z
    Message:               Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv is reporting: "Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv upgrade failure. extension package verification failed: the following extension packages are missing from the RPM database: [sysstat].", Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv is reporting: "extension package verification failed: the following extension packages are missing from the RPM database: [sysstat]."
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2026-05-12T18:11:17Z
    Message:               Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv is reporting: "Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv upgrade failure. extension package verification failed: the following extension packages are missing from the RPM database: [sysstat].", Node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv is reporting: "extension package verification failed: the following extension packages are missing from the RPM database: [sysstat]."
    Reason:                
    Status:                True
    Type:                  Degraded
  Degraded Machine Count:  1
  Machine Count:           1
...
  5. See the degrade errors in the MCD logs.
$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-g7lzz
...
I0512 18:16:18.069431    2671 update.go:1872] Verifying 1 extension packages are installed for config rendered-infra-ed47406e1f118becaf627d34a14ca10d
W0512 18:16:18.084937    2671 update.go:1889] Extension package sysstat not found in RPM database
E0512 18:16:18.085001    2671 writer.go:231] Marking Degraded due to: "extension package verification failed: the following extension packages are missing from the RPM database: [sysstat]."
I0512 18:16:18.094924    2671 daemon.go:643] Error syncing node ci-ln-6srgi0b-72292-8vmxx-worker-b-5nnnv (retries 24): extension package verification failed: the following extension packages are missing from the RPM database: [sysstat].
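
The error text in the logs above is consistent with a two-layer message: an inner error listing the missing packages and an outer wrapper added by the caller. The sketch below reproduces that string. The wrapper format matches the `fmt.Errorf` call shown in the review diff for this PR; the inner format (a slice printed with `%v`) and both function names are assumptions made for illustration.

```go
package main

import "fmt"

// missingPackagesError builds the inner error. Assumption: the real helper
// formats the slice with %v, which prints []string{"sysstat"} as [sysstat].
func missingPackagesError(missing []string) error {
	if len(missing) == 0 {
		return nil
	}
	return fmt.Errorf("the following extension packages are missing from the RPM database: %v.", missing)
}

// wrapVerificationError mirrors the caller-side wrapping with %w.
func wrapVerificationError(err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("extension package verification failed: %w", err)
}
```

Because the wrapper uses `%w`, callers can still unwrap the inner error with `errors.Unwrap` or `errors.Is` if they need to distinguish it.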

- Description for the changelog
OCPBUGS-65645: Verify extension packages are installed post node reboot

Summary by CodeRabbit

  • Improvements
    • Added a CoreOS-only post-reboot verification that ensures configured extension packages are present before finalizing an update.
    • If required packages are missing, the update will halt and report a clear error to aid troubleshooting.
    • Enhanced success/failure diagnostics for extension package verification to make outcomes more transparent.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented May 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

When the daemon finds the node already in the desired config (inDesiredConfig/resumed path), it now invokes CoreOSDaemon.verifyExtensionPackages(state.currentConfig); verification failures cause updateConfigAndState to return an error and abort the resumed completion flow.

Changes

Extension Package Verification

| Layer / File(s) | Summary |
| --- | --- |
| Verification Helper (pkg/daemon/update.go) | Adds func (dn *CoreOSDaemon) verifyExtensionPackages(config *mcfgv1.MachineConfig) error that maps config.Spec.Extensions to expected RPM package names, runs rpm -q (chrooted) for each, aggregates missing packages when rpm -q exits with code 1 into one error, fails fast on other rpm errors, and logs success when all packages are present. |
| Update Flow Integration (pkg/daemon/daemon.go) | In updateConfigAndState (inDesiredConfig/resumed path), after detecting desired config, casts dn to CoreOSDaemon and invokes verifyExtensionPackages(state.currentConfig); on error, returns early and aborts the completion sequence. |
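
The verification helper summarized above can be approximated as follows. This is a minimal sketch, not the PR's code: the `extensionPackages` map is an illustrative subset with assumed extension names, `errNotInstalled` and the injected `query` function are hypothetical stand-ins for the `rpm -q` call, and rejecting unknown extensions is a choice made only for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotInstalled is a hypothetical sentinel meaning "rpm -q exited with 1".
var errNotInstalled = errors.New("package is not installed")

// extensionPackages is an illustrative subset; the real mapping lives in the
// daemon and may differ.
var extensionPackages = map[string][]string{
	"sysstat":  {"sysstat"},
	"kerberos": {"krb5-workstation", "libkadm5"},
	"usbguard": {"usbguard"},
}

// verifyExtensions maps extensions to packages, queries each one, collects
// the missing packages into a single error, and fails fast on any other
// query error.
func verifyExtensions(extensions []string, query func(pkg string) error) error {
	var missing []string
	for _, ext := range extensions {
		pkgs, ok := extensionPackages[ext]
		if !ok {
			return fmt.Errorf("unknown extension %q", ext)
		}
		for _, pkg := range pkgs {
			err := query(pkg)
			switch {
			case err == nil:
				// Package is present; nothing to record.
			case errors.Is(err, errNotInstalled):
				missing = append(missing, pkg)
			default:
				return fmt.Errorf("querying package %s: %w", pkg, err)
			}
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("the following extension packages are missing from the RPM database: %v", missing)
	}
	return nil
}
```

Injecting `query` keeps the aggregation logic testable without an RPM database; in the daemon the equivalent step would shell out to `rpm -q`.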

Sequence Diagram

sequenceDiagram
    participant UpdateFlow as Update Flow
    participant CoreOS as CoreOSDaemon.verifyExtensionPackages
    participant Config as MachineConfig
    participant RPM as RPM DB (chrooted)

    UpdateFlow->>UpdateFlow: Enter inDesiredConfig / resumed path
    UpdateFlow->>CoreOS: verifyExtensionPackages(currentConfig)
    CoreOS->>Config: Read Spec.Extensions
    loop for each expected package
        CoreOS->>RPM: rpm -q <pkg>
        RPM-->>CoreOS: exit code / output
    end
    alt All packages present
        CoreOS-->>UpdateFlow: success
        UpdateFlow->>UpdateFlow: continue resumed completion
    else Missing packages (exit code 1)
        CoreOS-->>UpdateFlow: error (aggregated missing pkgs)
        UpdateFlow->>UpdateFlow: abort completion (return error)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR modifies only production code files (daemon.go, update.go). No test files or Ginkgo test definitions are changed. The custom check for stable test names does not apply. |
| Test Structure And Quality | ✅ Passed | Not applicable. The PR adds functional code to pkg/daemon but does not include any Ginkgo tests. The codebase uses standard Go testing patterns, not Ginkgo, for these modules. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to daemon implementation code that adds extension package verification. Check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Changes are limited to production daemon code (pkg/daemon/daemon.go and pkg/daemon/update.go). The check only applies to new e2e tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds daemon-level extension package verification. No scheduling constraints, affinity rules, topology spread, node selectors, or PDBs are introduced. Changes are purely local node verification. |
| Ote Binary Stdout Contract | ✅ Passed | PR changes operational daemon code, not OTE test framework. New method uses klog (stderr default), no stdout writes in process-level code. Not applicable to OTE binary stdout contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. Changes are limited to daemon implementation files (daemon.go and update.go). IPv6/disconnected network check is not applicable. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly describes the main change: adding post-reboot verification for extension packages, which matches the core objective of the PR. |

@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: isabella-janssen

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/daemon/daemon.go`:
- Line 2374: The nil comparison on dn.os is invalid because dn.os is declared as
a struct (osrelease.OperatingSystem) not a pointer; remove the dn.os != nil
check and directly call dn.os.IsCoreOSVariant() in the conditional (i.e.,
replace the if condition that references dn.os != nil && dn.os.IsCoreOSVariant()
with a single call to dn.os.IsCoreOSVariant()). Ensure no other code assumes
dn.os can be nil and adjust any related logic accordingly.

In `@pkg/daemon/update.go`:
- Around line 1876-1881: Remove the unnecessary inner chroot and differentiate
rpm execution failures from a genuine "package missing" exit; replace
exec.Command("chroot","/rootfs","rpm","-q",pkg) with
exec.Command("rpm","-q",pkg), run the command, and if it returns an
exec.ExitError inspect the exit code (use the ProcessState.ExitCode or
type-assert to *exec.ExitError) — treat exit code 1 as "package missing" (append
to missingPackages and klog.Warningf as before) but for any other non-zero exit
code or non-ExitError log an error via klog.Errorf including the actual error
details (do not classify those as missingPackages); keep existing behavior for
successful runs.
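
The exit-code handling this comment asks for could look roughly like the sketch below. It is illustrative only: `queryResult`, `classifyQueryError`, and `run` are hypothetical names, and the only assumption about rpm is the one stated in the comment, that `rpm -q` exits with code 1 when a single queried package is missing.

```go
package main

import (
	"errors"
	"os/exec"
)

// queryResult is a hypothetical three-way outcome for one `rpm -q <pkg>` call.
type queryResult int

const (
	pkgInstalled queryResult = iota // command exited 0
	pkgMissing                      // command exited 1
	queryFailed                     // any other failure (bad exit code, binary not found, ...)
)

// classifyQueryError separates "package missing" (exit code 1) from genuine
// execution failures, so only the former is reported as a missing package.
func classifyQueryError(err error) queryResult {
	if err == nil {
		return pkgInstalled
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
		return pkgMissing
	}
	return queryFailed
}

// run executes a command and returns its error, e.g. run("rpm", "-q", pkg).
func run(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}
```

A caller would append the package to its missing list on `pkgMissing` and log the underlying error and return on `queryFailed`.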
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f31a60a2-cf6d-43a3-afc0-b8aa2da75f9b

📥 Commits

Reviewing files that changed from the base of the PR and between 48d0c9d and 8891fb7.

📒 Files selected for processing (2)
  • pkg/daemon/daemon.go
  • pkg/daemon/update.go

Comment thread pkg/daemon/daemon.go Outdated
Comment thread pkg/daemon/update.go Outdated
@isabella-janssen isabella-janssen force-pushed the ocpbugs-65645-claude branch 2 times, most recently from 029ff9c to 5626213 Compare May 7, 2026 13:02

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
pkg/daemon/daemon.go (1)

2374-2374: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove invalid nil check on dn.os before extension verification.

Line 2374 compares dn.os to nil, but dn.os is a struct (osrelease.OperatingSystem), so this does not compile. Use dn.os.IsCoreOSVariant() directly.

Minimal fix
-		if dn.os != nil && dn.os.IsCoreOSVariant() {
+		if dn.os.IsCoreOSVariant() {
 			coreOSDaemon := CoreOSDaemon{dn}
 			if err := coreOSDaemon.verifyExtensionPackages(state.currentConfig); err != nil {
 				return missingODC, inDesiredConfig, fmt.Errorf("extension package verification failed: %w", err)
 			}
 		}
#!/bin/bash
set -euo pipefail

# Verify field type and invalid nil-comparison site.
rg -n -C2 'os osrelease\.OperatingSystem|dn\.os != nil|IsCoreOSVariant\(\)' pkg/daemon/daemon.go
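
To illustrate why the nil check is rejected by the compiler, here is a small self-contained sketch. The types `operatingSystem` and `daemon` are hypothetical stand-ins for `osrelease.OperatingSystem` and the daemon struct; the `coreOS` field is invented for the example.

```go
package main

// operatingSystem stands in for osrelease.OperatingSystem. It is a struct
// value, not a pointer or interface, so it can never be nil.
type operatingSystem struct {
	coreOS bool // hypothetical field, for illustration only
}

// IsCoreOSVariant uses a value receiver, so it is always safe to call.
func (o operatingSystem) IsCoreOSVariant() bool {
	return o.coreOS
}

// daemon holds the OS description by value, as the review comment describes.
type daemon struct {
	os operatingSystem
}

// shouldVerifyExtensions calls the method directly. Writing
// `dn.os != nil && dn.os.IsCoreOSVariant()` here would fail to compile,
// because a struct value cannot be compared to nil.
func shouldVerifyExtensions(dn daemon) bool {
	return dn.os.IsCoreOSVariant()
}
```

The zero value of the struct simply reports false here, which is the behavior a nil guard would have tried to provide.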
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/daemon.go` at line 2374, Remove the invalid nil check against
dn.os (which is an osrelease.OperatingSystem value) and call the method
directly; replace the conditional using "dn.os != nil &&
dn.os.IsCoreOSVariant()" with just "dn.os.IsCoreOSVariant()" (references: dn.os,
IsCoreOSVariant, osrelease.OperatingSystem) so the code compiles and uses the
struct receiver correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pkg/daemon/daemon.go`:
- Line 2374: Remove the invalid nil check against dn.os (which is an
osrelease.OperatingSystem value) and call the method directly; replace the
conditional using "dn.os != nil && dn.os.IsCoreOSVariant()" with just
"dn.os.IsCoreOSVariant()" (references: dn.os, IsCoreOSVariant,
osrelease.OperatingSystem) so the code compiles and uses the struct receiver
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3a146b07-94ac-4548-9335-44dcfef01600

📥 Commits

Reviewing files that changed from the base of the PR and between 8891fb7 and 029ff9c.

📒 Files selected for processing (2)
  • pkg/daemon/daemon.go
  • pkg/daemon/update.go


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/daemon/daemon.go (1)

2358-2379: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run extension verification before publishing MachineConfigNodeResumed=true.

The resumed MCN condition is applied before verifyExtensionPackages. If verification fails, this path returns an error after a success-like condition was already published.

Please move the verification block before the resumed-condition update so success state is emitted only after the gate passes.

Suggested reorder
-		err = upgrademonitor.GenerateAndApplyMachineConfigNodes(
-			&upgrademonitor.Condition{State: mcfgv1.MachineConfigNodeResumed, Reason: string(mcfgv1.MachineConfigNodeResumed), Message: fmt.Sprintf("In desired config %s. Resumed normal operations. Applying proper annotations.", state.currentConfig.Name)},
-			nil,
-			metav1.ConditionTrue,
-			metav1.ConditionFalse,
-			dn.node,
-			dn.mcfgClient,
-			dn.fgHandler,
-			pool,
-		)
-		if err != nil {
-			klog.Errorf("Error making MCN for Resumed true: %v", err)
-		}
-
 		// Verify extension packages are actually installed before marking as done
 		// See: https://redhat.atlassian.net/browse/OCPBUGS-65645
 		if dn.os.IsCoreOSVariant() {
 			coreOSDaemon := CoreOSDaemon{dn}
 			if err := coreOSDaemon.verifyExtensionPackages(state.currentConfig); err != nil {
 				return missingODC, inDesiredConfig, fmt.Errorf("extension package verification failed: %w", err)
 			}
 		}
+
+		err = upgrademonitor.GenerateAndApplyMachineConfigNodes(
+			&upgrademonitor.Condition{State: mcfgv1.MachineConfigNodeResumed, Reason: string(mcfgv1.MachineConfigNodeResumed), Message: fmt.Sprintf("In desired config %s. Resumed normal operations. Applying proper annotations.", state.currentConfig.Name)},
+			nil,
+			metav1.ConditionTrue,
+			metav1.ConditionFalse,
+			dn.node,
+			dn.mcfgClient,
+			dn.fgHandler,
+			pool,
+		)
+		if err != nil {
+			klog.Errorf("Error making MCN for Resumed true: %v", err)
+		}
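
The ordering concern behind this reorder can be shown in isolation. In the sketch below, `completeResume`, `failingVerify`, and the two callbacks are hypothetical; the point is only that the success-like side effect runs strictly after the verification gate has passed.

```go
package main

import (
	"errors"
	"fmt"
)

// completeResume runs the verification gate first and publishes the
// "resumed" condition only when the gate passes.
func completeResume(verify func() error, publishResumed func()) error {
	if err := verify(); err != nil {
		// Return before anything success-like has been emitted.
		return fmt.Errorf("extension package verification failed: %w", err)
	}
	publishResumed()
	return nil
}

// failingVerify is a demo verifier that always reports a missing package.
func failingVerify() error {
	return errors.New("sysstat missing")
}
```

With this shape, a failed verification leaves no stale success condition behind for the controller or a human reader to misinterpret.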
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/daemon.go` around lines 2358 - 2379, The MachineConfigNodeResumed
condition is being published before extension package verification; move the
CoreOS-specific verification (invoke
CoreOSDaemon{dn}.verifyExtensionPackages(state.currentConfig)) to run before
calling upgrademonitor.GenerateAndApplyMachineConfigNodes that emits the
MachineConfigNodeResumed condition so that if verifyExtensionPackages returns an
error you return early and never call GenerateAndApplyMachineConfigNodes; update
the error paths accordingly (keep the same error return message) and ensure the
symbols involved are CoreOSDaemon.verifyExtensionPackages, state.currentConfig,
upgrademonitor.GenerateAndApplyMachineConfigNodes, and
mcfgv1.MachineConfigNodeResumed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/daemon/daemon.go`:
- Around line 2358-2379: The MachineConfigNodeResumed condition is being
published before extension package verification; move the CoreOS-specific
verification (invoke
CoreOSDaemon{dn}.verifyExtensionPackages(state.currentConfig)) to run before
calling upgrademonitor.GenerateAndApplyMachineConfigNodes that emits the
MachineConfigNodeResumed condition so that if verifyExtensionPackages returns an
error you return early and never call GenerateAndApplyMachineConfigNodes; update
the error paths accordingly (keep the same error return message) and ensure the
symbols involved are CoreOSDaemon.verifyExtensionPackages, state.currentConfig,
upgrademonitor.GenerateAndApplyMachineConfigNodes, and
mcfgv1.MachineConfigNodeResumed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3e91730-a566-4537-82ce-0a73fda2f778

📥 Commits

Reviewing files that changed from the base of the PR and between 029ff9c and 5626213.

📒 Files selected for processing (2)
  • pkg/daemon/daemon.go
  • pkg/daemon/update.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/daemon/update.go

@isabella-janssen isabella-janssen force-pushed the ocpbugs-65645-claude branch 2 times, most recently from 00443ac to 249d186 Compare May 7, 2026 17:42

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/daemon/daemon.go (1)

2354-2358: 💤 Low value

Add a log line before (and after) the extension package verification for observability.

Without a log entry, when a node degrades with "extension package verification failed" the only trace that verification was even attempted comes from the error itself. A single klog.Infof before the call (and a matching success log) makes it much easier to distinguish "verification ran and found missing packages" from other degradation causes.

♻️ Suggested observability improvement
+		klog.Infof("Verifying extension packages for config %s", state.currentConfig.GetName())
 		if dn.os.IsCoreOSVariant() {
 			coreOSDaemon := CoreOSDaemon{dn}
 			if err := coreOSDaemon.verifyExtensionPackages(state.currentConfig); err != nil {
 				return missingODC, inDesiredConfig, fmt.Errorf("extension package verification failed: %w", err)
 			}
+			klog.Infof("Extension package verification passed for config %s", state.currentConfig.GetName())
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/daemon/daemon.go` around lines 2354 - 2358, Add observability logs around
the CoreOS extension package verification: before calling
coreOSDaemon.verifyExtensionPackages(state.currentConfig) emit a klog.Infof
indicating that extension package verification is starting for the node (include
context such as node ID or config reference if available), and after the call
(on success) emit a klog.Infof noting verification succeeded; keep the existing
error return path unchanged so failures still return the formatted error from
coreOSDaemon.verifyExtensionPackages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@pkg/daemon/daemon.go`:
- Around line 2354-2358: Add observability logs around the CoreOS extension
package verification: before calling
coreOSDaemon.verifyExtensionPackages(state.currentConfig) emit a klog.Infof
indicating that extension package verification is starting for the node (include
context such as node ID or config reference if available), and after the call
(on success) emit a klog.Infof noting verification succeeded; keep the existing
error return path unchanged so failures still return the formatted error from
coreOSDaemon.verifyExtensionPackages.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a638cd4c-9338-4955-bad2-78502469dec9

📥 Commits

Reviewing files that changed from the base of the PR and between 00443ac and 249d186.

📒 Files selected for processing (2)
  • pkg/daemon/daemon.go
  • pkg/daemon/update.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/daemon/update.go

@isabella-janssen isabella-janssen changed the title (WIP) OCPBUGS-65645 OCPBUGS-65645: Verify extension packages are installed post node reboot May 12, 2026
@isabella-janssen isabella-janssen marked this pull request as ready for review May 12, 2026 19:07
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 12, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-65645, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci openshift-ci Bot requested review from djoshy and umohnani8 May 12, 2026 19:08
@isabella-janssen
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 12, 2026
@openshift-ci-robot
Contributor

@isabella-janssen: This pull request references Jira Issue OCPBUGS-65645, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

@openshift-ci openshift-ci Bot requested a review from sergiordlr May 12, 2026 19:08
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@isabella-janssen
Member Author

/test unit

1 similar comment
@isabella-janssen
Member Author

/test unit

@openshift-ci
Contributor

openshift-ci Bot commented May 12, 2026

@isabella-janssen: all tests passed!

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants