METAL-156: Use Live ISO for baremetal bootstrap VM #9814

Merged: openshift-merge-bot[bot] merged 14 commits into openshift:main from zaneb:ipi-live-iso on Mar 10, 2026

@zaneb (Member) commented Jul 1, 2025

Instead of downloading an RHCOS qemu image for the baremetal bootstrap - which disconnected users have to mirror locally - reuse the agent installer code for fetching a live ISO.

If the oc binary is present, the live ISO will be retrieved from the release image (respecting the mirror config), which means disconnected users have nothing else to do. In the case that oc is not present, we fall back to downloading the image directly (this requires a connection to the Internet).

In any case, the old qemu image is no longer used, and if a mirror URL is configured in the install-config it will be ignored.
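
The selection between the two sources described above might be sketched as follows. This is a minimal illustration only, not the installer's actual code: the function name `choose_iso_source` and the returned labels are hypothetical, and the only real call used is Python's `shutil.which`.

```python
import shutil
from typing import Callable, Optional


def choose_iso_source(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Pick where the live ISO comes from (labels are hypothetical).

    "release-image": `oc` is on PATH, so the ISO can be taken from the
    release image, respecting any mirror configuration.
    "direct-download": `oc` is missing, so fall back to downloading the
    image directly, which requires Internet access.
    """
    if which("oc") is not None:
        return "release-image"
    return "direct-download"


print(choose_iso_source())
```

The lookup function is injectable so that both branches can be exercised without changing PATH.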

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 1, 2025
@openshift-ci-robot (Contributor) commented Jul 1, 2025

@zaneb: This pull request references METAL-156 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

@openshift-ci openshift-ci Bot requested review from bfournie and dtantsur July 1, 2025 12:03
@zaneb (Member, Author) commented Jul 2, 2025

/cc @bfournie @honza
Virtual media will keep failing (which breaks all of the e2e metal jobs) until openshift/image-customization-controller#143 merges; it should hopefully pass once that is available.

@openshift-ci openshift-ci Bot requested a review from honza July 2, 2025 20:29
@zaneb zaneb force-pushed the ipi-live-iso branch 2 times, most recently from 4476d96 to 400865d Compare July 15, 2025 09:42
@zaneb (Member, Author) commented Jul 15, 2025

LOL, virtualmedia now works (thanks to openshift/image-customization-controller#143), but PXE (which previously worked) has started failing with:

ironic.common.exception.ImageRefValidationFailed: Validation of image href file:///images/pxe/vmlinuz failed, reason: Security: Path /images/pxe/vmlinuz is not allowed for image source file URLs

due to the fix for OSSA-2025-001.

@zaneb (Member, Author) commented Jul 16, 2025

/retest-required

@zaneb (Member, Author) commented Jul 16, 2025

/test e2e-metal-ipi-ovn-dualstack

@dtantsur (Member)

/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack

@zaneb (Member, Author) commented Aug 26, 2025

/retest

@zaneb (Member, Author) commented Sep 8, 2025

Tests should be fixed again by openshift/image-customization-controller#147 when it lands.

@zaneb zaneb force-pushed the ipi-live-iso branch 2 times, most recently from 4d939c2 to 12107f3 Compare October 15, 2025 10:48
@zaneb (Member, Author) commented Oct 16, 2025

PXE is broken again because metal3-io/ironic-image#681 overrode the default file_url_allowed_paths, and this was merged into our fork in openshift/ironic-image#673.
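
To illustrate the kind of check behind the `ImageRefValidationFailed` error quoted earlier: a `file://` image source is accepted only when its path falls under an allowed directory. The sketch below is not Ironic's implementation; `is_allowed_file_url` and the allow-lists passed to it are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse


def is_allowed_file_url(href: str, allowed_paths: list[str]) -> bool:
    """Return True if a file:// URL points inside one of allowed_paths."""
    parsed = urlparse(href)
    if parsed.scheme != "file":
        return False
    # Normalize first so that ".." segments cannot escape an allowed prefix.
    path = posixpath.normpath(parsed.path)
    for allowed in allowed_paths:
        base = posixpath.normpath(allowed)
        # Match whole path components, so /images-other does not match /images.
        if path == base or path.startswith(base.rstrip("/") + "/"):
            return True
    return False


# Hypothetical allow-lists: the same href passes or fails depending on them.
print(is_allowed_file_url("file:///images/pxe/vmlinuz", ["/images"]))
print(is_allowed_file_url("file:///images/pxe/vmlinuz", ["/var/lib/ironic"]))
```

Under this model, overriding the allow-list with a value that omits the directory holding the PXE kernel is enough to make a previously valid href fail validation.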

@zaneb (Member, Author) commented Oct 16, 2025

openshift/ironic-image#702 should get PXE working again, if nothing else breaks it in the meantime.

@zaneb (Member, Author) commented Oct 20, 2025

/retest

@zaneb (Member, Author) commented Oct 21, 2025

/retest

  Environment="XDG_RUNTIME_DIR=/run/user/${UID}"
  Environment="KUBECONFIG=/opt/openshift/auth/kubeconfig-loopback"
- Environment="DEPLOY_KERNEL_URL=file:///shared/html/images/ironic-python-agent.kernel"
+ Environment="DEPLOY_KERNEL_URL=file:///var/lib/ironic/pxe/vmlinuz"
Member:

Not super critical, but nowadays this variable should be passed to Ironic, not BMO (although I haven't checked if the downstream fork already has the required change).

Member, Author:

It's not being passed to ironic now, so I guess that change hasn't made it to the bootstrap yet.

Member:

It's either Ironic or BMO, both work. But the future-proof way is to pass it to Ironic. No urgency, we'll clean it up one day.

@dtantsur (Member)

/lgtm

from metal platform perspective (but you have a merge conflict again)

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 10, 2025
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Nov 10, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 11, 2025
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2026
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
zaneb added 14 commits March 9, 2026 13:08
Separate the non-agent-specific parts out into the rhcos package,
leaving the agent-specific parts behind in the agent/image package.

Use the same struct to marshal and unmarshal. Presumably this wasn't
previously possible when going through actual Terraform.

The amount of storage assigned to the Live /dev/loop0 partition in a
Live image boot is proportional to the total amount of RAM. With a 6GiB
VM, /dev/loop0 gets a little less than 3GiB.

After running the node-image-pull.service there is typically about
2.8GiB used on /dev/loop0 (of which <200MiB was used before running the
service). In a 6GiB VM this causes the "ostree container image pull"
process to fail due to insufficient free space on the disk.

20GiB of RAM allows us to pull both the ostree container image and the
necessary container images to run the metal bootstrap services.
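
A back-of-the-envelope check of the numbers above, assuming the live filesystem gets at most about half of RAM. That fraction is an assumption, chosen because it is consistent with "a little less than 3GiB" on a 6GiB VM; the exact ratio and overhead are not stated here, and the function name is hypothetical.

```python
def live_storage_gib(ram_gib: float, fraction: float = 0.5) -> float:
    """Approximate upper bound on live-image storage for a given RAM size.

    fraction=0.5 is an assumed ratio, not a measured value.
    """
    return ram_gib * fraction


# From the commit message: usage on /dev/loop0 after node-image-pull.service.
used_after_pull_gib = 2.8

for ram in (6, 20):
    bound = live_storage_gib(ram)
    headroom = bound - used_after_pull_gib
    print(f"{ram}GiB RAM: at most {bound:.1f}GiB storage, "
          f"about {headroom:.1f}GiB headroom")
```

With 6GiB the upper bound leaves almost no headroom, and the real capacity is slightly below that bound, which matches the reported pull failure.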
Instead of downloading an RHCOS qemu image for the baremetal bootstrap -
which disconnected users have to mirror locally - reuse the agent
installer code for fetching a live ISO.

If the `oc` binary is present, the live ISO will be retrieved from the
release image (respecting the mirror config), which means disconnected
users have nothing else to do.

In the case that `oc` is not present, we fall back to downloading the
image directly (this requires a connection to the Internet).

In any case, the old qemu image is no longer used, and if a mirror URL
is configured in the install-config it will be ignored.

If the cluster is in FIPS mode, we should also enable FIPS on the
baremetal bootstrap VM.

When the bootstrap is booted from a live ISO, we have access to both the
ISO image and the PXE boot components in it directly from the host
filesystem; there is no need to extract it again from the release image.
This saves a lot of RAM in the live environment.

We must pass SecurityLabelDisable=true because the mounted live ISO is a
read-only filesystem, and therefore we are unable to relabel the files
with the necessary SELinux context.

We have almost 9GiB of data to store in /var (mostly container images in
the crio storage), and since only 50% of RAM is used for the tmpfs this
requires an inordinate amount of RAM.

Create a volume for mounting as /var so that none of this data ends up
in the tmpfs. This allows us to return to the original RAM allocation of
6GiB. Avoid creating a new tmpfs for node-image-pull.service to pull the
ostree repo, and instead let it land on disk.

This is added to the ignition during the creation of the VM, so other
use cases for the baremetal bootstrap ignition (e.g. assisted installer)
cannot be affected.
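
As a rough illustration of what such a disk configuration could look like, the sketch below builds an Ignition (spec 3.x) fragment that formats a volume and mounts it at /var. It is not the fragment the installer generates: the device name `/dev/vda`, the xfs format, the spec version, and the function name are all assumptions.

```python
import json


def var_volume_ignition(device: str) -> dict:
    """Build an Ignition fragment that formats `device` and mounts it at /var."""
    # Ignition only creates the filesystem; a systemd mount unit mounts it at boot.
    mount_unit = (
        "[Unit]\nBefore=local-fs.target\n\n"
        f"[Mount]\nWhat={device}\nWhere=/var\nType=xfs\n\n"
        "[Install]\nRequiredBy=local-fs.target\n"
    )
    return {
        "ignition": {"version": "3.2.0"},
        "storage": {
            "filesystems": [
                {
                    "device": device,
                    "format": "xfs",
                    "path": "/var",
                    "wipeFilesystem": True,
                }
            ]
        },
        "systemd": {
            "units": [
                {"name": "var.mount", "enabled": True, "contents": mount_unit}
            ]
        },
    }


# "/dev/vda" is a hypothetical device name for the extra volume.
print(json.dumps(var_volume_ignition("/dev/vda"), indent=2))
```

Keeping this as a separate fragment that is merged in only when the VM is created reflects the point made above: other consumers of the bootstrap ignition never see it.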
This has not been required since 4.10, as we now can deploy the RHCOS
live ISO to disk directly, and there is no need for a separate qcow
image.

This field is not required, and it is ignored if provided.

Doing systemctl isolate results in killing node-image-finish.service
itself, which is thereafter reported as a failed unit. Stop the unit
without killing the command so that this does not show up as a failure.

Avoid modifying the bootstrap user ignition, and instead add the disk
configuration for the bootstrap VM to the system ignition.
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 9, 2026
@honza (Member) commented Mar 9, 2026

/lgtm

@zaneb (Member, Author) commented Mar 9, 2026

/retest-required

@openshift-ci Bot (Contributor) commented Mar 9, 2026

@zaneb: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-resourcegroup | 3e58376 | link | false | /test e2e-azure-ovn-resourcegroup |
| ci/prow/e2e-vsphere-host-groups-ovn-techpreview | 3e58376 | link | false | /test e2e-vsphere-host-groups-ovn-techpreview |
| ci/prow/okd-scos-e2e-vsphere-ovn | 0f695f1 | link | false | /test okd-scos-e2e-vsphere-ovn |
| ci/prow/e2e-agent-compact-ipv4-iso-no-registry | 08a9d09 | link | false | /test e2e-agent-compact-ipv4-iso-no-registry |
| ci/prow/e2e-metal-assisted | 08a9d09 | link | false | /test e2e-metal-assisted |
| ci/prow/e2e-metal-ovn-two-node-arbiter | 08a9d09 | link | false | /test e2e-metal-ovn-two-node-arbiter |
| ci/prow/e2e-metal-ovn-two-node-fencing | 08a9d09 | link | false | /test e2e-metal-ovn-two-node-fencing |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 08a9d09 | link | false | /test e2e-metal-ipi-ovn-dualstack |

@sgoveas commented Mar 9, 2026

/verified by sgoveas link

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 9, 2026
@openshift-ci-robot (Contributor)

@sgoveas: This PR has been marked as verified by sgoveas [link](https://issues.redhat.com/browse/METAL-156?focusedCommentId=29229244&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-29229244).

@honza (Member) commented Mar 9, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit ed29c38 into openshift:main Mar 10, 2026
33 of 38 checks passed

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
verified: Signifies that the PR passed pre-merge verification criteria.

7 participants