
[DNM] MCO-2249: Custom Pool Booting on Day 0 prototype #10528

Draft
djoshy wants to merge 4 commits into openshift:main from djoshy:custom-pool-day-0


@djoshy djoshy commented Apr 30, 2026

Implementation details are in docs/design/custom-machine-pools-day0.md, which is included in this PR. Output captured from a sample run follows:

InstallConfig used:

(skipping the boring parts)
.....
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
- architecture: amd64
  hyperthreading: Enabled
  name: grace-blackwell
  platform:
    gcp:
      osImage:
        project: rhcos-cloud
        name: rhcos-9-6-20251212-1-gcp-x86-64
  replicas: 1
- architecture: amd64
  hyperthreading: Enabled
  name: vera-rubin
  platform:
    gcp:
      osImage:
        project: rhcos-cloud
        name: rhcos-9-8-20260403-0-gcp-x86-64
  replicas: 1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3

Note that I've specified a different bootimage for each of the custom pools. Post installation, we have:

~/projects/openshift-installer/download 
$ oc get node -n openshift-machine-config-operator
NAME                                           STATUS   ROLES                    AGE     VERSION
djoshy-dev-104-lfrz2-grace-blackwell-a-vmll8   Ready    grace-blackwell,worker   9m26s   v1.35.3
djoshy-dev-104-lfrz2-master-0                  Ready    control-plane,master     23m     v1.35.3
djoshy-dev-104-lfrz2-master-1                  Ready    control-plane,master     25m     v1.35.3
djoshy-dev-104-lfrz2-master-2                  Ready    control-plane,master     25m     v1.35.3
djoshy-dev-104-lfrz2-vera-rubin-a-jjttg        Ready    vera-rubin,worker        9m33s   v1.35.3
djoshy-dev-104-lfrz2-worker-a-77wlc            Ready    worker                   9m41s   v1.35.3
djoshy-dev-104-lfrz2-worker-b-f2mvt            Ready    worker                   9m36s   v1.35.3
djoshy-dev-104-lfrz2-worker-c-j92jt            Ready    worker                   10m     v1.35.3

~/projects/openshift-installer/download 
$ oc get mcp -n openshift-machine-config-operator
NAME              CONFIG                                                      UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
grace-blackwell   rendered-grace-blackwell-cffc9c9465ea8a10a54e8548cd2814dd   True      False      False      1              1                   1                     0                      26m
master            rendered-master-4430ec896cff75e9184474fb035c29fd            True      False      False      3              3                   3                     0                      21m
vera-rubin        rendered-vera-rubin-cffc9c9465ea8a10a54e8548cd2814dd        True      False      False      1              1                   1                     0                      26m
worker            rendered-worker-cffc9c9465ea8a10a54e8548cd2814dd            True      False      False      3              3                   3                     0                      21m

~/projects/openshift-installer/download 
$ oc get machinesets.machine.openshift.io,machines.machine.openshift.io -n openshift-machine-api 
NAME                                                                     DESIRED   CURRENT   READY   AVAILABLE   AGE
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-grace-blackwell-a   1         1         1       1           26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-grace-blackwell-b   0         0                             26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-grace-blackwell-c   0         0                             26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-vera-rubin-a        1         1         1       1           26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-vera-rubin-b        0         0                             26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-vera-rubin-c        0         0                             26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-worker-a            1         1         1       1           26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-worker-b            1         1         1       1           26m
machineset.machine.openshift.io/djoshy-dev-104-lfrz2-worker-c            1         1         1       1           26m

NAME                                                                        PHASE     TYPE            REGION     ZONE         AGE
machine.machine.openshift.io/djoshy-dev-104-lfrz2-grace-blackwell-a-vmll8   Running   n2-standard-4   us-east4   us-east4-a   19m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-master-0                  Running   n2-standard-4   us-east4   us-east4-a   26m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-master-1                  Running   n2-standard-4   us-east4   us-east4-b   26m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-master-2                  Running   n2-standard-4   us-east4   us-east4-c   26m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-vera-rubin-a-jjttg        Running   n2-standard-4   us-east4   us-east4-a   19m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-worker-a-77wlc            Running   n2-standard-4   us-east4   us-east4-a   19m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-worker-b-f2mvt            Running   n2-standard-4   us-east4   us-east4-b   19m
machine.machine.openshift.io/djoshy-dev-104-lfrz2-worker-c-j92jt            Running   n2-standard-4   us-east4   us-east4-c   19m

Examining the aleph version on each custom-pool node confirms that it booted from the bootimage defined for its pool:

$ oc debug node/djoshy-dev-104-lfrz2-grace-blackwell-a-vmll8 -- chroot /host cat /sysroot/.coreos-aleph-version.json
{
     ......
    "version": "9.6.20251212-1"
}
$ oc debug node/djoshy-dev-104-lfrz2-vera-rubin-a-jjttg -- chroot /host cat /sysroot/.coreos-aleph-version.json
{
    .....
    "version": "9.8.20260403-0"
}

Both nodes pivoted to the rhel-9 stream after firstboot, as that is the default currently.

Summary by CodeRabbit

  • New Features
    • Introduced custom machine pools at Day 0 installation on GCP with DNS-compliant naming and a 5-pool limit per cluster.
    • Custom pools receive dedicated configuration, node selection, and ignition settings.
    • Worker replicas automatically balance when custom pools are added.
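The naming and count rules summarized above can be sketched roughly as follows. This is an illustration only, not the PR's code: `reservedPoolNames`, `isCustomPool`, and `validatePoolNames` are hypothetical stand-ins for `ReservedMachinePoolNames`, `IsCustomPool()`, and `validateCompute`, and the exact contents of the reserved set are assumed. The regular expression is the usual Kubernetes DNS-1123 label pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// reservedPoolNames is an ASSUMED stand-in for the PR's ReservedMachinePoolNames;
// the real set may contain different entries.
var reservedPoolNames = map[string]bool{
	"master": true, "worker": true, "edge": true, "arbiter": true,
}

// dns1123Label is the standard Kubernetes DNS-1123 label pattern.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// maxCustomPools is the 5-pool limit quoted in the summary.
const maxCustomPools = 5

// isCustomPool reports whether name is neither empty nor reserved.
func isCustomPool(name string) bool {
	return name != "" && !reservedPoolNames[name]
}

// validatePoolNames returns one error per rule violation among custom pools.
func validatePoolNames(platform string, names []string) []error {
	var errs []error
	custom := 0
	for _, n := range names {
		if !isCustomPool(n) {
			continue // built-in roles are validated elsewhere
		}
		custom++
		if platform != "gcp" {
			errs = append(errs, fmt.Errorf("pool %q: custom pools are only supported on gcp", n))
		}
		if len(n) > 63 || !dns1123Label.MatchString(n) {
			errs = append(errs, fmt.Errorf("pool %q: not a valid DNS-1123 label", n))
		}
	}
	if custom > maxCustomPools {
		errs = append(errs, fmt.Errorf("%d custom pools exceeds the limit of %d", custom, maxCustomPools))
	}
	return errs
}

func main() {
	// Pool names taken from the sample InstallConfig in the PR description.
	fmt.Println(validatePoolNames("gcp", []string{"worker", "grace-blackwell", "vera-rubin"}))
	// A non-GCP platform with a malformed name trips two rules.
	fmt.Println(validatePoolNames("aws", []string{"worker", "Bad_Name"}))
}
```

Collecting every violation instead of stopping at the first one is a choice made for this sketch; the actual validation may report errors differently.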

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2026

openshift-ci Bot commented Apr 30, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


coderabbitai Bot commented Apr 30, 2026

Walkthrough

The changes introduce Day 0 custom machine pool support by adding validation for arbitrary DNS-label-compliant pool names (with platform and count restrictions), generating per-pool ignition configurations and MachineConfigPool manifests, integrating custom pool asset generation into the Worker asset, and auto-balancing worker replica counts when not explicitly specified.
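One possible reading of the replica auto-balancing mentioned in the walkthrough is sketched below. The `pool` type and `balanceWorkerReplicas` function are hypothetical, and clamping the result at zero is an assumption; the PR's design document is the authority on the real behavior.

```go
package main

import "fmt"

// pool is a hypothetical, trimmed-down stand-in for the installer's MachinePool type.
type pool struct {
	Name     string
	Replicas int64
}

// balanceWorkerReplicas returns the worker replica count after balancing.
// When the user set worker replicas explicitly, the value is left untouched;
// otherwise the custom pools' replica total is subtracted from it.
// Clamping at zero is an ASSUMPTION made for this sketch.
func balanceWorkerReplicas(workerReplicas int64, explicit bool, custom []pool) int64 {
	if explicit {
		return workerReplicas
	}
	var total int64
	for _, p := range custom {
		total += p.Replicas
	}
	if remaining := workerReplicas - total; remaining > 0 {
		return remaining
	}
	return 0
}

func main() {
	custom := []pool{{"grace-blackwell", 1}, {"vera-rubin", 1}}
	// Explicit worker replicas, as in the sample InstallConfig.
	fmt.Println(balanceWorkerReplicas(3, true, custom))
	// Defaulted worker replicas are reduced by the custom pool total.
	fmt.Println(balanceWorkerReplicas(3, false, custom))
}
```

In the sample run from the PR description, `replicas: 3` is set explicitly on the worker pool, which is consistent with three workers coming up alongside the two custom-pool nodes.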

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Design Documentation | docs/design/custom-machine-pools-day0.md | New design document specifying custom machine pool support including validation rules, bootstrap MachineConfigPool generation, per-pool ignition behavior, GCP-specific boot image handling, and replica auto-balancing logic. |
| Type Definitions | pkg/types/machinepools.go | Adds ReservedMachinePoolNames set and new IsCustomPool() helper function to identify pool names that are neither empty nor reserved (built-in roles). |
| Validation & Defaults | pkg/types/validation/installconfig.go, pkg/types/defaults/machinepools.go | validateCompute now allows custom pool names as DNS1123 labels (non-reserved, GCP-only, max 5 per install), while SetMachinePoolDefaults treats custom pools like edge/arbiter roles for default replica assignment (defaulting to 0). |
| Ignition Generation | pkg/asset/ignition/machine/custompool.go | New CustomPool asset generates per-pool ignition files by iterating compute entries, filtering custom pools, and producing <pool-name>.ign output files with marshaled ignition configurations. |
| MachineConfigPool Manifests | pkg/asset/machines/machineconfigpool/manifest.go | New ForCustomPool() and Manifest() functions create MachineConfigPool objects with role-based selectors (matchExpressions on ["worker", poolName]) and node-level placement via NodeSelector on pool name. |
| GCP Machine Support | pkg/asset/machines/gcp/machines.go | Updated getNetworks and getTags to handle custom pool roles by falling back to compute subnet and generating worker-compatible tags with role-specific suffixes, removing panics on unrecognized roles. |
| Worker Asset Integration | pkg/asset/machines/worker.go | Extends Worker asset with custom pool support: detects explicit worker replica counts, performs replica auto-balancing by subtracting custom pool totals when not explicit, generates per-pool GCP MachineSets with pool-specific user-data secrets and ignition paths (/config/<poolName>), and appends MachineConfigPool manifests and user-data secret files for each custom pool. |
| Test Updates | pkg/asset/machines/worker_test.go | Updates test setup to include machine.CustomPool{} as a dependency in asset parent collections. |
| Boot Image Handling | pkg/asset/manifests/mco.go | Modified gcpBootImages to scan all compute pools for a non-nil GCP OSImage instead of checking only the first entry, preserving early exit once a matching pool is found. |
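To make the MachineConfigPool entry above more concrete, here is a rough sketch of the object shape it describes. The struct types are local stand-ins rather than the real MCO API types, `forCustomPool` is a hypothetical analogue of `ForCustomPool()`, and the label keys (`machineconfiguration.openshift.io/role`, `node-role.kubernetes.io/<pool>`) are assumed from the conventional custom-pool pattern rather than taken from the PR's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the Kubernetes label selector and MCO pool types.
type requirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values"`
}

type selector struct {
	MatchExpressions []requirement     `json:"matchExpressions,omitempty"`
	MatchLabels      map[string]string `json:"matchLabels,omitempty"`
}

type poolSpec struct {
	MachineConfigSelector selector `json:"machineConfigSelector"`
	NodeSelector          selector `json:"nodeSelector"`
}

type machineConfigPool struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   map[string]string `json:"metadata"`
	Spec       poolSpec          `json:"spec"`
}

// forCustomPool builds a pool that inherits worker MachineConfigs and selects
// nodes carrying the pool's own role label. Label keys are ASSUMED.
func forCustomPool(name string) machineConfigPool {
	return machineConfigPool{
		APIVersion: "machineconfiguration.openshift.io/v1",
		Kind:       "MachineConfigPool",
		Metadata:   map[string]string{"name": name},
		Spec: poolSpec{
			MachineConfigSelector: selector{
				MatchExpressions: []requirement{{
					Key:      "machineconfiguration.openshift.io/role",
					Operator: "In",
					Values:   []string{"worker", name},
				}},
			},
			NodeSelector: selector{
				MatchLabels: map[string]string{"node-role.kubernetes.io/" + name: ""},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(forCustomPool("grace-blackwell"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Selecting MachineConfigs with role `worker` or the pool name would explain why, in the sample run, the custom pools' rendered configs share a hash with the worker pool's.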

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | New functionality in custompool.go and manifest.go lacks corresponding unit tests, representing a gap in test coverage. | Create custompool_test.go and manifest_test.go to test the new asset type, methods, and exported functions with proper test coverage. |
| Title check | ❓ Inconclusive | The title references the main feature (custom pool booting on Day 0) but is prefixed with [DNM], indicating it's not meant to merge. | Clarify the [DNM] prefix intention: if this is a draft/prototype PR, consider removing [DNM] if ready for review; if truly not-for-merge, confirm this is the intended final state. |
✅ Passed checks (9 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The pull request does not introduce problematic test names. All existing test case names are static and descriptive without dynamic information. |
| Microshift Test Compatibility | ✅ Passed | PR does not introduce new Ginkgo e2e tests; only standard Go unit test modifications present. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR does not introduce any new Ginkgo e2e tests. The repository is the OpenShift installer, not a test suite, and changes consist of design documentation and installer asset generation code for custom machine pools support. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR introduces infrastructure-level custom machine pools without pod scheduling constraints or HA topology assumptions. |
| Ote Binary Stdout Contract | ✅ Passed | No process-level stdout writes (init, main, TestMain, BeforeSuite, AfterSuite functions) found in any modified files. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR does not add any Ginkgo e2e tests. All test files use standard Go testing package rather than Ginkgo patterns, so the custom check is not applicable. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |



openshift-ci Bot commented Apr 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sdodson for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
pkg/asset/machines/worker.go (1)

342-378: 💤 Low value

Worker replica detection via raw YAML parsing is functional but fragile.

The approach of re-parsing the raw install-config YAML to detect whether replicas was explicitly set works, but couples this logic to the YAML structure. If the install-config parsing changes or the field is renamed, this could silently break.

Consider documenting this dependency or adding a test that verifies the detection works correctly when replicas are explicitly set vs defaulted.
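A minimal sketch of the kind of test this suggests is shown below. It is not the PR's code: the `rawIC` shape is guessed from the review text, the pool name `worker` stands in for `types.MachinePoolComputeRoleName`, and `encoding/json` replaces `yaml.Unmarshal` only so the example has no third-party dependency. The pointer-based detection works the same way in both decoders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rawIC keeps only the fields needed for detection. Replicas is a pointer so
// that an absent key stays nil instead of collapsing to zero.
// The struct shape is an ASSUMPTION based on the review comment.
type rawIC struct {
	Compute []struct {
		Name     string `json:"name"`
		Replicas *int64 `json:"replicas"`
	} `json:"compute"`
}

// workerReplicasExplicitlySet reports whether the worker compute pool in the
// raw install-config data carries an explicit replicas value.
func workerReplicasExplicitlySet(data []byte) (bool, error) {
	var ic rawIC
	if err := json.Unmarshal(data, &ic); err != nil {
		return false, err
	}
	for _, c := range ic.Compute {
		if c.Name == "worker" {
			return c.Replicas != nil, nil
		}
	}
	return false, nil
}

func main() {
	// Table-driven cases: explicit value, defaulted value, and no worker pool.
	cases := []struct {
		name string
		doc  string
	}{
		{"explicit", `{"compute":[{"name":"worker","replicas":3}]}`},
		{"defaulted", `{"compute":[{"name":"worker"}]}`},
		{"no worker pool", `{"compute":[{"name":"vera-rubin","replicas":1}]}`},
	}
	for _, tc := range cases {
		got, err := workerReplicasExplicitlySet([]byte(tc.doc))
		fmt.Printf("%s: explicit=%v err=%v\n", tc.name, got, err)
	}
}
```

A test along these lines would fail loudly if the `compute`, `name`, or `replicas` keys were ever renamed, which is the coupling the comment warns about.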

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/asset/machines/worker.go` around lines 342 - 378, Add a unit test and
brief comment to make the fragile raw-YAML detection explicit: create tests for
the logic around workerReplicasExplicitlySet (the rawIC struct + yaml.Unmarshal
of installConfig.File.Data) that cover cases where the compute pool for
types.MachinePoolComputeRoleName explicitly sets replicas and where it relies on
defaults; in the code near the rawIC parsing add a short comment calling out the
dependency on the install-config YAML shape so future refactors notice the
coupling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 855c7881-ae9e-467e-aa6a-1e4c80873a82

📥 Commits

Reviewing files that changed from the base of the PR and between 73b340a and e46d8b2.

📒 Files selected for processing (10)
  • docs/design/custom-machine-pools-day0.md
  • pkg/asset/ignition/machine/custompool.go
  • pkg/asset/machines/gcp/machines.go
  • pkg/asset/machines/machineconfigpool/manifest.go
  • pkg/asset/machines/worker.go
  • pkg/asset/machines/worker_test.go
  • pkg/asset/manifests/mco.go
  • pkg/types/defaults/machinepools.go
  • pkg/types/machinepools.go
  • pkg/types/validation/installconfig.go

@djoshy djoshy changed the title from "[DNM] Custom Pool Booting on Day 0 prototype" to "[DNM] MCO-2249: Custom Pool Booting on Day 0 prototype" May 1, 2026

openshift-ci-robot commented May 1, 2026

@djoshy: This pull request references MCO-2249 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 1, 2026