Skip to content

Conversation

@stuggi
Copy link
Contributor

@stuggi stuggi commented Jan 5, 2026

Problem:
When deploying the openstack-operator on clusters where the OpenShift Machine Config Operator (MCO) is not installed, the openstackdataplanenodeset controller fails to start with the error:

ERROR setup problem running manager {"error": "failed to wait for
openstackdataplanenodeset caches to sync: timed out waiting for cache
to be synced for Kind *v1.MachineConfig"}

This occurs because the controller unconditionally sets up a watch for MachineConfig resources in SetupWithManager(). When the MachineConfig CRD doesn't exist (e.g., on non-OpenShift Kubernetes clusters or clusters without MCO), the informer cache sync times out and the controller fails to start.

Similarly, the disconnected environment detection code (IsDisconnectedOCP) would fail if ImageContentSourcePolicy or ImageDigestMirrorSet CRDs don't exist.

Solution:
Implement conditional/dynamic CRD watching for MachineConfig:

  1. Remove the MachineConfig watch from SetupWithManager() so the controller can start without the CRD being present

  2. Add ensureMachineConfigWatch() function that:

    • Checks if the MachineConfig CRD exists by querying apiextensions.k8s.io/v1/CustomResourceDefinition
    • Dynamically adds the watch using Controller.Watch() if the CRD exists
    • Tracks watched resources to avoid duplicate watch registration
    • Logs an informational message if the CRD is not available
  3. Call ensureMachineConfigWatch() at the start of each reconciliation to attempt setting up the watch when the CRD becomes available

  4. Update IsDisconnectedOCP() to handle missing ICSP/IDMS CRDs gracefully instead of returning an error

  5. Update inventory generation to log a warning instead of failing when MachineConfig is not available in a disconnected environment

This allows the operator to:

  • Start successfully even without the MachineConfig CRD
  • Work on non-OpenShift Kubernetes clusters
  • Gracefully degrade disconnected environment support
  • Automatically enable MachineConfig watching if the CRD becomes available later

The MachineConfig watch is used for disconnected/mirrored environments to detect changes to the 99-master-generated-registries MachineConfig and propagate registry configuration to dataplane nodes. This feature is now optional rather than required.

Manual registry configuration without MCO:
If the MachineConfig CRD is not available but you need to configure container registries on dataplane nodes, you can set the ansible variables directly in the OpenStackDataPlaneNodeSet spec:

  spec:
    nodeTemplate:
      ansible:
        ansibleVars:
          edpm_podman_disconnected_ocp: true
          edpm_podman_registries_conf: |
            unqualified-search-registries = ["registry.access.redhat.com"]
            [[registry]]
              prefix = ""
              location = "quay.io/openstack-k8s-operators"
              [[registry.mirror]]
                location = "my-registry.example.com/openstack-k8s-operators"

Fixes: cache sync timeout when MachineConfig CRD is not installed

Related: OSPRH-24026

@dprince
Copy link
Contributor

dprince commented Jan 5, 2026

approach here seems reasonable to me. will let others weigh in

@stuggi
Copy link
Contributor Author

stuggi commented Jan 7, 2026

/retest

@stuggi stuggi requested a review from slagle January 7, 2026 07:54
Copy link
Contributor

@abays abays left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 7, 2026
@abays
Copy link
Contributor

abays commented Jan 7, 2026

Deferring to @slagle for final approval

Copy link
Contributor

@slagle slagle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, no problems with the change.
I have an inline comment for potential improvement (assuming I'm reading it correctly).

helper.GetLogger().Info("Disconnected environment detected but MachineConfig not available. "+
"Registry configuration will not be propagated to dataplane nodes. "+
"You may need to configure registries.conf manually.",
"error", err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably ok for now, but I think we should file an issue to:

if isDisconnected and <<MachineConfig CRD doesn't exist>>; then
  error and fail reconcile

Because I think it's pretty likely that this logged error goes completely unnoticed. Previously we failed hard if there was an error, now we just log and continue.

Should isNoMatchError be used to check if the error was actually a MachineConfig CRD not found error?

@openshift-ci openshift-ci bot removed the lgtm label Jan 8, 2026
Comment on lines +151 to +160
if util.IsNoMatchError(err) {
helper.GetLogger().Info("Disconnected environment detected but MachineConfig CRD not available. "+
"Registry configuration will not be propagated to dataplane nodes. "+
"You may need to configure registries.conf manually using ansibleVars "+
"(edpm_podman_disconnected_ocp and edpm_podman_registries_conf).",
"error", err.Error())
} else {
// CRD exists but resource not found, or other errors (network issues,
// permissions, etc.) - return the error. If MCO is installed but the
// registry MachineConfig doesn't exist, this indicates a misconfiguration.
return "", fmt.Errorf("failed to get MachineConfig registry configuration: %w", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slagle iiuc this is what you meant, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, looks good

stuggi and others added 2 commits January 8, 2026 09:21
Problem:
When deploying the openstack-operator on clusters where the OpenShift
Machine Config Operator (MCO) is not installed, the
openstackdataplanenodeset controller fails to start with the error:

  ERROR setup problem running manager {"error": "failed to wait for
  openstackdataplanenodeset caches to sync: timed out waiting for cache
  to be synced for Kind *v1.MachineConfig"}

This occurs because the controller unconditionally sets up a watch for
MachineConfig resources in SetupWithManager(). When the MachineConfig
CRD doesn't exist (e.g., on non-OpenShift Kubernetes clusters or
clusters without MCO), the informer cache sync times out and the
controller fails to start.

Similarly, the disconnected environment detection code
(IsDisconnectedOCP) would fail if ImageContentSourcePolicy or
ImageDigestMirrorSet CRDs don't exist.

Solution:
Implement conditional/dynamic CRD watching for MachineConfig:

1. Remove the MachineConfig watch from SetupWithManager() so the
   controller can start without the CRD being present

2. Add ensureMachineConfigWatch() function that:
   - Checks if the MachineConfig CRD exists by querying
     apiextensions.k8s.io/v1/CustomResourceDefinition
   - Dynamically adds the watch using Controller.Watch() if the CRD
     exists
   - Tracks watched resources to avoid duplicate watch registration
   - Logs an informational message if the CRD is not available

3. Call ensureMachineConfigWatch() at the start of each reconciliation
   to attempt setting up the watch when the CRD becomes available

4. Update IsDisconnectedOCP() to handle missing ICSP/IDMS CRDs
   gracefully instead of returning an error

5. Update inventory generation error handling to distinguish between:
   - IsNoMatchError (CRD not installed): Log warning and continue.
     This is expected on non-OpenShift clusters or clusters without MCO.
   - IsNotFound (CRD exists but resource missing): Return error.
     If MCO is installed and a disconnected environment is detected,
     the registry MachineConfig should exist. Missing resource indicates
     misconfiguration.

This allows the operator to:
- Start successfully even without the MachineConfig CRD
- Work on non-OpenShift Kubernetes clusters
- Gracefully degrade disconnected environment support
- Automatically enable MachineConfig watching if the CRD becomes
  available later
- Report actual misconfigurations when MCO is present but registry
  MachineConfig is missing

The MachineConfig watch is used for disconnected/mirrored environments
to detect changes to the 99-master-generated-registries MachineConfig
and propagate registry configuration to dataplane nodes. This feature
is now optional rather than required.

Manual registry configuration without MCO:
If the MachineConfig CRD is not available but you need to configure
container registries on dataplane nodes, you can set the ansible
variables directly in the OpenStackDataPlaneNodeSet spec:

  spec:
    nodeTemplate:
      ansible:
        ansibleVars:
          edpm_podman_disconnected_ocp: true
          edpm_podman_registries_conf: |
            unqualified-search-registries = ["registry.access.redhat.com"]
            [[registry]]
              prefix = ""
              location = "quay.io/openstack-k8s-operators"
              [[registry.mirror]]
                location = "my-registry.example.com/openstack-k8s-operators"

Fixes: cache sync timeout when MachineConfig CRD is not installed

Related: OSPRH-24026

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Martin Schuppert <mschuppert@redhat.com>
Add comprehensive unit tests for the disconnected environment detection
and registry configuration retrieval functions in the util package.

Tests cover:
- IsNoMatchError: Detects both 'no matches for kind' (real API server) and
  'no kind is registered' (fake client) errors
- IsDisconnectedOCP: Tests with ICSP, IDMS, both, empty lists, and
  missing CRDs (graceful degradation)
- GetMCRegistryConf: Tests successful retrieval, missing MachineConfig,
  missing CRD, invalid ignition format, missing base64 prefix, and
  invalid base64 content
- Full scenario tests for disconnected environments and non-OpenShift
  cluster graceful degradation

The IsNoMatchError function is updated to also detect 'no kind is registered'
errors which occur in functional tests where the fake client behaves
differently than a real Kubernetes API server.

The MachineConfigNotFound test verifies that IsNotFound errors are NOT
treated as IsNoMatchError - when the CRD exists but the resource doesn't,
this should be an error (not a warning), indicating misconfiguration.
@softwarefactory-project-zuul
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/42ded77318af4a1db77f49124441f538

openstack-k8s-operators-content-provider FAILURE in 10m 00s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ openstack-operator-tempest-multinode SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@stuggi
Copy link
Contributor Author

stuggi commented Jan 8, 2026

/retest

Comment on lines +151 to +160
if util.IsNoMatchError(err) {
helper.GetLogger().Info("Disconnected environment detected but MachineConfig CRD not available. "+
"Registry configuration will not be propagated to dataplane nodes. "+
"You may need to configure registries.conf manually using ansibleVars "+
"(edpm_podman_disconnected_ocp and edpm_podman_registries_conf).",
"error", err.Error())
} else {
// CRD exists but resource not found, or other errors (network issues,
// permissions, etc.) - return the error. If MCO is installed but the
// registry MachineConfig doesn't exist, this indicates a misconfiguration.
return "", fmt.Errorf("failed to get MachineConfig registry configuration: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, looks good

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slagle, stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit c92ad83 into openstack-k8s-operators:main Jan 8, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants