
feat: create clusters service control plane in backend#4821

Draft
JakobGray wants to merge 4 commits into main from jagray/ARO-24824-async-install

Conversation

@JakobGray
Collaborator

What

Create clusters service control plane cluster with backend controller

  • Adds backend controller for clusters service control plane create
  • Creates control plane with computed desired version
  • Creates cluster with cluster service cluster ID initially missing

Why

The clusters service control plane is currently created synchronously during the create flow and uses hard-coded versions derived from the customer's desired version. Moving to an asynchronous approach reduces create time and lets the clusters service deployment be managed in the background. It will also give us time to look up the desired version from Cincinnati.
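The asynchronous flow described here can be sketched roughly as follows. This is a hypothetical, minimal illustration, not the real ARO-HCP types: `Cluster`, `Store`, `ClusterService`, and `SyncOnce` are stand-ins for the actual Cosmos documents, CS client, and backend controller.

```go
package main

import "fmt"

// Cluster is an illustrative stand-in for the Cosmos cluster document.
type Cluster struct {
	Name             string
	ClusterServiceID string
	DesiredVersion   string
}

// Store is a stand-in for the Cosmos document store.
type Store map[string]*Cluster

// ClusterService is a fake clusters service client.
type ClusterService struct{ nextID int }

// Create registers the cluster with the (fake) clusters service and
// returns the new clusters-service ID.
func (cs *ClusterService) Create(c *Cluster) string {
	cs.nextID++
	return fmt.Sprintf("cs-%d", cs.nextID)
}

// SyncOnce is the backend controller step: it only acts on clusters the
// frontend stored without a ClusterServiceID, registers them with the
// clusters service, and persists the returned ID.
func SyncOnce(store Store, cs *ClusterService, name string) error {
	c, ok := store[name]
	if !ok {
		return fmt.Errorf("cluster %q not found", name)
	}
	if len(c.ClusterServiceID) > 0 {
		return nil // already registered; nothing to do
	}
	c.ClusterServiceID = cs.Create(c)
	return nil
}

func main() {
	// The frontend stores the cluster without a ClusterServiceID and returns.
	store := Store{"demo": {Name: "demo", DesiredVersion: "4.19.0"}}
	cs := &ClusterService{}
	_ = SyncOnce(store, cs, "demo") // first pass registers the cluster
	_ = SyncOnce(store, cs, "demo") // second pass is a no-op
	fmt.Println(store["demo"].ClusterServiceID) // prints "cs-1"
}
```

In this sketch the frontend only writes the document with an empty ClusterServiceID; the controller's periodic sync picks it up later and becomes a no-op once the ID is persisted.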

Testing

Special notes for your reviewer

Depends on #4752 to allow missing cluster service cluster ID

Matthew Barnes and others added 4 commits April 8, 2026 22:21
Pass the tenant ID value instead of requiring client request
headers. This function will soon be called from the RP backend
where client request headers are not available.
Just to ensure consistency with what the backend will be using.
The backend won't have access to the X-Ms-Home-Tenant-Id client
request header.
Move the synchronous PostCluster call out of the frontend's ARM PUT
handler and into a new ClusterServiceCreateController. The frontend
now stores the cluster in Cosmos without a ClusterServiceID and returns
immediately.
@openshift-ci

openshift-ci bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: JakobGray
Once this PR has been reviewed and has the lgtm label, please assign mbarnes for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci bot commented Apr 9, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


// NewClusterServiceCreateController creates a controller that registers clusters
// with Cluster Service once their desired control plane version is computed.
func NewClusterServiceCreateController(
Collaborator

@mbarnes mbarnes Apr 10, 2026

Probably want to clarify that this is for cluster creation. I assume nested resource types like node pools will have separate controllers from this one.

Suggested change:
- func NewClusterServiceCreateController(
+ func NewClusterServiceCreateClusterController(

machi1990 added a commit to machi1990/ARO-HCP that referenced this pull request Apr 13, 2026
This also bumps it for 4.19 as it is still around.

This is an interim bump until Azure#4821 is merged, which will allow us to automatically pick the latest.
func (f *Frontend) updateHCPClusterInCosmos(ctx context.Context, writer http.ResponseWriter, request *http.Request, httpStatusCode int, newInternalCluster, oldInternalCluster *api.HCPOpenShiftCluster) error {
logger := utils.LoggerFromContext(ctx)

if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

http.StatusConflict,
arm.CloudErrorCodeConflict,
oldInternalCluster.ID.String(),
"The cluster is still being registered with Cluster Service. Please retry shortly.")
Collaborator

How can this happen, i.e. how can the update be triggered before the cluster reaches a terminal state?

// cluster is returned as-is from Cosmos without CS-only fields.
// TODO remove the header it takes and collapse that to some general error handling.
func (f *Frontend) readInternalClusterFromClusterService(ctx context.Context, oldInternalCluster *api.HCPOpenShiftCluster) (*api.HCPOpenShiftCluster, error) {
if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

Comment on lines +78 to +80
ClusterServiceProvisionShard string
ClusterServiceNoopProvision bool
ClusterServiceNoopDeprovision bool
Collaborator

Are these things we should continue to honor? I think we are way past that "noop" phase now that we have everything wired up.

// operation is initially created without an InternalID. Look it up
// from the cluster document's ClusterServiceID.
internalID := operation.InternalID
if internalID.String() == "" {
Collaborator

len(s) == 0

return utils.TrackError(fmt.Errorf("failed to get cluster: %w", err))
}
internalID = cluster.ServiceProviderProperties.ClusterServiceID
if internalID.String() == "" {
Collaborator

len(s) == 0

ret = append(ret, cluster)
existingCluster, exists := clusterServiceIDToCluster[cluster.ServiceProviderProperties.ClusterServiceID.String()]
csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
if csID == "" {
Collaborator

len(s) == 0

for _, cluster := range allHCPClusters.Items(ctx) {
ret = append(ret, cluster)
existingCluster, exists := clusterServiceIDToCluster[cluster.ServiceProviderProperties.ClusterServiceID.String()]
csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
Collaborator

Suggested change:
- csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
+ clusterServiceID := cluster.ServiceProviderProperties.ClusterServiceID.String()

}

// Skip clusters that don't have a ClusterServiceID yet (CS creation pending).
if cosmosCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

throughout
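The `len(s) == 0` and `s == ""` checks are equivalent in Go; the review simply asks for the `len` form consistently throughout. A trivial illustration, with a hypothetical helper name:

```go
package main

import "fmt"

// isUnset reports whether an ID string is empty.
// len(id) == 0 is equivalent to id == ""; the review prefers the len form.
func isUnset(id string) bool {
	return len(id) == 0
}

func main() {
	fmt.Println(isUnset(""), isUnset("cs-123")) // prints "true false"
}
```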

Comment on lines +43 to +45
provisionShard string
noopProvision bool
noopDeprovision bool
Collaborator

I don't think we need these anymore; we are now able to create clusters end to end, and I don't see the merit of continuing to carry this NOOP provisioning/deprovisioning logic.

Collaborator

+1 to dropping this baggage.

// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
// UUID so this is safe for initial version computation before CS creation.
var clusterUUID uuid.UUID
if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

Suggested change:
- if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+ if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {

// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
// UUID so this is safe for initial version computation before CS creation.
var clusterUUID uuid.UUID
if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

let's also do the same here

func (c *triggerControlPlaneUpgradeSyncer) SyncOnce(ctx context.Context, key controllerutils.HCPClusterKey) error {

  • adjusting the logic there to:
  • not trigger the upgrade if the CS id isn't set
  • not trigger the upgrade if CS's version == desired version (not strictly needed, but we can do it to avoid creating a policy for nothing)
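The two guards suggested above could look roughly like this; `hcpCluster` and `shouldTriggerUpgrade` are illustrative names, not the real syncer types:

```go
package main

import "fmt"

// hcpCluster is an illustrative stand-in for the cluster document fields
// the upgrade syncer would consult.
type hcpCluster struct {
	ClusterServiceID string
	CurrentVersion   string
	DesiredVersion   string
}

// shouldTriggerUpgrade returns false when the cluster is not yet registered
// with the clusters service, or when it is already at the desired version
// (avoiding the creation of an upgrade policy for nothing).
func shouldTriggerUpgrade(c hcpCluster) bool {
	if len(c.ClusterServiceID) == 0 {
		return false // CS id isn't set yet
	}
	if c.CurrentVersion == c.DesiredVersion {
		return false // nothing to upgrade
	}
	return true
}

func main() {
	fmt.Println(shouldTriggerUpgrade(hcpCluster{DesiredVersion: "4.19.1"}))
	fmt.Println(shouldTriggerUpgrade(hcpCluster{
		ClusterServiceID: "cs-1",
		CurrentVersion:   "4.19.0",
		DesiredVersion:   "4.19.1",
	}))
}
```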

return utils.TrackError(fmt.Errorf("failed to get Cluster: %w", err))
}

if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

Suggested change:
- if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+ if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {

// Search for an existing CS cluster that matches this Azure resource.
// This handles the case where CS creation succeeded but we failed to
// persist the CS ID in Cosmos.
csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
Collaborator

Suggested change:
- csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
+ existingClusterServiceCluster, err := c.findExistingCSCluster(ctx, existingCluster)

return nil
}

func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
Collaborator

Suggested change:
- func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
+ func (c *clusterServiceCreateSyncer) findExistingClusterServiceCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {

Comment on lines +167 to +170
var tenantID string
if subscription.Properties != nil && subscription.Properties.TenantId != nil {
tenantID = *subscription.Properties.TenantId
}
Collaborator

@mbarnes don't we have the tenantId stored somewhere in cosmos already?

Comment on lines +172 to +181
initialProperties := map[string]string{}
if c.provisionShard != "" {
initialProperties[ocm.CSPropertyProvisionShardID] = c.provisionShard
}
if c.noopProvision {
initialProperties[ocm.CSPropertyNoopProvision] = ocm.CSPropertyEnabled
}
if c.noopDeprovision {
initialProperties[ocm.CSPropertyNoopDeprovision] = ocm.CSPropertyEnabled
}
Collaborator

In my opinion, we don't need to carry these and we can remove them

Comment thread internal/ocm/convert.go
Comment on lines 802 to 806
Collaborator

@mbarnes isn't the tenantID stored already? Can we store it? Or are we safe to assume that we'll always have it from the subscription?


csClusterBuilder, csAutoscalerBuilder, err := ocm.BuildCSCluster(
clusterCopy.ID, tenantID, &clusterCopy, initialProperties, nil,
)
Collaborator

I wonder if it would have been less confusing to pass the desired x.y.z version as a parameter as well?

Collaborator

@machi1990 machi1990 left a comment

Did an initial review, left some comments.

Collaborator

@machi1990 machi1990 left a comment

@JakobGray I see that some changes to allow the cluster id to be missing are in here; let's sync those with the changes in #4752 as well

Collaborator

@mbarnes mbarnes left a comment

In order for the frontend and backend images to stay compatible with a +/-1 version skew, this probably needs to be split into multiple pull requests.

Consider if we introduce this as is but the frontend and backend images don't get deployed simultaneously for some reason. We could potentially be in a situation where neither the frontend nor backend pods are making the CS call for cluster creation.

The first pull request should introduce the new backend controller but leave in place the CS call in the frontend. So the new controller will initially be dormant.

Once that's fully deployed, a second pull request can remove the CS call in the frontend, at which point the backend controller will take over.

@openshift-ci

openshift-ci bot commented Apr 16, 2026

@JakobGray: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name            Commit   Details  Required  Rerun command
ci/prow/cspr         c331d21  link     true      /test cspr
ci/prow/images-push  c331d21  link     true      /test images-push

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
