
feat: create clusters service control plane in backend#4821

Draft
JakobGray wants to merge 4 commits into main from jagray/ARO-24824-async-install

Conversation

@JakobGray
Collaborator

What

Create clusters service control plane cluster with backend controller

  • Adds backend controller for clusters service control plane create
  • Creates control plane with computed desired version
  • Creates cluster with cluster service cluster ID initially missing

Why

The clusters service control plane is currently created synchronously during the create flow and uses hard-coded versions derived from the customer's desired version. Moving to an asynchronous approach reduces create time and lets the clusters service deployment be managed in the background. It will also give us time to look up the desired version from Cincinnati.
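The asynchronous flow described here can be sketched roughly as follows. This is a hypothetical, minimal illustration, not the real ARO-HCP types: `Cluster`, `Store`, `ClusterService`, and `SyncOnce` are stand-ins for the actual Cosmos documents, CS client, and backend controller.

```go
package main

import "fmt"

// Cluster is an illustrative stand-in for the Cosmos cluster document.
type Cluster struct {
	Name             string
	ClusterServiceID string
	DesiredVersion   string
}

// Store is a stand-in for the Cosmos document store.
type Store map[string]*Cluster

// ClusterService is a fake clusters service client.
type ClusterService struct{ nextID int }

// Create registers the cluster with the (fake) clusters service and
// returns the new clusters-service ID.
func (cs *ClusterService) Create(c *Cluster) string {
	cs.nextID++
	return fmt.Sprintf("cs-%d", cs.nextID)
}

// SyncOnce is the backend controller step: it only acts on clusters the
// frontend stored without a ClusterServiceID, registers them with the
// clusters service, and persists the returned ID.
func SyncOnce(store Store, cs *ClusterService, name string) error {
	c, ok := store[name]
	if !ok {
		return fmt.Errorf("cluster %q not found", name)
	}
	if len(c.ClusterServiceID) > 0 {
		return nil // already registered; nothing to do
	}
	c.ClusterServiceID = cs.Create(c)
	return nil
}

func main() {
	// The frontend stores the cluster without a ClusterServiceID and returns.
	store := Store{"demo": {Name: "demo", DesiredVersion: "4.19.0"}}
	cs := &ClusterService{}
	_ = SyncOnce(store, cs, "demo") // first pass registers the cluster
	_ = SyncOnce(store, cs, "demo") // second pass is a no-op
	fmt.Println(store["demo"].ClusterServiceID) // prints "cs-1"
}
```

In this sketch the frontend only writes the document with an empty ClusterServiceID; the controller's periodic sync picks it up later and becomes a no-op once the ID is persisted.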

Testing

Special notes for your reviewer

Depends on #4752 to allow missing cluster service cluster ID

Matthew Barnes and others added 4 commits April 8, 2026 22:21
Pass the tenant ID value instead of requiring client request
headers. This function will soon be called from the RP backend
where client request headers are not available.
Just to ensure consistency with what the backend will be using.
The backend won't have access to the X-Ms-Home-Tenant-Id client
request header.
Move the synchronous PostCluster call out of the frontend's ARM PUT
handler and into a new ClusterServiceCreateController. The frontend
now stores the cluster in Cosmos without a ClusterServiceID and returns
immediately.
@openshift-ci

openshift-ci bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: JakobGray
Once this PR has been reviewed and has the lgtm label, please assign mbarnes for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci

openshift-ci bot commented Apr 9, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci

openshift-ci bot commented Apr 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


// NewClusterServiceCreateController creates a controller that registers clusters
// with Cluster Service once their desired control plane version is computed.
func NewClusterServiceCreateController(
Collaborator

@mbarnes mbarnes Apr 10, 2026

Probably want to clarify that this is for cluster creation. I assume nested resource types like node pools will have separate controllers from this one.

Suggested change:
- func NewClusterServiceCreateController(
+ func NewClusterServiceCreateClusterController(

machi1990 added a commit to machi1990/ARO-HCP that referenced this pull request Apr 13, 2026
This also bumps it for 4.19 as it is still around.

This is an interim bump until Azure#4821 is merged, which will allow us to automatically pick the latest.
func (f *Frontend) updateHCPClusterInCosmos(ctx context.Context, writer http.ResponseWriter, request *http.Request, httpStatusCode int, newInternalCluster, oldInternalCluster *api.HCPOpenShiftCluster) error {
logger := utils.LoggerFromContext(ctx)

if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

http.StatusConflict,
arm.CloudErrorCodeConflict,
oldInternalCluster.ID.String(),
"The cluster is still being registered with Cluster Service. Please retry shortly.")
Collaborator

How can this happen, i.e. how can the update be triggered before the cluster reaches a terminal state?

// cluster is returned as-is from Cosmos without CS-only fields.
// TODO remove the header it takes and collapse that to some general error handling.
func (f *Frontend) readInternalClusterFromClusterService(ctx context.Context, oldInternalCluster *api.HCPOpenShiftCluster) (*api.HCPOpenShiftCluster, error) {
if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

Comment on lines +78 to +80
ClusterServiceProvisionShard string
ClusterServiceNoopProvision bool
ClusterServiceNoopDeprovision bool
Collaborator

Are these things we should continue to honor? I think we are way past that "noop" phase now that we have everything wired up.

// operation is initially created without an InternalID. Look it up
// from the cluster document's ClusterServiceID.
internalID := operation.InternalID
if internalID.String() == "" {
Collaborator

len(s) == 0

return utils.TrackError(fmt.Errorf("failed to get cluster: %w", err))
}
internalID = cluster.ServiceProviderProperties.ClusterServiceID
if internalID.String() == "" {
Collaborator

len(s) == 0

ret = append(ret, cluster)
existingCluster, exists := clusterServiceIDToCluster[cluster.ServiceProviderProperties.ClusterServiceID.String()]
csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
if csID == "" {
Collaborator

len(s) == 0

for _, cluster := range allHCPClusters.Items(ctx) {
ret = append(ret, cluster)
existingCluster, exists := clusterServiceIDToCluster[cluster.ServiceProviderProperties.ClusterServiceID.String()]
csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
Collaborator

Suggested change:
- csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
+ clusterServiceID := cluster.ServiceProviderProperties.ClusterServiceID.String()

}

// Skip clusters that don't have a ClusterServiceID yet (CS creation pending).
if cosmosCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
Collaborator

len(s) == 0

throughout
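The `len(s) == 0` and `s == ""` checks are equivalent in Go; the review simply asks for the `len` form consistently throughout. A trivial illustration, with a hypothetical helper name:

```go
package main

import "fmt"

// isUnset reports whether an ID string is empty.
// len(id) == 0 is equivalent to id == ""; the review prefers the len form.
func isUnset(id string) bool {
	return len(id) == 0
}

func main() {
	fmt.Println(isUnset(""), isUnset("cs-123")) // prints "true false"
}
```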

Comment on lines +43 to +45
provisionShard string
noopProvision bool
noopDeprovision bool
Collaborator

I don't think we need these anymore; we are now able to create clusters end to end, and I don't see the merit of continuing to carry this NOOP provisioning/deprovisioning logic.

Collaborator

+1 to dropping this baggage.

// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
// UUID so this is safe for initial version computation before CS creation.
var clusterUUID uuid.UUID
if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

Suggested change:
- if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+ if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {

// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
// UUID so this is safe for initial version computation before CS creation.
var clusterUUID uuid.UUID
if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

let's also do the same here

func (c *triggerControlPlaneUpgradeSyncer) SyncOnce(ctx context.Context, key controllerutils.HCPClusterKey) error {

  • adjusting the logic there to:
  • not trigger the upgrade if the CS id isn't set
  • not trigger the upgrade if CS's version == desired version (not strictly needed, but we can do it to avoid creating a policy for nothing)
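The two guards suggested above could look roughly like this; `hcpCluster` and `shouldTriggerUpgrade` are illustrative names, not the real syncer types:

```go
package main

import "fmt"

// hcpCluster is an illustrative stand-in for the cluster document fields
// the upgrade syncer would consult.
type hcpCluster struct {
	ClusterServiceID string
	CurrentVersion   string
	DesiredVersion   string
}

// shouldTriggerUpgrade returns false when the cluster is not yet registered
// with the clusters service, or when it is already at the desired version
// (avoiding the creation of an upgrade policy for nothing).
func shouldTriggerUpgrade(c hcpCluster) bool {
	if len(c.ClusterServiceID) == 0 {
		return false // CS id isn't set yet
	}
	if c.CurrentVersion == c.DesiredVersion {
		return false // nothing to upgrade
	}
	return true
}

func main() {
	fmt.Println(shouldTriggerUpgrade(hcpCluster{DesiredVersion: "4.19.1"}))
	fmt.Println(shouldTriggerUpgrade(hcpCluster{
		ClusterServiceID: "cs-1",
		CurrentVersion:   "4.19.0",
		DesiredVersion:   "4.19.1",
	}))
}
```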

return utils.TrackError(fmt.Errorf("failed to get Cluster: %w", err))
}

if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
Collaborator

Suggested change:
- if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+ if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {

// Search for an existing CS cluster that matches this Azure resource.
// This handles the case where CS creation succeeded but we failed to
// persist the CS ID in Cosmos.
csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
Collaborator

Suggested change:
- csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
+ existingClusterServiceCluster, err := c.findExistingCSCluster(ctx, existingCluster)

return nil
}

func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
Collaborator

Suggested change:
- func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
+ func (c *clusterServiceCreateSyncer) findExistingClusterServiceCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {

Comment on lines +167 to +170
var tenantID string
if subscription.Properties != nil && subscription.Properties.TenantId != nil {
tenantID = *subscription.Properties.TenantId
}
Collaborator

@mbarnes don't we have the tenantId stored somewhere in cosmos already?

Comment on lines +172 to +181
initialProperties := map[string]string{}
if c.provisionShard != "" {
initialProperties[ocm.CSPropertyProvisionShardID] = c.provisionShard
}
if c.noopProvision {
initialProperties[ocm.CSPropertyNoopProvision] = ocm.CSPropertyEnabled
}
if c.noopDeprovision {
initialProperties[ocm.CSPropertyNoopDeprovision] = ocm.CSPropertyEnabled
}
Collaborator

In my opinion, we don't need to carry these and we can remove them

Comment thread internal/ocm/convert.go
Comment on lines 802 to 806
Collaborator

@mbarnes isn't the tenantID stored already? Can we store it? Or are we safe to assume that we'll always have it from the subscription?


csClusterBuilder, csAutoscalerBuilder, err := ocm.BuildCSCluster(
clusterCopy.ID, tenantID, &clusterCopy, initialProperties, nil,
)
Collaborator

I wonder if it would have been less confusing to pass the desired x.y.z version as a parameter as well?

Collaborator

@machi1990 machi1990 left a comment

Did an initial review, left some comments.

Collaborator

@machi1990 machi1990 left a comment

@JakobGray I see that some changes to allow the cluster id to be missing are in here; let's sync those with the changes in #4752 as well

Collaborator

@mbarnes mbarnes left a comment

In order for the frontend and backend images to stay compatible with a +/-1 version skew, this probably needs to be split into multiple pull requests.

Consider if we introduce this as is but the frontend and backend images don't get deployed simultaneously for some reason. We could potentially be in a situation where neither the frontend nor backend pods are making the CS call for cluster creation.

The first pull request should introduce the new backend controller but leave in place the CS call in the frontend. So the new controller will initially be dormant.

Once that's fully deployed, a second pull request can remove the CS call in the frontend, at which point the backend controller will take over.

@openshift-ci

openshift-ci bot commented Apr 16, 2026

@JakobGray: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name            Commit   Details  Required  Rerun command
ci/prow/cspr         c331d21  link     true      /test cspr
ci/prow/images-push  c331d21  link     true      /test images-push

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
