feat: create clusters service control plane in backend #4821
Conversation
Pass the tenant ID value instead of requiring client request headers. This function will soon be called from the RP backend where client request headers are not available.
Just to ensure consistency with what the backend will be using. The backend won't have access to the X-Ms-Home-Tenant-Id client request header.
Move the synchronous PostCluster call out of the frontend's ARM PUT handler and into a new ClusterServiceCreateController. The frontend now stores the cluster in Cosmos without a ClusterServiceID and returns immediately.
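Sketched in miniature, assuming simplified stand-in types (the real controller works against Cosmos documents and the Cluster Service client), the flow described above looks like:

```go
package main

import "fmt"

// Simplified stand-ins for the real Cosmos document and CS client; all
// names below are illustrative, not the project's actual API.
type hcpCluster struct {
	ClusterServiceID string
	DesiredVersion   string
}

// postToClusterService stands in for the synchronous PostCluster call
// that used to live in the frontend's ARM PUT handler.
func postToClusterService(c *hcpCluster) (string, error) {
	return "cs-" + c.DesiredVersion, nil
}

// reconcile is the controller's core step: register any cluster that
// has no ClusterServiceID yet, then persist the returned ID.
func reconcile(c *hcpCluster) error {
	if c.ClusterServiceID != "" {
		return nil // already registered with Cluster Service
	}
	id, err := postToClusterService(c)
	if err != nil {
		return fmt.Errorf("failed to create CS cluster: %w", err)
	}
	c.ClusterServiceID = id // the real code writes this back to Cosmos
	return nil
}

func main() {
	c := &hcpCluster{DesiredVersion: "4.19.0"}
	if err := reconcile(c); err != nil {
		panic(err)
	}
	fmt.Println(c.ClusterServiceID)
}
```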
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: JakobGray. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval in a comment.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Skipping CI for Draft Pull Request.
```go
// NewClusterServiceCreateController creates a controller that registers clusters
// with Cluster Service once their desired control plane version is computed.
func NewClusterServiceCreateController(
```
Probably want to clarify that this is for cluster creation. I assume nested resource types like node pools will have separate controllers from this one.
```diff
-func NewClusterServiceCreateController(
+func NewClusterServiceCreateClusterController(
```
This also bumps it for 4.19, as it is still around. This is an interim bump until Azure#4821 is merged, which will allow us to automatically pick the latest version.
```go
func (f *Frontend) updateHCPClusterInCosmos(ctx context.Context, writer http.ResponseWriter, request *http.Request, httpStatusCode int, newInternalCluster, oldInternalCluster *api.HCPOpenShiftCluster) error {
	logger := utils.LoggerFromContext(ctx)

	if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
		// …
			http.StatusConflict,
			arm.CloudErrorCodeConflict,
			oldInternalCluster.ID.String(),
			"The cluster is still being registered with Cluster Service. Please retry shortly.")
```
How can this happen, i.e. how can the update be triggered before the cluster reaches a terminal state?
```go
// cluster is returned as-is from Cosmos without CS-only fields.
// TODO remove the header it takes and collapse that to some general error handling.
func (f *Frontend) readInternalClusterFromClusterService(ctx context.Context, oldInternalCluster *api.HCPOpenShiftCluster) (*api.HCPOpenShiftCluster, error) {
	if oldInternalCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
```
```go
	ClusterServiceProvisionShard  string
	ClusterServiceNoopProvision   bool
	ClusterServiceNoopDeprovision bool
```
Are these things we should continue to honor? I think we are way past that "noop" phase now that we have everything wired up.
```go
	// operation is initially created without an InternalID. Look it up
	// from the cluster document's ClusterServiceID.
	internalID := operation.InternalID
	if internalID.String() == "" {
		// …
			return utils.TrackError(fmt.Errorf("failed to get cluster: %w", err))
		}
		internalID = cluster.ServiceProviderProperties.ClusterServiceID
		if internalID.String() == "" {
```
```go
	for _, cluster := range allHCPClusters.Items(ctx) {
		ret = append(ret, cluster)
		existingCluster, exists := clusterServiceIDToCluster[cluster.ServiceProviderProperties.ClusterServiceID.String()]
		csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
		if csID == "" {
```
```diff
-		csID := cluster.ServiceProviderProperties.ClusterServiceID.String()
+		clusterServiceID := cluster.ServiceProviderProperties.ClusterServiceID.String()
```
```go
	}

	// Skip clusters that don't have a ClusterServiceID yet (CS creation pending).
	if cosmosCluster.ServiceProviderProperties.ClusterServiceID.String() == "" {
```
```go
	provisionShard  string
	noopProvision   bool
	noopDeprovision bool
```
I don't think we need these anymore; we can now create clusters end to end, and I don't see the merit of continuing to carry this NOOP provisioning/deprovisioning logic.
+1 to dropping this baggage.
```go
	// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
	// UUID so this is safe for initial version computation before CS creation.
	var clusterUUID uuid.UUID
	if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
```
```diff
-	if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+	if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {
```
```go
	// shared default UUID. Cincinnati's upgrade graph is deterministic regardless of
	// UUID so this is safe for initial version computation before CS creation.
	var clusterUUID uuid.UUID
	if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
```
let's also do the same here, adjusting the logic there to:
- not trigger the upgrade if the CS ID isn't set
- not trigger the upgrade if CS's version == the desired version (not strictly needed, but we can do it to avoid creating a policy for nothing)
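A hedged sketch of the two gating checks proposed above, with simplified stand-in types (not the project's actual API):

```go
package main

import "fmt"

// csState is an illustrative stand-in for what we know about the
// cluster in Cluster Service.
type csState struct {
	ClusterServiceID string
	CurrentVersion   string
}

// shouldCreateUpgradePolicy applies both checks before triggering an upgrade.
func shouldCreateUpgradePolicy(cs csState, desiredVersion string) bool {
	// Don't trigger the upgrade if the CS ID isn't set yet.
	if cs.ClusterServiceID == "" {
		return false
	}
	// Don't trigger if CS already runs the desired version; the policy
	// would be created for nothing.
	if cs.CurrentVersion == desiredVersion {
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldCreateUpgradePolicy(csState{}, "4.19.1"))                                                  // false: no CS ID
	fmt.Println(shouldCreateUpgradePolicy(csState{ClusterServiceID: "abc", CurrentVersion: "4.19.1"}, "4.19.1")) // false: already there
	fmt.Println(shouldCreateUpgradePolicy(csState{ClusterServiceID: "abc", CurrentVersion: "4.19.0"}, "4.19.1")) // true
}
```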
```go
		return utils.TrackError(fmt.Errorf("failed to get Cluster: %w", err))
	}

	if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
```
```diff
-	if existingCluster.ServiceProviderProperties.ClusterServiceID.String() != "" {
+	if len(existingCluster.ServiceProviderProperties.ClusterServiceID.String()) > 0 {
```
```go
	// Search for an existing CS cluster that matches this Azure resource.
	// This handles the case where CS creation succeeded but we failed to
	// persist the CS ID in Cosmos.
	csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
```
```diff
-	csCluster, err := c.findExistingCSCluster(ctx, existingCluster)
+	existingClusterServiceCluster, err := c.findExistingCSCluster(ctx, existingCluster)
```
```go
	return nil
}

func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
```
```diff
-func (c *clusterServiceCreateSyncer) findExistingCSCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
+func (c *clusterServiceCreateSyncer) findExistingClusterServiceCluster(ctx context.Context, cluster *api.HCPOpenShiftCluster) (*arohcpv1alpha1.Cluster, error) {
```
```go
	var tenantID string
	if subscription.Properties != nil && subscription.Properties.TenantId != nil {
		tenantID = *subscription.Properties.TenantId
	}
```
@mbarnes don't we have the tenantId stored somewhere in cosmos already?
```go
	initialProperties := map[string]string{}
	if c.provisionShard != "" {
		initialProperties[ocm.CSPropertyProvisionShardID] = c.provisionShard
	}
	if c.noopProvision {
		initialProperties[ocm.CSPropertyNoopProvision] = ocm.CSPropertyEnabled
	}
	if c.noopDeprovision {
		initialProperties[ocm.CSPropertyNoopDeprovision] = ocm.CSPropertyEnabled
	}
```
In my opinion, we don't need to carry these and we can remove them.
@mbarnes isn't the tenantID stored already? Can we store it? Or are we safe to assume that we'll always have it from the subscription?
```go
	csClusterBuilder, csAutoscalerBuilder, err := ocm.BuildCSCluster(
		clusterCopy.ID, tenantID, &clusterCopy, initialProperties, nil,
	)
```
I wonder if it would have been less confusing to pass the desired x.y.z version as a parameter as well?
machi1990
left a comment
Did an initial review, left some comments.
machi1990
left a comment
@JakobGray I see that some changes to allow the cluster id to be missing are in here; let's sync those with the changes in #4752 as well
In order for the frontend and backend images to stay compatible with a +/-1 version skew, this probably needs to be split into multiple pull requests.
Consider if we introduce this as is but the frontend and backend images don't get deployed simultaneously for some reason. We could potentially be in a situation where neither the frontend nor backend pods are making the CS call for cluster creation.
The first pull request should introduce the new backend controller but leave in place the CS call in the frontend. So the new controller will initially be dormant.
Once that's fully deployed, a second pull request can remove the CS call in the frontend, at which point the backend controller will take over.
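The overlap window in that rollout also stays safe as long as the CS create path checks for an existing CS cluster before posting a new one (which is what the PR's findExistingCSCluster lookup is for): a duplicate attempt collapses into a lookup. An illustrative sketch, using an in-memory map as a stand-in for Cluster Service:

```go
package main

import "fmt"

// In-memory stand-in for Cluster Service, keyed by Azure resource ID.
// Illustrative only; the real code queries CS via findExistingCSCluster.
var csClustersByAzureID = map[string]string{}

// createCSClusterIdempotent looks for an existing CS cluster before
// creating a new one, so a second caller (frontend or backend) during
// the rollout window reuses the first caller's result.
func createCSClusterIdempotent(azureResourceID string) string {
	if id, ok := csClustersByAzureID[azureResourceID]; ok {
		return id
	}
	id := fmt.Sprintf("cs-%d", len(csClustersByAzureID)+1)
	csClustersByAzureID[azureResourceID] = id
	return id
}

func main() {
	first := createCSClusterIdempotent("/subscriptions/s/resourceGroups/rg/demo")
	second := createCSClusterIdempotent("/subscriptions/s/resourceGroups/rg/demo")
	fmt.Println(first == second)
}
```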
@JakobGray: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What
Create the clusters service control plane cluster with a backend controller.
Why
The clusters service control plane is currently created synchronously during the create flow and uses hard-coded versions derived from the customer's desired version. By moving to an asynchronous approach, we reduce create time and have the clusters service deployment managed in the background. This will also allow us time to look up the desired version from Cincinnati.
Testing
Special notes for your reviewer
Depends on #4752 to allow missing cluster service cluster ID