OCPBUGS-85346: Revert 4.22 and 4.23 from C4A to T2A instances to avoid hyperdisk costs #79056
barbacbd wants to merge 1 commit into openshift:main
…d hyperdisk costs

This reverts 4.22 and 4.23 multi-arch GCP jobs from C4A (Axion) instances back to T2A (Tau) instances to avoid the cost increase associated with hyperdisk-balanced storage.

## Root Cause of StatefulSet Failures

C4A instances only support Hyperdisk storage types and do NOT support Persistent Disk types (pd-standard, pd-balanced, pd-ssd). When StatefulSet tests create PVCs using the cluster's default StorageClass (standard-csi with pd-standard), volume attach fails on C4A nodes:

```
googleapi: Error 400: pd-standard disk type cannot be used by c4a-standard-2 machine type., badRequest
```

## The Fix: Revert to T2A

Instead of adding hyperdisk-balanced StorageClass (which increases costs), revert 4.22 and 4.23 to T2A instances which support pd-standard.

### Cost Considerations

- **pd-standard:** ~$0.04/GB/month (T2A compatible)
- **hyperdisk-balanced:** ~$0.10-0.12/GB/month + IOPS charges (C4A required)

For CI jobs with many PVCs (monitoring, logging, registry, StatefulSets), using hyperdisk-balanced across all test runs would significantly increase costs. T2A with pd-standard is more cost-effective.

### Quota Mitigation

While reverting to T2A means 4.21, 4.22, and 4.23 will compete for the same T2A quota, PR openshift#77809 included other mitigations that remain in place:

1. **Zone randomization** - Distributes instances across zones
2. **Interval scheduling (168h)** - Prevents simultaneous execution
3. **Smaller instances** - Uses t2a-standard-2 (not standard-4)
4. **Balanced worker layout** - 2+2 workers instead of 3+2

These mitigations should reduce quota pressure even with multiple releases using T2A.

## Changes Made

**4.22 nightly config (6 jobs):**

- ocp-e2e-gcp-ovn-multi-a-a: c4a-standard-2 → t2a-standard-2
- ocp-e2e-gcp-ovn-multi-x-x-to-a-x: c4a-standard-2 → t2a-standard-2
- ocp-e2e-gcp-ovn-multi-a-a-to-x-a: c4a-standard-2 → t2a-standard-2
- ocp-e2e-upgrade-gcp-ovn-multi-a-a: c4a-standard-2 → t2a-standard-2
- ocp-e2e-gcp-ovn-multi-x-ax: c4a-standard-2 → t2a-standard-2
- ocp-e2e-upgrade-gcp-ovn-multi-x-ax: c4a-standard-2 → t2a-standard-2

**4.23 nightly config (5 jobs):**

- ocp-e2e-gcp-ovn-multi-a-a: c4a-standard-2 → t2a-standard-2
- ocp-e2e-gcp-ovn-multi-x-x-to-a-x: c4a-standard-2 → t2a-standard-2
- ocp-e2e-gcp-ovn-multi-a-a-to-x-a: c4a-standard-2 → t2a-standard-2
- ocp-e2e-upgrade-gcp-ovn-multi-a-a: c4a-standard-2 → t2a-standard-2
- Heterogeneous jobs: c4a-standard-2 → t2a-standard-2

**4.22 upgrade configs (2 files):**

- nightly-4.22-upgrade-from-nightly-4.21: c4a-standard-4 → t2a-standard-4
- nightly-4.22-upgrade-from-stable-4.21: c4a-standard-4 → t2a-standard-4

**4.23 upgrade configs (2 files):**

- nightly-4.23-upgrade-from-nightly-4.22: c4a-standard-2 → t2a-standard-2
- nightly-4.23-upgrade-from-stable-4.22: c4a-standard-2 → t2a-standard-2

Removed `ADDITIONAL_WORKER_DISK_TYPE: hyperdisk-balanced` from all heterogeneous jobs (no longer needed as T2A supports pd-standard).

## Release Distribution After This Change

- **4.21:** T2A standard-2
- **4.22:** T2A standard-2 (reverted from C4A)
- **4.23:** T2A standard-2 (reverted from C4A)
- **5.0:** T2A standard-4

## References

- JIRA: https://redhat.atlassian.net/browse/OCPBUGS-85346
- Failed job: periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-ax
- Original PR openshift#77809: openshift#77809
- GCP C4A disk requirements: https://cloud.google.com/blog/products/compute/first-google-axion-processor-c4a-now-ga-with-titanium-ssd
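The root cause described in this PR (a default `pd-standard` StorageClass meeting C4A nodes) could in principle be caught before a job runs. The following is a minimal Python sketch of such a check, not part of this PR. The compatibility table only encodes what the description states and is deliberately incomplete; the function names are hypothetical, and the `env` dict merely mimics the `COMPUTE_NODE_TYPE` / `CONTROL_PLANE_NODE_TYPE` keys mentioned in the review.

```python
# Disk types each machine family can attach, limited to what this PR states.
# Assumption: this table is illustrative and incomplete, not a GCP reference.
SUPPORTED_DISK_TYPES = {
    "c4a": {"hyperdisk-balanced"},
    "t2a": {"pd-standard"},
}

# Env keys that carry a machine type in the job configs discussed here.
NODE_TYPE_KEYS = ("COMPUTE_NODE_TYPE", "CONTROL_PLANE_NODE_TYPE")


def machine_family(machine_type):
    """Return the family prefix, e.g. 'c4a' for 'c4a-standard-2'."""
    return machine_type.split("-", 1)[0]


def can_attach(machine_type, disk_type):
    """True if the family is known to support the disk type.

    Unknown families are treated as unsupported so they get flagged.
    """
    family = machine_family(machine_type)
    return disk_type in SUPPORTED_DISK_TYPES.get(family, set())


def find_incompatible_nodes(env, pvc_disk_type="pd-standard"):
    """List node-type settings that cannot attach the PVC disk type."""
    problems = []
    for key in NODE_TYPE_KEYS:
        machine_type = env.get(key)
        if machine_type and not can_attach(machine_type, pvc_disk_type):
            problems.append(f"{key}={machine_type} cannot attach {pvc_disk_type}")
    return problems


if __name__ == "__main__":
    before = {"COMPUTE_NODE_TYPE": "c4a-standard-2",
              "CONTROL_PLANE_NODE_TYPE": "c4a-standard-2"}
    after = {"COMPUTE_NODE_TYPE": "t2a-standard-2",
             "CONTROL_PLANE_NODE_TYPE": "t2a-standard-2"}
    print(find_incompatible_nodes(before))
    print(find_incompatible_nodes(after))
```

With the pre-revert values the check reports both node settings; with the reverted T2A values it reports nothing.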
@barbacbd: This pull request references Jira Issue OCPBUGS-85346, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: barbacbd
Walkthrough
This PR updates GCP machine type configurations across six ci-operator multiarch release configurations (versions 4.22 and 4.23). The changes replace …
Changes: GCP Machine Type Migration
Estimated code review effort: 2 (Simple), ~10 minutes
Pre-merge checks: 12 passed
🧹 Nitpick comments (1)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23.yaml (1)
749-760: Track T2A quota pressure after rollout.
Given these lane-wide shifts, it’s worth adding/confirming alerting on pending-job spikes and GCP quota exhaustion for T2A pools.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23.yaml` around lines 749 - 760, This job config switches to T2A instances (COMPUTE_NODE_TYPE and CONTROL_PLANE_NODE_TYPE set to t2a-standard-2 with OCP_ARCH: arm64 and workflow: openshift-e2e-gcp-ovn), so add or reference alerting for T2A quota and pending-job spikes: create/update Prometheus alert rules tied to these identifiers (use labels/annotations derived from COMPUTE_NODE_TYPE, CONTROL_PLANE_NODE_TYPE, OCP_ARCH, workflow and the job “as” value) that fire on sustained increases in pending Kubernetes jobs or GCP CPU/quota exhaustion for t2a-standard-2 (also include MIGRATION_CP_MACHINE_TYPE / MIGRATION_INFRA_MACHINE_TYPE where relevant); attach the alerts to the job via metadata/annotations or include an alerts reference so on rollout we get notified of pending-job spikes and T2A quota exhaustion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: eb5f3be0-e742-4f7a-9125-38a7baff8ee7
📒 Files selected for processing (6)
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22-upgrade-from-nightly-4.21.yaml
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22-upgrade-from-stable-4.21.yaml
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22.yaml
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23-upgrade-from-nightly-4.22.yaml
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23-upgrade-from-stable-4.22.yaml
- ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23.yaml
[REHEARSALNOTIFIER]
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.23-ocp-e2e-gcp-ovn-multi-a-a-to-x-a
@gnufied: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
@barbacbd: The following test failed, say
/payload periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-ax
OpenShift CI Configuration: Revert GCP Multi-Arch Jobs from C4A to T2A Instances
This PR reverts GCP multi-architecture CI job configurations for OpenShift 4.22 and 4.23 from C4A (Axion) instances back to T2A (Tau) instances to reduce storage costs.
Problem Being Addressed
C4A instances require hyperdisk-balanced storage (approximately $0.10–0.12/GB/month plus IOPS costs), while T2A instances support cheaper pd-standard storage (approximately $0.04/GB/month). The C4A instances were incompatible with standard persistent volume storage, causing PVC attachment failures in StatefulSet tests. This change restores compatibility with pd-standard storage while avoiding the higher cost of hyperdisk infrastructure.
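To make the per-GB comparison above concrete, here is a minimal Python sketch. The rates are the approximate figures quoted in this PR; the function names and the 1,000 GB example volume are hypothetical. IOPS charges for hyperdisk-balanced are deliberately left out, so the hyperdisk figures are a lower bound.

```python
# Approximate per-GB monthly rates quoted in this PR (USD).
PD_STANDARD_PER_GB = 0.04
HYPERDISK_BALANCED_PER_GB = (0.10, 0.12)  # (low, high); IOPS charges NOT included


def monthly_capacity_cost(size_gb, rate_per_gb):
    """Capacity-only monthly cost for size_gb of provisioned storage."""
    return size_gb * rate_per_gb


def hyperdisk_to_pd_ratio():
    """Return (low, high) capacity-cost multipliers of hyperdisk-balanced over pd-standard."""
    low, high = HYPERDISK_BALANCED_PER_GB
    return low / PD_STANDARD_PER_GB, high / PD_STANDARD_PER_GB


if __name__ == "__main__":
    size_gb = 1000  # hypothetical total PVC capacity, for illustration only
    pd_cost = monthly_capacity_cost(size_gb, PD_STANDARD_PER_GB)
    hd_low, hd_high = (monthly_capacity_cost(size_gb, rate)
                       for rate in HYPERDISK_BALANCED_PER_GB)
    ratio_low, ratio_high = hyperdisk_to_pd_ratio()
    print(f"pd-standard:        ${pd_cost:.2f}/month")
    print(f"hyperdisk-balanced: ${hd_low:.2f} to ${hd_high:.2f}/month (before IOPS)")
    print(f"multiplier:         {ratio_low:.1f}x to {ratio_high:.1f}x")
```

On capacity alone, the quoted rates put hyperdisk-balanced at roughly 2.5 to 3 times the cost of pd-standard.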
Configuration Changes
Six CI configuration files in `ci-operator/config/openshift/multiarch/` were updated:
- `openshift-multiarch-main__nightly-4.22.yaml`: Updated six jobs from `c4a-standard-2` to `t2a-standard-2` for compute and control plane nodes; removed `ADDITIONAL_WORKER_DISK_TYPE: hyperdisk-balanced` from heterogeneous worker configuration.
- `openshift-multiarch-main__nightly-4.22-upgrade-from-nightly-4.21.yaml`: Changed additional worker VM type from `c4a-standard-4` to `t2a-standard-4`.
- `openshift-multiarch-main__nightly-4.22-upgrade-from-stable-4.21.yaml`: Changed additional worker VM type from `c4a-standard-4` to `t2a-standard-4` and removed `ADDITIONAL_WORKER_DISK_TYPE`.
- `openshift-multiarch-main__nightly-4.23.yaml`: Updated five jobs from `c4a-standard-2` to `t2a-standard-2`; removed `ADDITIONAL_WORKER_DISK_TYPE: hyperdisk-balanced` from heterogeneous configurations.
- `openshift-multiarch-main__nightly-4.23-upgrade-from-nightly-4.22.yaml` and `openshift-multiarch-main__nightly-4.23-upgrade-from-stable-4.22.yaml`: Changed additional worker VM types from `c4a-standard-2` to `t2a-standard-2` and removed the hyperdisk-balanced disk type setting.
Affected Jobs
Resource Allocation Impact
This change consolidates versions 4.21, 4.22, and 4.23 to use T2A standard-2 instances, while 5.0 continues with T2A standard-4. Quota management relies on existing mitigations including zone randomization, 168-hour scheduling intervals, and balanced worker distribution (2+2).