Skip to content

ARO-HCP: Introduce slot manager#79042

Draft
roivaz wants to merge 4 commits intoopenshift:mainfrom
roivaz:introduce-slot-manager
Draft

ARO-HCP: Introduce slot manager#79042
roivaz wants to merge 4 commits intoopenshift:mainfrom
roivaz:introduce-slot-manager

Conversation

@roivaz
Copy link
Copy Markdown
Contributor

@roivaz roivaz commented May 8, 2026

https://redhat.atlassian.net/browse/ARO-25833

Summary

This PR introduces a slot manager for ARO-HCP E2E testing workflows, moving lease acquisition from static job configuration to runtime pipeline steps. This enables more flexible slot/environment selection while maintaining compatibility with existing job definitions.

Key Changes

Lease Management Refactoring

  • New dedicated lease steps: Introduces aro-hcp-lease-acquire and aro-hcp-lease-release steps that handle Boskos resource allocation at runtime rather than through static job configuration
  • Common utilities: Adds shared lease helper functions (aro-hcp-lease-common-commands.sh) providing:
    • Slot/location environment export management across pipeline steps
    • Slot-manager execution via either built-in binary or dynamic go run from ARO-HCP repository (controlled by optional ARO_HCP_SLOT_MANAGER_GIT_REF override)
    • Graceful handling of cross-step environment propagation via SHARED_DIR

Workflow Integration

  • Updates aro-hcp-e2e-workflow.yaml and aro-hcp-local-e2e-workflow.yaml to include lease acquire in pre phase and lease release in post phase
  • Removes declarative leases blocks from E2E job configurations (*-e2e-parallel* jobs in main/periodic/e2e variants) while keeping workflow references intact
  • Consolidates e2e-parallel job into single configuration running against ci01 environment instead of maintaining separate prow and ci01 job variants

Resource Pool Transition

  • Replaces quota-slice based resources (aro-hcp-int-quota-slice, aro-hcp-prod-quota-slice, aro-hcp-stg-quota-slice) with slot-based resource pools:
    • aro-hcp-dev-westus3-slot (15 slots)
    • aro-hcp-int-uksouth-slot (1 slot)
    • aro-hcp-prod-uksouth-slot (10 slots)
    • aro-hcp-stg-uksouth-slot (1 slot)
  • Updates Boskos generation script to populate new slot resource definitions

Cross-Subscription Support

  • Refactors gather-snapshot step to support cross-subscription operations:
    • Uses centralized infra-global-subscription-id from cluster profile instead of environment-specific Vault paths
    • Removes environment-specific subscription lookup logic, enabling gather operations across infrastructure subscriptions
  • Clarifies documentation in gather steps that they require "dev cloud" infrastructure access

Test Script Updates

  • Test execution scripts (aro-hcp-test-local-commands.sh, aro-hcp-test-persistent-commands.sh, aro-hcp-write-config-commands.sh) now source lease exports rather than reading subscription/location from static files
  • Adds MSI mock credential validation to provision environment step
  • Centralizes location/slot selection through lease acquisition step rather than parameter overrides

Configuration Changes

  • Removes LOCATION and MULTISTAGE_PARAM_OVERRIDE_LOCATION parameters from write-config step
  • Adds lease-related environment variables and documentation to new lease acquisition/release steps
  • Adds MSI mock resource lease (aro-hcp-msi-mock-cs-sp-dev) to provision environment step

Impact

The refactoring shifts ARO-HCP E2E pipeline control from static job-level lease declarations to dynamic runtime slot management, enabling more flexible environment selection while simplifying job configuration. All changes remain localized to step registry and configuration, maintaining the existing job contracts for downstream consumers.

Note: The PR temporarily references slot-manager from an unmerged ARO-HCP repository commit via the ARO_HCP_SLOT_MANAGER_GIT_REF override capability, with fallback to the built-in binary.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 8, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

Warning

Rate limit exceeded

@roivaz has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 4 minutes and 19 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 26514241-3af0-41f0-a36d-3b2744d0b5ac

📥 Commits

Reviewing files that changed from the base of the PR and between 466f622 and 6e4fc42.

📒 Files selected for processing (25)
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__e2e.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic.yaml
  • ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml
  • ci-operator/step-registry/aro-hcp/lease/acquire/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml
  • ci-operator/step-registry/aro-hcp/lease/common/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-ref.yaml
  • ci-operator/step-registry/aro-hcp/lease/release/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.yaml
  • ci-operator/step-registry/aro-hcp/local-e2e/aro-hcp-local-e2e-workflow.yaml
  • ci-operator/step-registry/aro-hcp/provision/environment/aro-hcp-provision-environment-commands.sh
  • ci-operator/step-registry/aro-hcp/provision/environment/aro-hcp-provision-environment-ref.yaml
  • ci-operator/step-registry/aro-hcp/test/local/aro-hcp-test-local-commands.sh
  • ci-operator/step-registry/aro-hcp/test/persistent/aro-hcp-test-persistent-commands.sh
  • ci-operator/step-registry/aro-hcp/write-config/aro-hcp-write-config-commands.sh
  • ci-operator/step-registry/aro-hcp/write-config/aro-hcp-write-config-ref.yaml
  • core-services/prow/02_config/_boskos.yaml
  • core-services/prow/02_config/generate-boskos.py

Walkthrough

The PR implements an explicit lease acquisition and release workflow for ARO-HCP E2E tests by introducing dedicated lease management steps, transitioning from inline lease declarations to pre/post workflow execution, and updating Boskos to use slot-based resources instead of quota-slice resources.

Changes

Lease Management & Resource Slot Migration

Layer / File(s) Summary
Lease Infrastructure Foundation
ci-operator/step-registry/aro-hcp/lease/common/*
Introduces shared lease command utilities with functions for env file handling, slot-manager checkout, and execution with support for override Git refs.
Lease Acquire/Release Steps
ci-operator/step-registry/aro-hcp/lease/acquire/*, ci-operator/step-registry/aro-hcp/lease/release/*
Creates new step definitions for acquiring and releasing runtime Boskos leases, including command scripts, resource requests, environment variables, and OWNERS files.
Workflow Integration
ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml, ci-operator/step-registry/aro-hcp/local-e2e/aro-hcp-local-e2e-workflow.yaml
Adds aro-hcp-lease-acquire to workflow pre phase and aro-hcp-lease-release to post-execution steps; removes inline leases declarations.
Subscription & Environment Refactoring
ci-operator/step-registry/aro-hcp/gather/snapshot/*, ci-operator/step-registry/aro-hcp/write-config/aro-hcp-write-config-ref.yaml, ci-operator/step-registry/aro-hcp/test/*
Shifts from environment-specific subscriptions (infra-${ARO_HCP_DEPLOY_ENV}-subscription-id) to global subscription (infra-global-subscription-id); removes MULTISTAGE_PARAM_OVERRIDE_LOCATION in favor of leased slot sourcing.
Boskos Resource Pool Transition
core-services/prow/02_config/_boskos.yaml, core-services/prow/02_config/generate-boskos.py
Introduces new slot-based resource types (aro-hcp-dev-westus3-slot, aro-hcp-int/prod/stg-uksouth-slot); removes deprecated quota-slice resources.
Lease Configuration in Job Specs
ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main.yaml, ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__e2e.yaml, ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic.yaml
Removes inline leases blocks from parallel e2e test jobs; updates e2e-parallel env to use ci01 instead of prow deployment environment.
Provisioning & Step Updates
ci-operator/step-registry/aro-hcp/provision/environment/*
Adds sourcing of lease env exports; introduces MSI mock credential handling via LEASED_MSI_MOCK_SP lease; updates config override generation with MSI mock and hcp recovery image fields.
Gather Step Documentation
ci-operator/step-registry/aro-hcp/gather/observability/aro-hcp-gather-observability-ref.yaml, ci-operator/step-registry/aro-hcp/gather/provision-failure/aro-hcp-gather-provision-failure-ref.yaml
Refines environment variable documentation to clarify "dev cloud" only scope and Vault subscription selection semantics.

Sequence Diagram

sequenceDiagram
    participant Prow
    participant LeaseAcquire as Lease Acquire Step
    participant SlotManager as Slot Manager
    participant Boskos
    participant Provision as Provision/Test Steps
    participant LeaseRelease as Lease Release Step

    Prow->>LeaseAcquire: run (pre-phase)
    LeaseAcquire->>SlotManager: run_slot_manager acquire
    SlotManager->>Boskos: request aro-hcp-{env}-{location}-slot
    Boskos-->>SlotManager: allocate slot resource
    SlotManager->>LeaseAcquire: export LOCATION, CUSTOMER_SUBSCRIPTION
    LeaseAcquire->>LeaseAcquire: write aro-hcp-slot-env.sh to SHARED_DIR
    
    Prow->>Provision: run (main workflow)
    Provision->>Provision: source aro-hcp-slot-env.sh
    Provision->>Provision: use LOCATION, CUSTOMER_SUBSCRIPTION from lease
    Provision->>Provision: provision infrastructure & run tests
    
    Prow->>LeaseRelease: run (post-phase)
    LeaseRelease->>SlotManager: run_slot_manager release
    SlotManager->>Boskos: release slot resource
    Boskos-->>SlotManager: mark resource free
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'ARO-HCP: Introduce slot manager' accurately reflects the primary change: introducing a new slot manager system for ARO-HCP lease acquisition/release.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: roivaz
Once this PR has been reviewed and has the lgtm label, please assign hector-vido for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label May 8, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.yaml (1)

1-20: ⚡ Quick win

Add an explicit timeout for the post release step.

Without a timeout, a stuck release command can prolong post execution unnecessarily. A small explicit cap makes failure mode clearer.

Proposed diff
 ref:
   as: aro-hcp-lease-release
   from: aro-hcp-e2e-tests
   commands: aro-hcp-lease-release-commands.sh
   resources:
     requests:
       cpu: 10m
       memory: 100Mi
+  timeout: 5m
   env:
     - name: ARO_HCP_SLOT_MANAGER_GIT_REF
       default: ""
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.yaml`
around lines 1 - 20, Add an explicit timeout key to the aro-hcp-lease-release
step definition so the post-release command cannot hang indefinitely; modify the
ref block for "as: aro-hcp-lease-release" (the step that runs
"aro-hcp-lease-release-commands.sh") to include a sensible timeout (e.g.
"timeout: 10m" or another small cap appropriate for the release task) so stuck
releases fail fast and clearly.
ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml (1)

36-38: ⚡ Quick win

Use a non-empty default for ARO_HCP_DEPLOY_ENV.

At Line 37, defaulting to "" increases the chance of opaque runtime failures when this env is not explicitly set. A safe fallback like prow is lower-risk.

Proposed diff
     - name: ARO_HCP_DEPLOY_ENV
-      default: ""
+      default: "prow"
       documentation: Config environment name (prow, ci01, int, stg, prod).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml`
around lines 36 - 38, The ARO_HCP_DEPLOY_ENV variable is defaulting to an empty
string which risks runtime failures; update the variable default for
ARO_HCP_DEPLOY_ENV to a safe fallback value (e.g., "prow") instead of "" and
ensure the documentation string remains accurate (Config environment name (prow,
ci01, int, stg, prod)).
ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml (1)

11-13: ⚡ Quick win

Move lease release to the end of post for safer sequencing.

At Line 11, releasing before gather steps can make future post-step dependencies on lease context fragile. Keeping release last is safer and matches typical cleanup ordering.

Proposed diff
     post:
-      - ref: aro-hcp-lease-release
       - ref: aro-hcp-gather-test-visualization
       - ref: aro-hcp-gather-custom-link-tools
+      - ref: aro-hcp-lease-release
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml` around lines
11 - 13, The post-step currently lists aro-hcp-lease-release before gather steps
(aro-hcp-gather-test-visualization and aro-hcp-gather-custom-link-tools); move
the aro-hcp-lease-release ref to the end of the post sequence so both gather
steps run while the lease is still held and release happens last to ensure safe
cleanup ordering.
ci-operator/step-registry/aro-hcp/gather/snapshot/aro-hcp-gather-snapshot-commands.sh (1)

21-21: 💤 Low value

Consider suppressing az account set output for consistency.

az login on Line 20 uses --output none, but az account set on Line 21 does not. While not harmful, the output can be noisy in CI logs.

🔧 Proposed fix
-az account set --subscription "${GLOBAL_INFRA_SUBSCRIPTION_ID}"
+az account set --subscription "${GLOBAL_INFRA_SUBSCRIPTION_ID}" --output none
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/aro-hcp/gather/snapshot/aro-hcp-gather-snapshot-commands.sh`
at line 21, The az account set command currently prints output and should be
made consistent with az login by suppressing output; update the az account set
invocation (the line with az account set --subscription
"${GLOBAL_INFRA_SUBSCRIPTION_ID}") to include the same output suppression flag
(e.g., --output none) so CI logs remain quiet and consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml`:
- Around line 39-46: The slot-manager override currently accepts arbitrary git
refs; update prepare_slot_manager_checkout() in aro-hcp-lease-common-commands.sh
to validate ARO_HCP_SLOT_MANAGER_GIT_REF before using it: ensure the variable is
non-empty and matches exactly a 40-character hexadecimal SHA (/[0-9a-f]{40}/i);
if it fails validation, print a clear error and exit non-zero, and only proceed
to git fetch and go run when the ref passes this SHA check to guarantee
immutable commit usage for slot-manager.

In
`@ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-commands.sh`:
- Around line 29-40: The prepare function
(aro_hcp_lease::prepare_slot_manager_checkout) currently accepts any git ref and
fetches it directly, which allows mutable refs; change it to require and
validate an immutable 40-character commit SHA (hex) for git_ref before
performing git operations, returning an error if validation fails; apply the
same SHA-only check to the sibling checkout logic in the other block covering
lines 61-78 so CI only fetches and checks out immutable commit SHAs.

---

Nitpick comments:
In `@ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml`:
- Around line 11-13: The post-step currently lists aro-hcp-lease-release before
gather steps (aro-hcp-gather-test-visualization and
aro-hcp-gather-custom-link-tools); move the aro-hcp-lease-release ref to the end
of the post sequence so both gather steps run while the lease is still held and
release happens last to ensure safe cleanup ordering.

In
`@ci-operator/step-registry/aro-hcp/gather/snapshot/aro-hcp-gather-snapshot-commands.sh`:
- Line 21: The az account set command currently prints output and should be made
consistent with az login by suppressing output; update the az account set
invocation (the line with az account set --subscription
"${GLOBAL_INFRA_SUBSCRIPTION_ID}") to include the same output suppression flag
(e.g., --output none) so CI logs remain quiet and consistent.

In
`@ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml`:
- Around line 36-38: The ARO_HCP_DEPLOY_ENV variable is defaulting to an empty
string which risks runtime failures; update the variable default for
ARO_HCP_DEPLOY_ENV to a safe fallback value (e.g., "prow") instead of "" and
ensure the documentation string remains accurate (Config environment name (prow,
ci01, int, stg, prod)).

In
`@ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.yaml`:
- Around line 1-20: Add an explicit timeout key to the aro-hcp-lease-release
step definition so the post-release command cannot hang indefinitely; modify the
ref block for "as: aro-hcp-lease-release" (the step that runs
"aro-hcp-lease-release-commands.sh") to include a sensible timeout (e.g.
"timeout: 10m" or another small cap appropriate for the release task) so stuck
releases fail fast and clearly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1da56a63-9905-41e4-a3f1-3b61e2d29f57

📥 Commits

Reviewing files that changed from the base of the PR and between 6dbcfb2 and 466f622.

⛔ Files ignored due to path filters (1)
  • ci-operator/jobs/Azure/ARO-HCP/Azure-ARO-HCP-main-presubmits.yaml is excluded by !ci-operator/jobs/**
📒 Files selected for processing (29)
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__e2e.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic.yaml
  • ci-operator/step-registry/aro-hcp/e2e/aro-hcp-e2e-workflow.yaml
  • ci-operator/step-registry/aro-hcp/gather/observability/aro-hcp-gather-observability-ref.yaml
  • ci-operator/step-registry/aro-hcp/gather/provision-failure/aro-hcp-gather-provision-failure-ref.yaml
  • ci-operator/step-registry/aro-hcp/gather/snapshot/aro-hcp-gather-snapshot-commands.sh
  • ci-operator/step-registry/aro-hcp/gather/snapshot/aro-hcp-gather-snapshot-ref.yaml
  • ci-operator/step-registry/aro-hcp/lease/acquire/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml
  • ci-operator/step-registry/aro-hcp/lease/common/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-ref.yaml
  • ci-operator/step-registry/aro-hcp/lease/release/OWNERS
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-commands.sh
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-ref.yaml
  • ci-operator/step-registry/aro-hcp/local-e2e/aro-hcp-local-e2e-workflow.yaml
  • ci-operator/step-registry/aro-hcp/provision/environment/aro-hcp-provision-environment-commands.sh
  • ci-operator/step-registry/aro-hcp/provision/environment/aro-hcp-provision-environment-ref.yaml
  • ci-operator/step-registry/aro-hcp/test/local/aro-hcp-test-local-commands.sh
  • ci-operator/step-registry/aro-hcp/test/persistent/aro-hcp-test-persistent-commands.sh
  • ci-operator/step-registry/aro-hcp/write-config/aro-hcp-write-config-commands.sh
  • ci-operator/step-registry/aro-hcp/write-config/aro-hcp-write-config-ref.yaml
  • core-services/prow/02_config/_boskos.yaml
  • core-services/prow/02_config/generate-boskos.py
💤 Files with no reviewable changes (3)
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__e2e.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic.yaml
  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main.yaml

Comment on lines +39 to +46
- name: ARO_HCP_SLOT_MANAGER_GIT_REF
default: ""
documentation: |-
Temporary opt-in override for release-side rehearsals. When set,
the lease step fetches this ref from github.com/Azure/ARO-HCP and
runs `slot-manager` via `go run` instead of the built binary.
Normal ARO-HCP PR jobs should leave this unset because the
`aro-hcp-e2e-tests` image is already built from the PR source.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect how the override is implemented and whether ref validation/trust gating exists.
fd -i 'aro-hcp-lease-common-commands.sh' ci-operator/step-registry/aro-hcp/lease --exec sed -n '1,260p' {}

# Show all related call sites and git/go-run usage.
rg -n -C3 'ARO_HCP_SLOT_MANAGER_GIT_REF|go run|git (clone|fetch|checkout)|github.com/Azure/ARO-HCP' ci-operator/step-registry/aro-hcp/lease

# Expected: explicit check that ref is a full immutable commit SHA and/or trusted-job gating.

Repository: openshift/release

Length of output: 8874


🏁 Script executed:

# Search for where ARO_HCP_SLOT_MANAGER_GIT_REF is defined and used in prow/rehearsal configs
rg -n 'ARO_HCP_SLOT_MANAGER_GIT_REF' --type yaml --type yml

# Check for any validation logic or constraints
rg -n 'sha|commit|ref.*valid|trusted' ci-operator/step-registry/aro-hcp/lease/common --type bash -A2 -B2

# Look for rehearsal/prow job definitions for aro-hcp-lease
fd -i 'rehearse|prow' . --type f | head -20

Repository: openshift/release

Length of output: 1462


🏁 Script executed:

# Search for all uses of ARO_HCP_SLOT_MANAGER_GIT_REF across the repo
rg -n 'ARO_HCP_SLOT_MANAGER_GIT_REF' .

# Check for any validation logic for git refs in the codebase
rg -n 'git.*ref|commit.*sha|ref.*valid' ci-operator/step-registry/aro-hcp/lease/common -A2 -B2

# Check the acquire step file itself for any validation
cat -n ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml

# Check if there's a commands file for acquire step
find ci-operator/step-registry/aro-hcp/lease/acquire -name '*.sh'

Repository: openshift/release

Length of output: 4640


🏁 Script executed:

# Check the acquire commands file
cat -n ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-commands.sh

# Check release commands file for comparison
cat -n ci-operator/step-registry/aro-hcp/lease/release/aro-hcp-lease-release-commands.sh

# Look for any validation patterns in the entire aro-hcp directory
rg -n 'ref.*=~|sha.*=~|^\s*if.*ref|validate.*ref' ci-operator/step-registry/aro-hcp/ -A2

Repository: openshift/release

Length of output: 727


🏁 Script executed:

# Look for prow job definitions using aro-hcp-lease steps
rg -n 'aro-hcp-lease-acquire|aro-hcp-lease-release' ci-operator/jobs/ | head -20

# Check if there's any environment variable validation in prow job definitions
rg -n 'param|env|ARO_HCP' ci-operator/jobs/ | grep -i aro-hcp | head -30

# Look for rehearsal-specific config that might restrict variable setting
find . -name '*rehearse*' -o -name '*rehearsal*' | grep -E '\.(yaml|yml|sh)$' | head -10

Repository: openshift/release

Length of output: 3709


🏁 Script executed:

# Check if any job definitions actually set ARO_HCP_SLOT_MANAGER_GIT_REF
rg -n 'ARO_HCP_SLOT_MANAGER_GIT_REF' ci-operator/ -B5 -A5

# Look at the prow rehearsal configuration to understand variable access control
cat -n clusters/core-ci/assets/pj-rehearse.yaml | head -100

# Check if there's documentation on variable validation in prow configs
rg -n 'rehearse|param.*valid|env.*valid' clusters/core-ci/ | head -20

Repository: openshift/release

Length of output: 8485


Enforce immutable commit SHA validation for slot-manager override.

The ARO_HCP_SLOT_MANAGER_GIT_REF parameter allows arbitrary git references (branches, tags, or annotated commits) to be passed to git fetch without validation. In aro-hcp-lease-common-commands.sh line 39, the ref is fetched as-is and then executed via go run, creating non-reproducible behavior and weakening supply-chain safety. Add validation to the prepare_slot_manager_checkout() function to accept only full 40-character commit SHAs, ensuring immutable and auditable builds during rehearsals.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/aro-hcp/lease/acquire/aro-hcp-lease-acquire-ref.yaml`
around lines 39 - 46, The slot-manager override currently accepts arbitrary git
refs; update prepare_slot_manager_checkout() in aro-hcp-lease-common-commands.sh
to validate ARO_HCP_SLOT_MANAGER_GIT_REF before using it: ensure the variable is
non-empty and matches exactly a 40-character hexadecimal SHA (/[0-9a-f]{40}/i);
if it fails validation, print a clear error and exit non-zero, and only proceed
to git fetch and go run when the ref passes this SHA check to guarantee
immutable commit usage for slot-manager.

Comment on lines +29 to +40
function aro_hcp_lease::prepare_slot_manager_checkout() {
local git_ref temp_dir repo_dir
git_ref="$1"
temp_dir="$2"
repo_dir="${temp_dir}/ARO-HCP"

printf 'Using slot-manager override ref %s from %s\n' "${git_ref}" "${ARO_HCP_SLOT_MANAGER_DEFAULT_GIT_REMOTE}" >&2

git init --quiet "${repo_dir}"
git -C "${repo_dir}" remote add origin "${ARO_HCP_SLOT_MANAGER_DEFAULT_GIT_REMOTE}"
git -C "${repo_dir}" fetch --depth=1 origin "${git_ref}"
git -C "${repo_dir}" checkout --detach --quiet FETCH_HEAD
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require immutable commit SHA for slot-manager override refs.

Line 39 fetches a user-provided ref and Line 77 executes it. Allowing mutable refs (branches/tags) introduces non-reproducible and avoidable supply-chain risk in CI.

Suggested hardening
 function aro_hcp_lease::prepare_slot_manager_checkout() {
     local git_ref temp_dir repo_dir
     git_ref="$1"
     temp_dir="$2"
     repo_dir="${temp_dir}/ARO-HCP"
 
+    if [[ ! "${git_ref}" =~ ^[0-9a-fA-F]{40}$ ]]; then
+        printf 'ARO_HCP_SLOT_MANAGER_GIT_REF must be a full 40-char commit SHA: %s\n' "${git_ref}" >&2
+        return 1
+    fi
+
     printf 'Using slot-manager override ref %s from %s\n' "${git_ref}" "${ARO_HCP_SLOT_MANAGER_DEFAULT_GIT_REMOTE}" >&2
 
     git init --quiet "${repo_dir}"
     git -C "${repo_dir}" remote add origin "${ARO_HCP_SLOT_MANAGER_DEFAULT_GIT_REMOTE}"
     git -C "${repo_dir}" fetch --depth=1 origin "${git_ref}"
     git -C "${repo_dir}" checkout --detach --quiet FETCH_HEAD

Also applies to: 61-78

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/aro-hcp/lease/common/aro-hcp-lease-common-commands.sh`
around lines 29 - 40, The prepare function
(aro_hcp_lease::prepare_slot_manager_checkout) currently accepts any git ref and
fetches it directly, which allows mutable refs; change it to require and
validate an immutable 40-character commit SHA (hex) for git_ref before
performing git operations, returning an error if validation fails; apply the
same SHA-only check to the sibling checkout logic in the other block covering
lines 61-78 so CI only fetches and checks out immutable commit SHAs.

@roivaz roivaz force-pushed the introduce-slot-manager branch 3 times, most recently from b3bb933 to fee1f37 Compare May 8, 2026 13:37
@openshift-ci openshift-ci Bot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label May 8, 2026
@roivaz roivaz force-pushed the introduce-slot-manager branch from fee1f37 to 30376eb Compare May 8, 2026 13:41
@roivaz roivaz force-pushed the introduce-slot-manager branch from 30376eb to 6e4fc42 Compare May 8, 2026 13:43
@openshift-merge-bot
Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@roivaz: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-Azure-ARO-HCP-main-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-integration-e2e-parallel N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-prod-e2e-parallel N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-prod-e2e-parallel-ocp-nightly N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-stage-e2e-parallel N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-stage-e2e-parallel-ocp-nightly N/A periodic Ci-operator config changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@roivaz: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-Azure-ARO-HCP-main-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-integration-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-prod-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-fast Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-nightly Azure/ARO-HCP presubmit Ci-operator config changed
pull-ci-Azure-ARO-HCP-main-stage-e2e-parallel-ocp-stable Azure/ARO-HCP presubmit Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-stage-e2e-parallel-ocp-nightly N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-integration-e2e-parallel N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-prod-e2e-parallel N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-prod-e2e-parallel-ocp-nightly N/A periodic Ci-operator config changed
periodic-ci-Azure-ARO-HCP-main-periodic-stage-e2e-parallel N/A periodic Ci-operator config changed
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant