Acquire install lease when provisioning a cluster#76238

Open
danilo-gemoli wants to merge 1 commit into openshift:main from
danilo-gemoli:feat/step-registry/acquire-install-lease

Conversation

@danilo-gemoli
Contributor

@danilo-gemoli danilo-gemoli commented Mar 13, 2026

This PR tries to mitigate the rate-limiting issues we are facing on several cloud accounts, Azure being the most affected one.
The core idea is to limit the number of jobs, per cluster profile (and hence per cloud account), that are allowed to provision a cluster at the same time.

We achieve that by acquiring an install lease (see #76230) from a small pool and holding it for only about 20m.
During the first 20m, openshift-install makes a lot of requests to the cloud provider, increasing the odds of being rate-limited, particularly during CI rush hours.

There is a lot going on in this script, but reviewing it is much easier with this mental model:

source "$LEASE_PROXY_CLIENT_SH"

function acquire_install_lease_atomic { ... }
function release_install_lease_atomic { ... }
function release_install_lease_delayed_atomic { ... }
function release_and_acquire_install_lease_atomic { ... }

trap 'release_install_lease_atomic' EXIT TERM INT

max=5
tries=1
ret=4
release_install_lease_pid=''
while [ $ret -eq 4 ] && [ $tries -le $max ]
do
  echo "Install attempt $tries of $max"

  if [ $tries -gt 1 ]; then
    if [[ -n "$release_install_lease_pid" ]] && ps -p "$release_install_lease_pid"; then
      kill "$release_install_lease_pid"
    fi
    release_install_lease_pid=''
    release_and_acquire_install_lease_atomic
  else
    acquire_install_lease_atomic
  fi

  release_install_lease_delayed_atomic &
  release_install_lease_pid=$!

  openshift-install create cluster
  ret="$?"

  echo "Installer exited with code $ret"
  tries=$((tries+1))
done

ipi-install-install makes several attempts to create a cluster. With regard to install lease acquisition, the execution flow proceeds as follows:

  1. The nth iteration starts.
  2. If a lease is already being held, release it and acquire a new one (release_and_acquire_install_lease_atomic).
    2a. Otherwise, acquire one (acquire_install_lease_atomic).
  3. Start a background process that releases the just-acquired lease after 20m (release_install_lease_delayed_atomic &).
  4. Create a cluster (openshift-install create cluster).
  5. Go to (1) if (4) fails.
  6. Otherwise, exit the loop and complete the execution.
  7. Release any pending lease (trap 'release_install_lease_atomic' EXIT TERM INT).
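Step (3), the delayed background release, can be sketched as below. The helper bodies and the 1-second delay are illustrative stand-ins, not the real lease-proxy implementation (the actual step uses RELEASE_LEASE_DELAY=20m):

```shell
#!/usr/bin/env bash
# Minimal demo of the delayed-release pattern: a background process
# sleeps, then releases the lease; the parent can kill it on retry.
# The 1s delay and the echo body are stand-ins for illustration only.
RELEASE_LEASE_DELAY="${RELEASE_LEASE_DELAY:-1}"

release_install_lease_atomic() { echo "lease released"; }

release_install_lease_delayed_atomic() {
  sleep "$RELEASE_LEASE_DELAY"
  release_install_lease_atomic
}

outfile=$(mktemp)
release_install_lease_delayed_atomic > "$outfile" &
release_install_lease_pid=$!
wait "$release_install_lease_pid"
result=$(cat "$outfile")
echo "$result"
rm -f "$outfile"
```

On a retry, the main loop kills this background process by its PID before acquiring a fresh lease, which is why the PID is recorded in release_install_lease_pid.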

Since at least two processes are involved in this workflow, the functions acquire_lease and release_lease are atomic; they rely on the flock synchronization primitive.
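The flock-based pattern can be sketched as follows; the lock file path and the echo bodies are hypothetical stand-ins for the real lease-proxy calls, which are not shown here:

```shell
#!/usr/bin/env bash
# Sketch of making a critical section atomic with flock(1).
# LOCK_FILE and the echo bodies are illustrative only.
LOCK_FILE="${TMPDIR:-/tmp}/install-lease.demo.lock"

acquire_install_lease_atomic() {
  # The subshell holds an exclusive lock on fd 9 for the duration of
  # the critical section; concurrent callers block until it is free.
  (
    flock -x 9
    echo "acquiring lease"
  ) 9>"$LOCK_FILE"
}

release_install_lease_atomic() {
  (
    flock -x 9
    echo "releasing lease"
  ) 9>"$LOCK_FILE"
}

acquired=$(acquire_install_lease_atomic)
released=$(release_install_lease_atomic)
echo "$acquired"
echo "$released"
```

Because the lock is tied to a file descriptor, it is released automatically when the subshell exits, even if the critical section fails partway through.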

The lease proxy client scripts are always available (source "$LEASE_PROXY_CLIENT_SH"); see openshift/ci-tools#5010.
They were defined in #75306.

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danilo-gemoli

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2026
@openshift-ci openshift-ci bot requested review from dgoodwin and stbenjam March 13, 2026 14:03
@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

@danilo-gemoli: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/step-registry-shellcheck c77b570 link true /test step-registry-shellcheck

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@danilo-gemoli: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name Repo Type Reason
pull-ci-red-hat-data-services-ods-ci-release-2.19-rhoai-ocp4.19-interop-rhoai-interop-aws red-hat-data-services/ods-ci presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-main-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-5.0-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.23-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.22-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.21-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.20-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.19-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.18-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.17-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.16-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-openshift-custom-metrics-autoscaler-operator-release-4.15-cma-e2e-aws-ovn openshift/custom-metrics-autoscaler-operator presubmit Registry content changed
pull-ci-redhat-developer-intellij-openshift-connector-main-e2e-openshift redhat-developer/intellij-openshift-connector presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-main-okd-scos-e2e-aws-ovn openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.21-okd-scos-e2e-aws-ovn openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-main-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-5.0-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.23-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.22-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.21-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.20-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.19-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.18-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.17-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed
pull-ci-openshift-baremetal-runtimecfg-release-4.16-e2e-openstack openshift/baremetal-runtimecfg presubmit Registry content changed

A total of 28371 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here
Prior to this PR being merged, you will need to either run and acknowledge or opt to skip these rehearsals.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@danilo-gemoli
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2026
cp -rfpv "$backup" "$dir"
else
date "+%F %X" > "${SHARED_DIR}/CLUSTER_INSTALL_START_TIME"
acquire_install_lease_atomic
Contributor

Suggested change:
- acquire_install_lease_atomic
+ acquire_install_lease_atomic || true

Perhaps? I really don't care if this fails. If it works, it will have a positive impact. If it doesn't it shouldn't change the situation.

Contributor Author

That's fine, so this is just an optimization and the whole installation process shouldn't fail if we can't acquire such a lease.

Member

@stbenjam left a comment

Really nice idea

}
export -f release_and_acquire_install_lease_atomic

trap 'release_install_lease_atomic' EXIT TERM INT
Member

This trap overwrites previously set traps (prepare_next_steps), which I think will cause problems.

Simple example

#!/bin/bash
function prepare_next_steps() {
    echo "prepare_next_steps called"
}

function release_install_lease_atomic() {
    echo "release_install_lease_atomic called"
}

trap 'prepare_next_steps' EXIT TERM INT
trap 'release_install_lease_atomic' EXIT TERM INT
echo "End, watch which trap fires"

Contributor

Agree. We can use trap chaining. Something like:

add_trap() {
    local new_cmd="$1"
    local signal="$2"
    
    # Extract the current trap command
    local existing_cmd
    existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
    
    if [[ -z "$existing_cmd" ]]; then
        trap "$new_cmd" "$signal"
    else
        # Prepend or append; usually appending is safer for cleanup
        trap "$existing_cmd; $new_cmd" "$signal"
    fi
}
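A quick self-contained check that this chaining preserves earlier handlers, restricted to EXIT. Note that for signals such as TERM, bash's trap -p may print the name as SIGTERM, so the sed pattern above would need adjusting for those:

```shell
#!/usr/bin/env bash
# Demo that add_trap preserves earlier handlers: both echoes fire
# when the subshell exits. EXIT only, where `trap -p` prints the
# signal name verbatim.
add_trap() {
  local new_cmd="$1" signal="$2"
  local existing_cmd
  existing_cmd=$(trap -p "$signal" | sed "s/trap -- '\(.*\)' $signal/\1/")
  if [[ -z "$existing_cmd" ]]; then
    trap "$new_cmd" "$signal"
  else
    trap "$existing_cmd; $new_cmd" "$signal"
  fi
}

# Run in a subshell so the EXIT trap fires and its output is captured.
chained_output=$(
  add_trap 'echo "prepare_next_steps called"' EXIT
  add_trap 'echo "release_install_lease_atomic called"' EXIT
)
echo "$chained_output"
```

Both handler messages appear in the captured output, in registration order, confirming that the second add_trap appended rather than replaced.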

fi
export INSTALL_LEASE_ENABLED

export RELEASE_LEASE_DELAY=20m
Member

If 50 jobs start at the exact same time (e.g. release controller) and all get a lease, the installers will likely all be doing the same things at the same time on the same cloud provider. They'll then all release around the exact same time, and another 50 might start.

We might want to stagger lease acquisition randomly, by having a delay before we acquire the lease. I tried to solve a similar problem in openshift/release-controller#737, but it is simpler to do here.

  delay=$(( RANDOM % 901 ))
  printf 'Waiting %dm%ds before acquiring install lease\n' $(( delay / 60 )) $(( delay % 60 ))
  sleep $delay
  acquire_install_lease_atomic

Contributor

That could be an optimization, but the intent here is to ratchet down to a known sustainable number of concurrent installers. Both in terms of count & duration. Once we find the sweet spot, introducing jitter could allow us to increase count, but at this point, non-determinism could confuse the ratcheting process.

Member

@stbenjam Mar 13, 2026

non-determinism could confuse the ratcheting process

What would be confused? Jitter before install start would be more valuable than limiting concurrent installs, IMHO. There's a huge number of aggregated jobs that hit 1-2 minute blips in build infrastructure that end up killing the entire payload because we go below the threshold we need for statistical confidence. Not to mention the thundering herd on specific cloud resources (e.g. all RC jobs creating load balancers at the same time)

Member

@stbenjam Mar 13, 2026

Without jitter, I think this PR will make the situation worse. As it applies globally to everything installed for that cloud provider, we're going to start triggering more installs to occur in simultaneous waves.

Imagine we limit to 50 concurrent installs.

  • RC triggers 50 jobs. Over those 20 minutes, 200 more pile up. 200 jobs are now sitting in a "wait" state (instead of starting at least offset from each other a little bit).

  • The moment those first 50 leases expire (at exactly t + 20 minutes), another 50 will start at the exact same time.

  • 20 minutes later, 50 more start.

Instead of a chaotic but distributed flow -- with peaks and valleys -- you’ve created a square wave pattern. You will see 100% utilization of the lease bucket, and always have high numbers of installs starting at the exact same time, compounding the problem of "all the installers doing the same thing in the cloud at the same time"

@danilo-gemoli
Contributor Author

@jupierce @stbenjam I have improved the design of the library; have a look at #76538. Use Case 4 - Refresh install leases applies here.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
