
ci: add stagger gate to spread out CI runs during rebase storms #466

Merged
theihor merged 1 commit into master from stagger
Mar 17, 2026

@theihor theihor commented Mar 17, 2026

When KPD rebases all pending PR branches after an upstream commit lands (on bpf-next/master or bpf/master) or after a vmtest CI update, hundreds of workflow runs are triggered within seconds, which may cause various glitches due to the increased load on the runners.

Add a stagger script that runs as the first step of the set-matrix job and detects the "storm" condition by checking:

  • This is a PR synchronize event (force-push rebase, not a new PR)
  • The base branch was updated within the last 30 minutes (KPD just mirrored upstream)
  • Active workflow runs (queued + in-progress) are at least half the number of open PRs, indicating a bulk rebase rather than normal organic CI activity
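
The three checks above can be read as a single predicate. The following is only a minimal sketch under stated assumptions, not the script added by this PR: the function name `is_storm`, its parameters, and the handling of zero open PRs are hypothetical, and the inputs are assumed to have been fetched elsewhere (for example from the GitHub API).

```python
from datetime import datetime, timedelta, timezone

# The 30-minute window comes from the description; the constant name is hypothetical.
BASE_FRESHNESS = timedelta(minutes=30)


def is_storm(event_action, base_updated_at, now, active_runs, open_prs):
    """Return True only when all three storm conditions hold."""
    # 1. Only force-push rebases of existing PRs count, not newly opened PRs.
    if event_action != "synchronize":
        return False
    # 2. The base branch must have been updated within the last 30 minutes.
    if now - base_updated_at > BASE_FRESHNESS:
        return False
    # Assumption: with no open PRs there is nothing to stagger.
    if open_prs <= 0:
        return False
    # 3. Active runs (queued + in-progress) are at least half the open PRs.
    return 2 * active_runs >= open_prs


now = datetime(2026, 3, 17, 12, 0, tzinfo=timezone.utc)
print(is_storm("synchronize", now - timedelta(minutes=10), now, 60, 100))
```

Comparing `2 * active_runs >= open_prs` keeps the "at least half" test in integer arithmetic, so an odd number of open PRs needs no rounding rule.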

When all conditions are met, the script sleeps in a loop for random intervals of 1 to 15 minutes. As runs complete or get cancelled, the ratio drops and waiting runs proceed naturally. A hard cap of 2 hours prevents indefinite waiting.
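
The waiting behaviour described here can be sketched as a capped loop. This is an illustration under assumptions rather than the merged script: `stagger`, `storm_active`, and the injected `sleep` callable are hypothetical names, and trimming the final sleep so the total never exceeds the cap is a choice made for this sketch.

```python
import random

MIN_SLEEP_S = 60            # 1 minute
MAX_SLEEP_S = 15 * 60       # 15 minutes
HARD_CAP_S = 2 * 60 * 60    # 2 hours


def stagger(storm_active, sleep, rng=random):
    """Sleep in random intervals while the storm lasts; return seconds waited."""
    waited = 0
    while waited < HARD_CAP_S and storm_active():
        delay = rng.randint(MIN_SLEEP_S, MAX_SLEEP_S)
        # Assumption: never sleep past the hard cap.
        delay = min(delay, HARD_CAP_S - waited)
        sleep(delay)
        waited += delay
    return waited


# Dry run with a fake sleep: the storm is seen three times, then clears.
answers = iter([True, True, True, False])
sleeps = []
total = stagger(lambda: next(answers), sleeps.append)
print(len(sleeps), total == sum(sleeps))
```

In a real job `sleep` would be `time.sleep` and `storm_active` would re-query the run and PR counts on every iteration, which is what lets waiting runs proceed once the ratio drops.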

Because the workflow already uses cancel-in-progress on a per-branch concurrency group, a newer force-push will cancel a sleeping set-matrix job before any expensive build/test work starts.

During normal operation (developer pushes, single PR rebases, new PRs) the storm condition is never true and the script exits immediately with zero delay.

Assisted-by: Claude:claude-opus-4-6

Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
@theihor theihor merged commit 57333ff into master Mar 17, 2026
71 checks passed
@theihor theihor deleted the stagger branch March 17, 2026 22:11