
Prevent infinite retries of autoscaling #9574

Closed

Pearl1594 wants to merge 1 commit into apache:4.19 from shapeblue:fix-infinite-autoscaling-retries

Conversation

@Pearl1594
Contributor

Description

This PR fixes: #9318

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@Pearl1594
Contributor Author

@weizhouapache I would like some advice on this issue. Do you think that if the number of all VMs, including those in Error and Stopped states, is >= the max size, then we should stop scaling any further? Or do you think that if there are VMs in the Error state we should retry for a few iterations?

@codecov

codecov bot commented Aug 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.30%. Comparing base (6e6a276) to head (463d2a6).
Report is 302 commits behind head on 4.19.

❗ There is a different number of reports uploaded between BASE (6e6a276) and HEAD (463d2a6).

HEAD has 1 upload less than BASE
Flag BASE (6e6a276) HEAD (463d2a6)
unittests 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #9574       +/-   ##
============================================
- Coverage     15.08%   4.30%   -10.79%     
============================================
  Files          5406     366     -5040     
  Lines        472889   29514   -443375     
  Branches      57738    5162    -52576     
============================================
- Hits          71352    1270    -70082     
+ Misses       393593   28100   -365493     
+ Partials       7944     144     -7800     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown.


```diff
  sc.setParameters("vmGroupId", vmGroupId);
  sc.setJoinParameters("vmSearch", "states",
-         State.Starting, State.Running, State.Stopping, State.Migrating);
+         State.Starting, State.Running, State.Stopping, State.Migrating, State.Error, State.Stopped);
```
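The one-line change above widens the set of VM states counted toward the group's size, so VMs stuck in Error or Stopped no longer look like missing capacity that triggers yet another deployment. A minimal, self-contained sketch of that counting logic (illustrative names only, not actual CloudStack code):

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class AutoScaleVmCounter {
    enum State { Starting, Running, Stopping, Migrating, Error, Stopped }

    // States counted toward group size before the fix
    static final Set<State> ACTIVE =
            EnumSet.of(State.Starting, State.Running, State.Stopping, State.Migrating);

    // With the fix, Error and Stopped VMs also count, so failed
    // deployments stop triggering new ones
    static final Set<State> COUNTED =
            EnumSet.of(State.Starting, State.Running, State.Stopping,
                    State.Migrating, State.Error, State.Stopped);

    // Count only the VMs whose state is in the given set
    static long countVms(List<State> vmStates, Set<State> counted) {
        return vmStates.stream().filter(counted::contains).count();
    }

    // Scale up only while the counted size is below the group's max size
    static boolean shouldScaleUp(List<State> vmStates, int maxSize, Set<State> counted) {
        return countVms(vmStates, counted) < maxSize;
    }
}
```

With the original state set, a group of max size 3 holding one Running and two Error VMs appears to have only one member and keeps scaling up; with the widened set it is considered full.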
Contributor

This may make the autoscaler not retry deployment if any deployment goes into the Error state intermittently. Should we include the Error state only after some n retries?

Contributor Author

Yeah, that was my worry too; I wanted some input on it. I'll do that. Thanks @shwstppr

@DaanHoogland changed the title from "Prevent infinite retires of autoscaling" to "Prevent infinite retries of autoscaling" on Aug 23, 2024
@weizhouapache
Member

@Pearl1594
I am OK with excluding the VMs in Error state from the VM count.
Should we consider Stopped VMs as available in the group?

My other advice would be:

  • add a setting for max retries (as @shwstppr advised)
  • change the group to another state (for example Disabled, or a new state Alert?) after max retries so it will not be retried
  • users can change it back to Enabled when the root cause is solved

cc @DaanHoogland @btzq
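The advice above can be sketched as a small guard object: a configurable retry cap, a transition to a Disabled-like state once the cap is hit, and a manual re-enable once the root cause is fixed. This is a hypothetical illustration of the proposed behaviour, not CloudStack's actual implementation; all names are invented:

```java
// Hypothetical sketch of the suggested max-retries safeguard.
class AutoScaleRetryGuard {
    enum GroupState { Enabled, Disabled }

    private final int maxRetries;          // would come from a new global setting
    private int consecutiveFailures = 0;
    private GroupState state = GroupState.Enabled;

    AutoScaleRetryGuard(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Called after each scale-up attempt
    void recordAttempt(boolean deploymentSucceeded) {
        if (deploymentSucceeded) {
            consecutiveFailures = 0;       // a success resets the counter
            return;
        }
        if (++consecutiveFailures >= maxRetries) {
            state = GroupState.Disabled;   // stop retrying until re-enabled
        }
    }

    boolean mayScaleUp() {
        return state == GroupState.Enabled;
    }

    // User action once the underlying problem is resolved
    void enable() {
        state = GroupState.Enabled;
        consecutiveFailures = 0;
    }
}
```

The key design point in the suggestion is that the group stops retrying automatically but stays visible in a distinct state, so the operator can see it failed rather than silently losing capacity.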

@Pearl1594
Contributor Author

Pearl1594 commented Jan 21, 2025

I added the Stopped state imagining a scenario where, for whatever reason, a host goes down: all VMs belonging to the autoscale group on that host would enter the Stopped state, which would cause additional VMs to be deployed. I thought of that as a problematic scenario, but maybe that's how it should behave; I am not sure of the scope of autoscaling. Maybe you could shed some light on this, @weizhouapache.

@btzq

btzq commented Jan 28, 2025

> @Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?
>
> my other advice would be
>
> • add a setting for max retries (as @shwstppr advised)
> • change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried
> • users can change back to Enabled when the root cause issue is solved
>
> cc @DaanHoogland @btzq

Yeah, I think that works. As a user, even if there is something wrong with a VM, I'd still like it to be counted under the ASG group. If not, I would not know I had a faulty VM, as I would only find out by going through my list of VMs outside the ASG.

@Pearl1594
Contributor Author

Closing this, as #11244 addresses it.

@Pearl1594 Pearl1594 closed this Jul 25, 2025

Development

Successfully merging this pull request may close these issues.

Autoscale Group creates Infinite Number Of VMs when unable to startup

5 participants