
Prevent infinite retries of autoscaling #9574

Closed

Pearl1594 wants to merge 1 commit into apache:4.19 from shapeblue:fix-infinite-autoscaling-retries

Conversation

@Pearl1594
Contributor

Description

This PR fixes: #9318

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI
  • test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

@Pearl1594
Contributor Author

@weizhouapache I would like some advice on this issue. Do you think that if the number of all VMs, including those in Error and Stopped states, is >= the max size, then we should stop scaling any further? Or do you think that if there are VMs in the Error state we should retry for a few iterations?

@codecov

codecov bot commented Aug 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 4.30%. Comparing base (6e6a276) to head (463d2a6).
Report is 302 commits behind head on 4.19.

❗ There is a different number of reports uploaded between BASE (6e6a276) and HEAD (463d2a6).

HEAD has 1 upload less than BASE
Flag BASE (6e6a276) HEAD (463d2a6)
unittests 1 0
Additional details and impacted files
@@             Coverage Diff              @@
##               4.19   #9574       +/-   ##
============================================
- Coverage     15.08%   4.30%   -10.79%     
============================================
  Files          5406     366     -5040     
  Lines        472889   29514   -443375     
  Branches      57738    5162    -52576     
============================================
- Hits          71352    1270    -70082     
+ Misses       393593   28100   -365493     
+ Partials       7944     144     -7800     
Flag Coverage Δ
uitests 4.30% <ø> (ø)
unittests ?

Flags with carried forward coverage won't be shown.


```diff
  sc.setParameters("vmGroupId", vmGroupId);
  sc.setJoinParameters("vmSearch", "states",
-         State.Starting, State.Running, State.Stopping, State.Migrating);
+         State.Starting, State.Running, State.Stopping, State.Migrating, State.Error, State.Stopped);
```
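The one-line change above widens the set of VM states counted toward the group's size, so VMs stuck in Error or Stopped no longer look like missing capacity that triggers yet another deployment. A minimal, self-contained sketch of that counting logic (illustrative names only, not actual CloudStack code):

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class AutoScaleVmCounter {
    enum State { Starting, Running, Stopping, Migrating, Error, Stopped }

    // States counted toward group size before the fix
    static final Set<State> ACTIVE =
            EnumSet.of(State.Starting, State.Running, State.Stopping, State.Migrating);

    // With the fix, Error and Stopped VMs also count, so failed
    // deployments stop triggering new ones
    static final Set<State> COUNTED =
            EnumSet.of(State.Starting, State.Running, State.Stopping,
                    State.Migrating, State.Error, State.Stopped);

    // Count only the VMs whose state is in the given set
    static long countVms(List<State> vmStates, Set<State> counted) {
        return vmStates.stream().filter(counted::contains).count();
    }

    // Scale up only while the counted size is below the group's max size
    static boolean shouldScaleUp(List<State> vmStates, int maxSize, Set<State> counted) {
        return countVms(vmStates, counted) < maxSize;
    }
}
```

With the original state set, a group of max size 3 holding one Running and two Error VMs appears to have only one member and keeps scaling up; with the widened set it is considered full.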
Contributor

This may make the autoscaler not retry deployment if any deployment goes into the Error state intermittently. Should we include the Error state only after some n retries?

Contributor Author

Yeah, that was my worry too; I wanted some input on it. I'll do that. Thanks @shwstppr

@DaanHoogland changed the title from "Prevent infinite retires of autoscaling" to "Prevent infinite retries of autoscaling" on Aug 23, 2024
@weizhouapache
Member

@Pearl1594
I am OK with excluding the VMs in Error state from the VM count.
Should we consider Stopped VMs as available in the group?

My other advice would be:

  • add a setting for max retries (as @shwstppr advised)
  • change the group to another state (for example Disabled, or a new state Alert?) after max retries so it will not be retried
  • users can change it back to Enabled when the root cause is solved

cc @DaanHoogland @btzq
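The advice above can be sketched as a small guard object: a configurable retry cap, a transition to a Disabled-like state once the cap is hit, and a manual re-enable once the root cause is fixed. This is a hypothetical illustration of the proposed behaviour, not CloudStack's actual implementation; all names are invented:

```java
// Hypothetical sketch of the suggested max-retries safeguard.
class AutoScaleRetryGuard {
    enum GroupState { Enabled, Disabled }

    private final int maxRetries;          // would come from a new global setting
    private int consecutiveFailures = 0;
    private GroupState state = GroupState.Enabled;

    AutoScaleRetryGuard(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Called after each scale-up attempt
    void recordAttempt(boolean deploymentSucceeded) {
        if (deploymentSucceeded) {
            consecutiveFailures = 0;       // a success resets the counter
            return;
        }
        if (++consecutiveFailures >= maxRetries) {
            state = GroupState.Disabled;   // stop retrying until re-enabled
        }
    }

    boolean mayScaleUp() {
        return state == GroupState.Enabled;
    }

    // User action once the underlying problem is resolved
    void enable() {
        state = GroupState.Enabled;
        consecutiveFailures = 0;
    }
}
```

The key design point in the suggestion is that the group stops retrying automatically but stays visible in a distinct state, so the operator can see it failed rather than silently losing capacity.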

@Pearl1594
Contributor Author

Pearl1594 commented Jan 21, 2025

I added the Stopped state imagining a scenario where, for whatever reason, a host goes down: all VMs belonging to the autoscale group on that host would enter the Stopped state, which would cause additional VMs to be deployed. I thought of that as a problematic scenario, but maybe that's how it should behave; I am not sure of the scope of autoscaling. Maybe you could shed some light on this, @weizhouapache.

@btzq

btzq commented Jan 28, 2025

> @Pearl1594 I am ok with excluding the vms in Error state in the vm count. should we consider Stopped vms as available in the group ?
>
> my other advice would be
>
> • add a setting for max retries (as @shwstppr advised)
> • change to another state (for example Disabled, or new state Alert ?) after max retries so it will not be retried
> • users can change back to Enabled when the root cause issue is solved
>
> cc @DaanHoogland @btzq

Yeah, I think that works. As a user, even if there is something wrong with a VM, I'd still like it to be counted under the ASG group. If not, I would not know I had a faulty VM, as I would only find out by going through my list of VMs outside the ASG.

@Pearl1594
Contributor Author

Closing this, as #11244 addresses it.

@Pearl1594 Pearl1594 closed this Jul 25, 2025

Development

Successfully merging this pull request may close these issues.

Autoscale Group creates Infinite Number Of VMs when unable to startup

5 participants