feat: auto-scale CPU nodes (2-30) instead of fixed 60 #37
Open
wdvr wants to merge 1 commit into
CPU ASGs (cpu-arm, cpu-x86) now scale dynamically between 2-30 nodes based on demand, instead of running 30 fixed instances each 24/7.

The availability updater Lambda manages scaling:
- Scales UP when spare slots < 2 (adds nodes in ~3min)
- Scales DOWN when spare slots > 8 (with hysteresis to avoid flapping)
- Uses instance protection to prevent terminating nodes with active pods

Changes:
- main.tf: CPU types get min/max_instance_count, desired starts at 2
- eks.tf: ASG uses min/max when present, lifecycle ignores desired_capacity
- availability_updater Lambda: autoscaling logic + instance protection
- availability.tf: IAM permissions for SetDesiredCapacity, SetInstanceProtection
- CLI: shows "current / scalable" totals and "~3min (scaling up)" wait estimate
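
For reviewers, the scaling rule described above can be sketched as two pure functions. This is a minimal illustration, not the Lambda's actual code. The function names, the one-node step size, and the shape of the pod-count input are assumptions. Only the thresholds (spare slots < 2, spare slots > 8) and the 2-30 bounds come from this PR. In the Lambda, the results would presumably feed the SetDesiredCapacity and SetInstanceProtection calls named in the IAM changes.

```python
from __future__ import annotations

# Bounds and thresholds quoted in the PR description.
MIN_NODES = 2
MAX_NODES = 30
SCALE_UP_BELOW = 2    # scale up when spare slots < 2
SCALE_DOWN_ABOVE = 8  # scale down when spare slots > 8


def next_desired_capacity(current_nodes: int, spare_slots: int, step: int = 1) -> int:
    """Return the next desired node count for one CPU ASG.

    `step` (nodes added or removed per run) is an assumption; the PR does
    not state how many nodes are changed at a time.
    """
    if spare_slots < SCALE_UP_BELOW:
        target = current_nodes + step
    elif spare_slots > SCALE_DOWN_ABOVE:
        target = current_nodes - step
    else:
        # Dead band between the two thresholds: this gap is what provides
        # hysteresis and keeps the ASG from flapping.
        target = current_nodes
    # Clamp to the ASG's min/max bounds.
    return max(MIN_NODES, min(MAX_NODES, target))


def protection_changes(pods_per_instance: dict[str, int]) -> tuple[list[str], list[str]]:
    """Split instance IDs into (protect, unprotect) by active pod count.

    Instances with at least one active pod get scale-in protection so a
    scale-down cannot terminate them; idle instances are left unprotected.
    """
    protect = sorted(i for i, n in pods_per_instance.items() if n > 0)
    unprotect = sorted(i for i, n in pods_per_instance.items() if n == 0)
    return protect, unprotect
```

Keeping the decision separate from the AWS calls makes the thresholds easy to unit-test without mocking the Auto Scaling API.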
Summary
- CPU ASGs (cpu-arm, cpu-x86) now scale dynamically between 2-30 nodes based on demand, instead of running 30 fixed instances each 24/7
- gpu-dev avail shows scalable capacity (6 / 90) and ~3min (scaling up) wait estimate

Files changed
- main.tf: instance_count=2, min_instance_count=2, max_instance_count=30 (both workspaces)
- eks.tf: min_size/max_size use new fields via try(), lifecycle { ignore_changes = [desired_capacity] }
- availability.tf: SetDesiredCapacity, SetInstanceProtection, ec2:DescribeInstances
- availability_updater/index.py: scalable_total in DynamoDB
- cli.py + reservations.py

Test plan
- terraform plan: confirm CPU ASGs change from 30/30/30 to 2/2/30, GPU ASGs unchanged
- tf apply: CPU nodes scale down to 2 each
- gpu-dev avail shows total / scalable_total format for CPU types and ~3min (scaling up) when at 0
- try() falls back to instance_count
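
To make the gpu-dev avail check in the test plan concrete, here is a hedged sketch of the display logic. The helper names and signatures are hypothetical and are not taken from cli.py or reservations.py. Two behaviours are assumed rather than stated in the PR: types without a scalable_total show only the plain total, and "at 0" means zero available slots.

```python
from typing import Optional

# Wait-estimate string quoted in the PR description.
SCALE_UP_WAIT = "~3min (scaling up)"


def format_total(total: int, scalable_total: Optional[int] = None) -> str:
    """Render the capacity column as "total / scalable_total".

    Assumption: types with no scalable_total (or one equal to total)
    show only the plain total.
    """
    if scalable_total is None or scalable_total == total:
        return str(total)
    return f"{total} / {scalable_total}"


def wait_estimate(available: int, total: int,
                  scalable_total: Optional[int] = None) -> Optional[str]:
    """Return the scale-up wait estimate, or None if it does not apply.

    Assumption: the estimate is shown only when nothing is available
    and the type still has room to scale up.
    """
    if available == 0 and scalable_total is not None and total < scalable_total:
        return SCALE_UP_WAIT
    return None


if __name__ == "__main__":
    print(format_total(6, 90))      # capacity column for a scalable type
    print(wait_estimate(0, 6, 90))  # wait column when nothing is available
```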