feat: auto-scale CPU nodes (2-30) instead of fixed 60 #37

Open
wdvr wants to merge 1 commit into main from feat/cpu-autoscaling

wdvr commented Mar 4, 2026

Summary

  • CPU ASGs (cpu-arm, cpu-x86) now scale dynamically between 2-30 nodes based on demand, instead of running 30 fixed instances each 24/7
  • The availability updater Lambda manages scaling: scales up when spare slots < 2, scales down (with hysteresis) when spare slots > 8
  • Instance protection prevents terminating nodes with active gpu-dev pods during scale-down
  • The CLI command gpu-dev avail shows current / scalable capacity (6 / 90) and a "~3min (scaling up)" wait estimate
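
The scaling rule in the summary can be sketched as a small pure function. This is an illustration only, not the code in availability_updater/index.py: the function name and the one-node step on scale-down are assumptions (the test plan only states "desired +1" for scale-up), and any extra cooldown the Lambda applies is not modeled. The thresholds and the 2-30 bounds come from the description above.

```python
# Thresholds and bounds taken from the PR description.
SCALE_UP_BELOW = 2     # scale up when spare slots < 2
SCALE_DOWN_ABOVE = 8   # scale down when spare slots > 8
MIN_NODES = 2
MAX_NODES = 30


def next_desired_capacity(current_desired: int, spare_slots: int, step: int = 1) -> int:
    """Return the desired node count for the next Lambda run (hypothetical helper).

    The one-node `step` in both directions is an assumption for illustration.
    """
    if spare_slots < SCALE_UP_BELOW:
        target = current_desired + step
    elif spare_slots > SCALE_DOWN_ABOVE:
        target = current_desired - step
    else:
        # Between the two thresholds nothing changes. This dead band is what
        # keeps the group from flapping between scale-up and scale-down.
        target = current_desired
    # Never leave the ASG's min/max range.
    return max(MIN_NODES, min(MAX_NODES, target))
```

With 2 to 8 spare slots the desired capacity is left alone, so a single reservation or cancellation near one threshold cannot immediately trigger the opposite action.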

Files changed

| File | Change |
| --- | --- |
| main.tf | CPU types: instance_count=2, min_instance_count=2, max_instance_count=30 (both workspaces) |
| eks.tf | ASG min_size/max_size use new fields via try(), lifecycle { ignore_changes = [desired_capacity] } |
| availability.tf | IAM: SetDesiredCapacity, SetInstanceProtection, ec2:DescribeInstances |
| availability_updater/index.py | Autoscaling logic, instance protection, scalable_total in DynamoDB |
| cli.py + reservations.py | Display scalable capacity and scale-up time estimate |

Test plan

  • terraform plan — confirm CPU ASGs change from 30/30/30 to 2/2/30, GPU ASGs unchanged
  • After terraform apply: CPU nodes scale down to 2 each
  • Scale-up test: reserve CPU slot → Lambda detects < 2 spare → sets desired +1 → node joins in ~3min
  • Scale-down test: cancel all CPU reservations → Lambda detects > 8 spare → protects occupied nodes → reduces desired
  • gpu-dev avail shows total / scalable_total format for CPU types and ~3min (scaling up) when at 0
  • GPU ASGs remain completely unaffected: they define no min_instance_count/max_instance_count fields, so try() falls back to instance_count
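
To check the CLI output by hand, the expected display can be sketched as below. This is an assumption-laden illustration, not the code in cli.py or reservations.py: both function names are hypothetical, and returning None when no scale-up estimate applies is a placeholder for whatever the CLI prints in that case.

```python
from typing import Optional


def format_capacity(total: int, scalable_total: Optional[int] = None) -> str:
    """Render "total / scalable_total" for scalable types, plain total otherwise."""
    if scalable_total is None or scalable_total == total:
        # Fixed-size types (e.g. GPU ASGs) have nothing extra to show.
        return str(total)
    return f"{total} / {scalable_total}"


def format_scale_up_wait(available: int, total: int,
                         scalable_total: Optional[int] = None) -> Optional[str]:
    """Return the scale-up estimate when nothing is free but the type can still grow."""
    can_grow = scalable_total is not None and total < scalable_total
    if available == 0 and can_grow:
        return "~3min (scaling up)"
    # No scale-up estimate applies; the real CLI's wording here is not known.
    return None
```

For a CPU type with 6 current slots and 90 scalable slots, `format_capacity(6, 90)` gives `6 / 90`, and `format_scale_up_wait(0, 6, 90)` gives the scale-up estimate.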

CPU ASGs (cpu-arm, cpu-x86) now scale dynamically between 2-30 nodes
based on demand, instead of running 30 fixed instances each 24/7.

The availability updater Lambda manages scaling:
- Scales UP when spare slots < 2 (adds nodes in ~3min)
- Scales DOWN when spare slots > 8 (with hysteresis to avoid flapping)
- Uses instance protection to prevent terminating nodes with active pods
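
A minimal sketch of the instance-protection step, assuming `asg_client` is a boto3 Auto Scaling client (`boto3.client("autoscaling")`) and that the Lambda already knows which instances host active pods. The helper name, its arguments, and the choice to clear protection on idle nodes are assumptions for illustration, not the PR's actual code.

```python
from typing import Iterable, List, Tuple


def sync_scale_in_protection(asg_client, asg_name: str,
                             instance_ids: Iterable[str],
                             occupied_ids: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Protect occupied nodes from scale-in and unprotect idle ones (hypothetical helper)."""
    in_group = set(instance_ids)
    occupied = sorted(in_group & set(occupied_ids))
    idle = sorted(in_group - set(occupied_ids))

    if occupied:
        # Nodes with active pods must not be picked when desired capacity drops.
        asg_client.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=occupied,
            ProtectedFromScaleIn=True,
        )
    if idle:
        # Clearing protection on idle nodes is an assumption: without it,
        # a node protected earlier could never be removed.
        asg_client.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=idle,
            ProtectedFromScaleIn=False,
        )
    return occupied, idle
```

Running this before lowering the desired capacity means the ASG can only choose unprotected (idle) instances for termination.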

Changes:
- main.tf: CPU types get min/max_instance_count, desired starts at 2
- eks.tf: ASG uses min/max when present, lifecycle ignores desired_capacity
- availability_updater Lambda: autoscaling logic + instance protection
- availability.tf: IAM permissions for SetDesiredCapacity, SetInstanceProtection
- CLI: shows "current / scalable" totals and "~3min (scaling up)" wait estimate