feat: auto-scale CPU nodes (2-30) instead of fixed 60 by wdvr · Pull Request #37 · wdvr/osdc

wdvr · 2026-03-04T05:10:00Z

Summary

- CPU ASGs (`cpu-arm`, `cpu-x86`) now scale dynamically between 2 and 30 nodes each based on demand, instead of running 30 fixed instances each 24/7.
- The availability updater Lambda manages scaling: it scales up when spare slots < 2 and scales down (with hysteresis) when spare slots > 8.
- Instance protection prevents nodes with active gpu-dev pods from being terminated during scale-down.
- The CLI command `gpu-dev avail` shows scalable capacity (e.g. `6 / 90`) and a `~3min (scaling up)` wait estimate.
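The scaling rule in the summary can be sketched as a small pure function. This is an illustration only, not the code in `availability_updater/index.py`: the function name, the constant names, and the scale-down step of one node are assumptions. The thresholds, the 2-30 bounds, and the +1 scale-up step come from this PR description. Hysteresis is modeled here as the dead band between the two thresholds; the Lambda may use a different mechanism, such as a cooldown.

```python
# Thresholds and bounds taken from the PR description.
SCALE_UP_BELOW = 2     # scale up when spare slots < 2
SCALE_DOWN_ABOVE = 8   # scale down when spare slots > 8
MIN_NODES = 2
MAX_NODES = 30


def desired_node_count(current_nodes: int, spare_slots: int, step: int = 1) -> int:
    """Return the next desired capacity for one CPU ASG.

    Hypothetical helper: `step` for scale-down is an assumption; the PR
    only states that scale-up adds one node.
    """
    if spare_slots < SCALE_UP_BELOW:
        target = current_nodes + step
    elif spare_slots > SCALE_DOWN_ABOVE:
        target = current_nodes - step
    else:
        # Dead band between the thresholds: leave capacity unchanged
        # so the ASG does not flap.
        target = current_nodes
    # Never leave the ASG's min/max range.
    return max(MIN_NODES, min(MAX_NODES, target))
```

With these numbers, a value of spare slots from 2 through 8 leaves the node count untouched, which is what keeps a single reservation or cancellation from triggering back-to-back scale-up and scale-down.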

Files changed

| File | Change |
| --- | --- |
| `main.tf` | CPU types: `instance_count=2`, `min_instance_count=2`, `max_instance_count=30` (both workspaces) |
| `eks.tf` | ASG `min_size`/`max_size` use new fields via `try()`, `lifecycle { ignore_changes = [desired_capacity] }` |
| `availability.tf` | IAM: `SetDesiredCapacity`, `SetInstanceProtection`, `ec2:DescribeInstances` |
| `availability_updater/index.py` | Autoscaling logic, instance protection, `scalable_total` in DynamoDB |
| `cli.py` + `reservations.py` | Display scalable capacity and scale-up time estimate |
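For reviewers unfamiliar with scale-in protection, the following is a minimal sketch of how the `SetInstanceProtection` permission might be used. It is not the code from `availability_updater/index.py`. The function name and its parameters are hypothetical, and clearing protection on idle nodes is an assumption (the PR only says occupied nodes are protected). The only real API used is the boto3 Auto Scaling client method `set_instance_protection`.

```python
from typing import Iterable, List, Tuple


def sync_scale_in_protection(
    asg_client,
    asg_name: str,
    asg_instance_ids: Iterable[str],
    busy_instance_ids: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Protect instances that host active pods; unprotect the rest.

    `asg_client` is expected to behave like boto3.client("autoscaling").
    Returns (protected_ids, unprotected_ids), both sorted.
    """
    in_asg = set(asg_instance_ids)
    # Ignore busy instances that do not belong to this ASG.
    busy = sorted(in_asg & set(busy_instance_ids))
    idle = sorted(in_asg - set(busy_instance_ids))

    if busy:
        asg_client.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=busy,
            ProtectedFromScaleIn=True,
        )
    if idle:
        # Assumption: idle nodes are explicitly unprotected so the ASG
        # is free to terminate them when desired capacity drops.
        asg_client.set_instance_protection(
            AutoScalingGroupName=asg_name,
            InstanceIds=idle,
            ProtectedFromScaleIn=False,
        )
    return busy, idle
```

Calling this before lowering desired capacity means the ASG can only pick unprotected (idle) instances for termination. In a real Lambda the client would be `boto3.client("autoscaling")` and the ASG name would come from the Terraform-managed configuration.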

Test plan

- `terraform plan`: confirm CPU ASGs change from 30/30/30 to 2/2/30 and GPU ASGs are unchanged
- After `terraform apply`: CPU nodes scale down to 2 each
- Scale-up test: reserve CPU slot → Lambda detects < 2 spare → sets desired +1 → node joins in ~3min
- Scale-down test: cancel all CPU reservations → Lambda detects > 8 spare → protects occupied nodes → reduces desired
- `gpu-dev avail` shows the `total / scalable_total` format for CPU types and `~3min (scaling up)` when at 0
- GPU ASGs remain completely unaffected (they have no `min_instance_count`/`max_instance_count` fields, so `try()` falls back to `instance_count`)
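The display behavior checked in the test plan could look roughly like the sketch below. It is a guess at the shape of the logic, not the code in `cli.py` or `reservations.py`: both function names and their parameters are hypothetical, and the behavior for cases the PR does not describe (returning `None`) is an assumption. Only the `total / scalable_total` format and the `~3min (scaling up)` string come from this PR.

```python
from typing import Optional


def format_capacity(total: int, scalable_total: Optional[int]) -> str:
    """Render the capacity column.

    Types without a scalable total (e.g. GPU types) show a plain number;
    scalable types show "total / scalable_total".
    """
    if scalable_total is None or scalable_total == total:
        return str(total)
    return f"{total} / {scalable_total}"


def scale_up_wait(available: int, total: int, scalable_total: Optional[int]) -> Optional[str]:
    """Return the scale-up wait estimate, or None when it does not apply.

    The estimate is only shown when nothing is available right now and
    the type still has headroom to add nodes.
    """
    if available == 0 and scalable_total is not None and scalable_total > total:
        return "~3min (scaling up)"
    return None


print(format_capacity(6, 90))    # 6 / 90
print(scale_up_wait(0, 6, 90))   # ~3min (scaling up)
```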

CPU ASGs (cpu-arm, cpu-x86) now scale dynamically between 2-30 nodes based on demand, instead of running 30 fixed instances each 24/7.

The availability updater Lambda manages scaling:
- Scales UP when spare slots < 2 (adds nodes in ~3min)
- Scales DOWN when spare slots > 8 (with hysteresis to avoid flapping)
- Uses instance protection to prevent terminating nodes with active pods

Changes:
- main.tf: CPU types get min/max_instance_count, desired starts at 2
- eks.tf: ASG uses min/max when present, lifecycle ignores desired_capacity
- availability_updater Lambda: autoscaling logic + instance protection
- availability.tf: IAM permissions for SetDesiredCapacity, SetInstanceProtection
- CLI: shows "current / scalable" totals and "~3min (scaling up)" wait estimate

wdvr mentioned this pull request Mar 4, 2026

feat: use Karpenter for CPU node autoscaling (industry standard) #40

Open

6 tasks

wdvr wants to merge 1 commit into main from feat/cpu-autoscaling
