feat: use Karpenter for CPU node autoscaling (industry standard) by wdvr · Pull Request #40 · wdvr/osdc

wdvr · 2026-03-04T06:53:16Z

Summary

Replace fixed 60 CPU nodes with Karpenter-managed dynamic scaling (0-30 nodes per type). This is an alternative to PR #37's custom Lambda approach, using the industry-standard EKS autoscaling solution.

Comparison: Karpenter vs Custom Lambda

| Feature | Karpenter (this PR) | Custom Lambda (PR #37) |
|---|---|---|
| Scale-up speed | ~60 seconds (event-driven) | ~3 minutes (polling) |
| Scale-down logic | Built-in consolidation with PDB | Custom instance protection code |
| Code to maintain | ~0 lines (declarative CRDs) | ~100 lines of Python |
| Industry adoption | AWS recommended, battle-tested | Custom implementation |
| Reaction trigger | Pod pending event (immediate) | ASG lifecycle hook (delayed) |

Architecture

Karpenter controller:

- Runs on management CPU nodes (2x c5.4xlarge, managed by ASG)
- Watches for pending pods with specific node selectors
- Provisions nodes in ~60s when needed
- Consolidates idle nodes after 60s
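
As a rough illustration of the "pending pods with specific node selectors" trigger, the sketch below filters Kubernetes pod objects (as plain dicts) down to those the scheduler has marked unschedulable. It is not Karpenter's actual code; the function names and the example selector are hypothetical.

```python
def is_unschedulable(pod):
    """True if the pod is Pending and the scheduler reported it Unschedulable."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return False
    return any(
        cond.get("type") == "PodScheduled"
        and cond.get("status") == "False"
        and cond.get("reason") == "Unschedulable"
        for cond in status.get("conditions", [])
    )


def pods_needing_nodes(pods, wanted_selector):
    """Return unschedulable pods whose nodeSelector contains every wanted label.

    wanted_selector is a hypothetical dict such as {"kubernetes.io/arch": "arm64"}.
    """
    result = []
    for pod in pods:
        selector = pod.get("spec", {}).get("nodeSelector") or {}
        if is_unschedulable(pod) and all(
            selector.get(key) == value for key, value in wanted_selector.items()
        ):
            result.append(pod)
    return result
```

In the real controller this decision comes from Karpenter's scheduling simulation against each NodePool's requirements; the dict filter above only mirrors the observable behaviour described in this PR.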

NodePools (2):

- cpu-x86: c7i.8xlarge (prod) / c7i.4xlarge (default), max 30 nodes
- cpu-arm: c7g.8xlarge (prod) / c7g.4xlarge (default), max 30 nodes
- Each pool: 500GB gp3 root, AL2023, on-demand only
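
Karpenter NodePool limits are commonly expressed as resource totals (for example vCPUs), so a 30-node cap can be written as a CPU limit. The sketch below builds a NodePool manifest as a Python dict to show that arithmetic. It assumes the Karpenter `karpenter.sh/v1` API and a hypothetical helper name; the actual `karpenter.tf` in this PR may define the pools differently, and the consolidation policy shown is an assumption.

```python
# vCPU counts for the instance types named in this PR.
VCPUS = {"c7i.4xlarge": 16, "c7i.8xlarge": 32, "c7g.4xlarge": 16, "c7g.8xlarge": 32}


def build_nodepool(name, instance_type, arch, max_nodes=30):
    """Return a NodePool manifest (hypothetical helper, not from the PR)."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumes an EC2NodeClass with the same name as the pool.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": name,
                    },
                    "requirements": [
                        {"key": "node.kubernetes.io/instance-type",
                         "operator": "In", "values": [instance_type]},
                        {"key": "kubernetes.io/arch",
                         "operator": "In", "values": [arch]},
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": ["on-demand"]},
                    ],
                }
            },
            # 30 nodes expressed as a total vCPU limit.
            "limits": {"cpu": str(max_nodes * VCPUS[instance_type])},
            # Policy is an assumption; the PR only states "after 60s".
            "disruption": {
                "consolidationPolicy": "WhenEmptyOrUnderutilized",
                "consolidateAfter": "60s",
            },
        },
    }
```

For example, `build_nodepool("cpu-x86", "c7i.8xlarge", "amd64")` yields a CPU limit of 30 × 32 = 960 vCPUs, while the default c7i.4xlarge variant would be 480.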

Interruption handling:

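- EventBridge rules forward EC2 state-change events to SQS
- Karpenter drains nodes gracefully when an interruption event arrives (spot interruption warnings only matter if spot capacity is enabled later; the pools above are on-demand only)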
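
To make the EventBridge → SQS path concrete, here is a minimal sketch of the kind of event patterns usually routed to Karpenter's interruption queue, with a simplified matcher. Only the EC2 state-change pattern is stated in this PR; the other three follow Karpenter's published setup and are assumptions about what `karpenter.tf` configures. The function names are hypothetical.

```python
# Event patterns in EventBridge's JSON shape: each key lists allowed values.
INTERRUPTION_PATTERNS = [
    {"source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"]},
    # Assumed additional rules (not confirmed by this PR):
    {"source": ["aws.ec2"], "detail-type": ["EC2 Spot Instance Interruption Warning"]},
    {"source": ["aws.ec2"], "detail-type": ["EC2 Instance Rebalance Recommendation"]},
    {"source": ["aws.health"], "detail-type": ["AWS Health Event"]},
]


def matches(pattern, event):
    """Simplified matching: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def should_forward(event):
    """True if any rule would forward this event to the SQS queue."""
    return any(matches(pattern, event) for pattern in INTERRUPTION_PATTERNS)
```

Real EventBridge patterns support richer operators (prefix, numeric, nested `detail` matching); this matcher only covers exact top-level values.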

Files changed

| File | Change |
|---|---|
| `main.tf` | CPU types: `karpenter_managed=true`, `instance_count=0` |
| `eks.tf` | Filter Karpenter types from ASG creation (1 line: `if !try(gpu_config.karpenter_managed, false)`) |
| `karpenter.tf` | New: 459 lines (IAM, SQS, EventBridge, Helm, NodePools, EC2NodeClasses) |
| `availability_updater/index.py` | Handle Karpenter types (query K8s API for node count, not ASG) |
| `cli.py` + `reservations.py` | Show `"0 / 90"` scalable capacity and `"~1min (scaling up)"` |
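
The `eks.tf` filter and the `availability_updater` change can be sketched in Python as follows. This is an illustration under assumptions, not the code in this PR: the type names in the usage example are hypothetical, and counting by the `karpenter.sh/nodepool` node label assumes a Karpenter version that sets it.

```python
def split_by_manager(instance_types):
    """Python analogue of `if !try(gpu_config.karpenter_managed, false)`.

    Types without a karpenter_managed key stay ASG-managed.
    """
    asg_managed, karpenter_managed = {}, {}
    for name, config in instance_types.items():
        if config.get("karpenter_managed", False):
            karpenter_managed[name] = config
        else:
            asg_managed[name] = config
    return asg_managed, karpenter_managed


def count_nodes_by_pool(nodes):
    """Count nodes per NodePool from Kubernetes node objects given as dicts."""
    counts = {}
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        pool = labels.get("karpenter.sh/nodepool")
        if pool is not None:
            counts[pool] = counts.get(pool, 0) + 1
    return counts
```

For instance, `split_by_manager({"cpu-x86": {"karpenter_managed": True, "instance_count": 0}, "gpu-example": {"instance_count": 2}})` keeps `gpu-example` on the ASG path and routes `cpu-x86` to Karpenter.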

Test plan

- `terraform plan`: CPU ASGs removed, Karpenter resources created
- After `terraform apply`: 0 CPU nodes initially, management nodes running Karpenter
- Scale-up test: reserve CPU slot → pod pending → Karpenter provisions node in ~60s → pod schedules
- Scale-down test: cancel reservation → pod deleted → Karpenter consolidates idle node in ~60s
- `gpu-dev avail` shows `0 / 90` and `~1min (scaling up)` for CPU types
- GPU ASGs completely unaffected (no `karpenter_managed` field)
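
The expected `gpu-dev avail` output for Karpenter-managed types could be produced by something like the sketch below. The function name and the empty wait string for already-available capacity are hypothetical; only the `0 / 90` and `~1min (scaling up)` strings come from this PR.

```python
def format_cpu_row(available_nodes, max_nodes):
    """Return (capacity, wait) display strings for a Karpenter-managed type."""
    capacity = f"{available_nodes} / {max_nodes}"
    # With no ready node, a reservation waits for Karpenter to provision one.
    wait = "~1min (scaling up)" if available_nodes == 0 else ""
    return capacity, wait
```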

Why Karpenter?

This is the AWS-recommended approach for EKS autoscaling. It's:

- Faster: 60s vs 3min (no polling delay)
- Event-driven: Reacts to pod pending events immediately
- Less code: Replaces custom autoscaling logic with declarative CRDs
- Battle-tested: Used by thousands of EKS clusters in production
- Simpler: No custom instance protection, no ASG polling, no scale-up/down thresholds

The tradeoff is a bigger infrastructure change (Karpenter install, new IAM roles, EventBridge setup) vs the quick Lambda approach in PR #37. But for production, Karpenter is the right choice.

…ambda approach)

Replace fixed 60 CPU nodes with Karpenter-managed dynamic scaling (0-30 nodes per type). Karpenter provisions nodes on-demand when pods are pending (~60s) and consolidates when idle (60s). This is the industry-standard approach for EKS autoscaling.

Key changes:
- main.tf: CPU types set to karpenter_managed=true, instance_count=0
- eks.tf: Filter Karpenter types from ASG creation
- karpenter.tf: Full Karpenter setup (IAM, SQS, Helm, NodePools, EC2NodeClasses)
- availability_updater Lambda: Handle Karpenter types (query K8s directly, not ASG)
- CLI: Show "0 / 90" scalable capacity and "~1min (scaling up)" wait time

Architecture:
- Karpenter controller runs on management CPU nodes (c5.4xlarge ASG, 2 nodes)
- NodePool per architecture (cpu-x86, cpu-arm) with 30-node CPU limits
- EC2NodeClass defines AL2023, 500GB gp3, security groups, subnets
- SQS queue for spot/interruption handling (EventBridge → SQS → Karpenter)

Benefits vs custom Lambda approach:
- Faster scale-up (~60s vs ~3min ASG polling)
- Event-driven (reacts to pending pods immediately)
- Built-in consolidation with pod disruption budgets
- Less custom code to maintain (~100 lines removed)
- Industry standard for EKS

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

wdvr wants to merge 1 commit into `main` from `feat/karpenter-cpu-autoscaling`
