
feat: use Karpenter for CPU node autoscaling (industry standard)#40

Open
wdvr wants to merge 1 commit into main from feat/karpenter-cpu-autoscaling


@wdvr wdvr commented Mar 4, 2026

Summary

Replace fixed 60 CPU nodes with Karpenter-managed dynamic scaling (0-30 nodes per type). This is an alternative to PR #37's custom Lambda approach, using the industry-standard EKS autoscaling solution.

Comparison: Karpenter vs Custom Lambda

| Feature | Karpenter (this PR) | Custom Lambda (PR #37) |
| --- | --- | --- |
| Scale-up speed | ~60 seconds (event-driven) | ~3 minutes (polling) |
| Scale-down logic | Built-in consolidation with PDB | Custom instance protection code |
| Code to maintain | ~0 lines (declarative CRDs) | ~100 lines of Python |
| Industry adoption | AWS recommended, battle-tested | Custom implementation |
| Reaction trigger | Pod pending event (immediate) | ASG lifecycle hook (delayed) |

Architecture

Karpenter controller:

  • Runs on management CPU nodes (2x c5.4xlarge, managed by ASG)
  • Watches for pending pods with specific node selectors
  • Provisions nodes in ~60s when needed
  • Consolidates idle nodes after 60s

NodePools (2):

  • cpu-x86: c7i.8xlarge (prod) / c7i.4xlarge (default), max 30 nodes
  • cpu-arm: c7g.8xlarge (prod) / c7g.4xlarge (default), max 30 nodes
  • Each pool: 500GB gp3 root, AL2023, on-demand only
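Karpenter NodePool limits are commonly expressed as aggregate resources such as `cpu` rather than as a node count, so a 30-node cap has to be translated into a vCPU limit. The following is a minimal sketch of that translation, assuming the `karpenter.sh/v1` API. The helper name `node_pool_manifest`, the `nodeClassRef` name, and the `WhenEmpty` consolidation policy are illustrative assumptions and are not taken from karpenter.tf.

```python
# vCPU counts per instance type (public EC2 specifications).
VCPUS = {
    "c7i.8xlarge": 32,
    "c7i.4xlarge": 16,
    "c7g.8xlarge": 32,
    "c7g.4xlarge": 16,
}


def node_pool_manifest(name, instance_type, arch, max_nodes=30):
    """Hypothetical helper: build a NodePool manifest capped at max_nodes."""
    return {
        "apiVersion": "karpenter.sh/v1",
        "kind": "NodePool",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    # Assumption: the EC2NodeClass shares the pool's name.
                    "nodeClassRef": {
                        "group": "karpenter.k8s.aws",
                        "kind": "EC2NodeClass",
                        "name": name,
                    },
                    "requirements": [
                        {"key": "node.kubernetes.io/instance-type",
                         "operator": "In", "values": [instance_type]},
                        {"key": "kubernetes.io/arch",
                         "operator": "In", "values": [arch]},
                        {"key": "karpenter.sh/capacity-type",
                         "operator": "In", "values": ["on-demand"]},
                    ],
                }
            },
            # 30 nodes of one fixed type == 30 * vCPUs-per-node in total.
            "limits": {"cpu": str(max_nodes * VCPUS[instance_type])},
            # Assumption: remove nodes only once they are empty for 60s.
            "disruption": {
                "consolidationPolicy": "WhenEmpty",
                "consolidateAfter": "60s",
            },
        },
    }
```

For the prod x86 pool, `node_pool_manifest("cpu-x86", "c7i.8xlarge", "amd64")` yields a `cpu` limit of `"960"` (30 nodes at 32 vCPUs each). Pinning each pool to a single instance type is what makes the vCPU limit equivalent to a node cap.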

Interruption handling:

  • EventBridge rules forward EC2 state-change events to SQS
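  • Karpenter drains nodes gracefully when an interruption event arrives (the pools above are on-demand only, so spot interruption notices should not occur in practice)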
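To make the EventBridge to SQS forwarding concrete, here is a rough sketch of the event patterns such rules typically match and a simplified matcher. Only the EC2 state-change rule is confirmed by this PR. The other three patterns follow Karpenter's reference interruption setup and are assumptions here, as are the names `INTERRUPTION_PATTERNS`, `matches`, and `should_forward`.

```python
# Event patterns commonly forwarded to Karpenter's interruption queue.
# Only the state-change pattern is confirmed by this PR; the rest are assumed.
INTERRUPTION_PATTERNS = [
    {"source": ["aws.ec2"],
     "detail-type": ["EC2 Instance State-change Notification"]},
    {"source": ["aws.ec2"],
     "detail-type": ["EC2 Spot Instance Interruption Warning"]},
    {"source": ["aws.ec2"],
     "detail-type": ["EC2 Instance Rebalance Recommendation"]},
    {"source": ["aws.health"],
     "detail-type": ["AWS Health Event"]},
]


def matches(pattern, event):
    """Simplified EventBridge matching: each pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def should_forward(event):
    """Return True if any rule would send this event to the SQS queue."""
    return any(matches(pattern, event) for pattern in INTERRUPTION_PATTERNS)
```

Real EventBridge patterns support richer operators (prefix, numeric ranges, nested `detail` filters); this matcher only covers the exact-value case used above.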

Files changed

| File | Change |
| --- | --- |
| main.tf | CPU types: `karpenter_managed=true`, `instance_count=0` |
| eks.tf | Filter Karpenter types from ASG creation (1 line: `if !try(gpu_config.karpenter_managed, false)`) |
| karpenter.tf | New: 459 lines (IAM, SQS, EventBridge, Helm, NodePools, EC2NodeClasses) |
| availability_updater/index.py | Handle Karpenter types (query K8s API for node count, not ASG) |
| cli.py + reservations.py | Show "0 / 90" scalable capacity and "~1min (scaling up)" |
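The eks.tf filter and the availability_updater change listed above can be illustrated in Python. This is a sketch under stated assumptions, not the code in this PR: `split_by_manager` mirrors the Terraform `try(..., false)` default, and `count_ready_nodes` counts nodes from already-fetched Node objects using the `karpenter.sh/nodepool` label that Karpenter sets on the nodes it provisions. Both function names and the dict-based input shape are hypothetical.

```python
def split_by_manager(types):
    """Mirror the eks.tf filter: a missing karpenter_managed key means ASG-managed."""
    asg, karpenter = {}, {}
    for name, config in types.items():
        if config.get("karpenter_managed", False):
            karpenter[name] = config
        else:
            asg[name] = config
    return asg, karpenter


def count_ready_nodes(nodes, pool):
    """Count Ready nodes that belong to the given Karpenter NodePool.

    `nodes` is a list of Kubernetes Node objects as plain dicts, for example
    the `items` of a node list response (an assumption about the caller).
    """
    total = 0
    for node in nodes:
        labels = node.get("metadata", {}).get("labels", {})
        conditions = node.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if ready and labels.get("karpenter.sh/nodepool") == pool:
            total += 1
    return total
```

Counting from the Kubernetes API rather than from an ASG matters because Karpenter launches instances directly, so there is no ASG desired-capacity value to read for these types.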

Test plan

  • terraform plan — CPU ASGs removed, Karpenter resources created
  • After tf apply: 0 CPU nodes initially, management nodes running Karpenter
  • Scale-up test: reserve CPU slot → pod pending → Karpenter provisions node in ~60s → pod schedules
  • Scale-down test: cancel reservation → pod deleted → Karpenter consolidates idle node in ~60s
  • gpu-dev avail shows 0 / 90 and ~1min (scaling up) for CPU types
  • GPU ASGs completely unaffected (no karpenter_managed field)
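The expected `gpu-dev avail` output for CPU types could be produced by logic along these lines. This is a hedged sketch, not the code in cli.py: the function name `format_availability`, its parameters, and the two fallback strings are hypothetical. Only the "0 / 90" and "~1min (scaling up)" strings come from this PR.

```python
def format_availability(available, scalable_capacity, karpenter_managed):
    """Return (capacity, wait) display strings for one instance type."""
    capacity = f"{available} / {scalable_capacity}"
    if available > 0:
        wait = "available now"  # hypothetical wording
    elif karpenter_managed:
        # No nodes yet, but Karpenter can provision one on demand.
        wait = "~1min (scaling up)"
    else:
        wait = "queued"  # hypothetical wording for fixed-size ASG types
    return capacity, wait
```

With no nodes running, `format_availability(0, 90, True)` returns `("0 / 90", "~1min (scaling up)")`, which matches the strings named in the test plan.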

Why Karpenter?

This is the AWS-recommended approach for EKS autoscaling. It's:

  • Faster: 60s vs 3min (no polling delay)
  • Event-driven: Reacts to pod pending events immediately
  • Less code: Replaces custom autoscaling logic with declarative CRDs
  • Battle-tested: Used by thousands of EKS clusters in production
  • Simpler: No custom instance protection, no ASG polling, no scale-up/down thresholds

The tradeoff is a bigger infrastructure change (Karpenter install, new IAM roles, EventBridge setup) vs the quick Lambda approach in PR #37. But for production, Karpenter is the right choice.

…ambda approach)

Replace fixed 60 CPU nodes with Karpenter-managed dynamic scaling (0-30 nodes per type).
Karpenter provisions nodes on-demand when pods are pending (~60s) and consolidates
when idle (60s). This is the industry-standard approach for EKS autoscaling.

Key changes:
- main.tf: CPU types set to karpenter_managed=true, instance_count=0
- eks.tf: Filter Karpenter types from ASG creation
- karpenter.tf: Full Karpenter setup (IAM, SQS, Helm, NodePools, EC2NodeClasses)
- availability_updater Lambda: Handle Karpenter types (query K8s directly, not ASG)
- CLI: Show "0 / 90" scalable capacity and "~1min (scaling up)" wait time

Architecture:
- Karpenter controller runs on management CPU nodes (c5.4xlarge ASG, 2 nodes)
- NodePool per architecture (cpu-x86, cpu-arm) with 30-node CPU limits
- EC2NodeClass defines AL2023, 500GB gp3, security groups, subnets
- SQS queue for spot/interruption handling (EventBridge → SQS → Karpenter)

Benefits vs custom Lambda approach:
- Faster scale-up (~60s vs ~3min ASG polling)
- Event-driven (reacts to pending pods immediately)
- Built-in consolidation with pod disruption budgets
- Less custom code to maintain (~100 lines removed)
- Industry standard for EKS

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>