fix: queue reservations when no single node has enough GPUs + bin-pack pods by wdvr · Pull Request #35 · wdvr/osdc

wdvr · 2026-02-20T08:35:58Z

Summary

Fix GPU fragmentation scheduling bug: The availability check was summing GPUs across all nodes, so requesting 8 GPUs would see "12 available" (spread across 3 nodes as 4+3+5) and immediately try to create the pod. Since k8s requires all GPUs on a single node, the pod would get stuck Pending for 600s then fail. Now check_gpu_availability returns (total, max_per_node) and scheduling decisions use max_per_node. When no single node can fulfill the request, the reservation queues properly instead of creating an unschedulable pod.
Add pod affinity for bin-packing: Adds a soft pod affinity (weight 50) that prefers nodes already running gpu-dev pods. This packs smaller reservations onto the same nodes, keeping whole nodes free for large (8-GPU) requests. The profiling node preference (weight 100) still takes priority.
Better queue messages: When queued due to fragmentation, the message now says "Need 8 B200 GPUs on one node, max 5 available on any single node" instead of the misleading "only 12 available".
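For illustration, the soft pod affinity described above could be expressed as the pod-spec fragment below. This is a sketch, not the PR's actual code: the function name `build_binpack_affinity`, the label key `app`, and the label value `gpu-dev` are assumptions, since the PR text only says the affinity prefers nodes already running gpu-dev pods with weight 50. The profiling node preference (weight 100) is not shown.

```python
def build_binpack_affinity(label_key: str = "app", label_value: str = "gpu-dev") -> dict:
    """Return a Kubernetes `affinity` fragment with a soft (preferred) pod affinity.

    The label key and value are hypothetical; the real selector used for
    gpu-dev pods is not shown in the PR description.
    """
    return {
        "podAffinity": {
            # "preferred" makes this a soft rule: the scheduler favors matching
            # nodes but can still place the pod elsewhere.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 50,  # weight stated in the PR summary
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {label_key: label_value}},
                        # Co-locate per node, so the topology domain is the hostname.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }
```

Because the rule is a preference rather than a requirement, a small reservation can still land on an empty node when no node with existing gpu-dev pods has room.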

What was happening

User requests 8x B200 GPUs
→ Lambda checks: 4+3+5 = 12 B200 GPUs available ✓
→ Creates pod requesting 8 GPUs
→ k8s can't schedule: no single node has 8 free
→ Pod stuck Pending for 600s
→ Lambda times out → reservation marked "failed"

What happens now

User requests 8x B200 GPUs
→ Lambda checks: max on single node = 5, need 8
→ Queues reservation (position #1)
→ Scheduled Lambda retries when a node frees up
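The before/after flows can be sketched in a few lines. The PR states only that check_gpu_availability returns (total, max_per_node); the argument shape used here (a dict of free GPUs per node), the helper `decide`, and the action strings are assumptions made for this example.

```python
from typing import Dict, Tuple


def check_gpu_availability(free_by_node: Dict[str, int]) -> Tuple[int, int]:
    """Return (total, max_per_node) for free GPUs.

    Hypothetical signature: the real function presumably queries the cluster;
    here the per-node free counts are passed in directly.
    """
    if not free_by_node:
        return 0, 0
    counts = list(free_by_node.values())
    return sum(counts), max(counts)


def decide(requested: int, gpu_type: str, free_by_node: Dict[str, int]) -> Tuple[str, str]:
    """Return (action, message), deciding on max_per_node rather than total."""
    total, max_per_node = check_gpu_availability(free_by_node)
    if requested <= max_per_node:
        # At least one node can hold the whole request.
        return "create_pod", f"{requested} {gpu_type} GPUs fit on a single node"
    # No single node fits the request, even if the cluster-wide total would.
    return "queue", (
        f"Need {requested} {gpu_type} GPUs on one node, "
        f"max {max_per_node} available on any single node"
    )


# The 4+3+5 example from the PR: 12 free in total, but at most 5 on one node.
nodes = {"node-a": 4, "node-b": 3, "node-c": 5}
print(check_gpu_availability(nodes))
print(decide(8, "B200", nodes))
print(decide(4, "B200", nodes))
```

With the old behavior, the comparison would have been against the total (12), which is why the 8-GPU request was wrongly treated as schedulable.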

Test plan

Verify 8-GPU B200 reservation queues when no single node has 8 free (instead of failing after 600s timeout)
Verify small reservations (1-4 GPUs) still schedule immediately when capacity exists
Verify bin-packing: new 1-GPU reservation lands on a node that already has pods (not an empty node)
Verify queued reservations are picked up when a node fully frees up

🤖 Generated with Claude Code

…k pods

The availability check was summing GPUs across all nodes, so a request for 8 GPUs would see "12 available" (4+3+5 across 3 nodes) and try to create the pod. But k8s needs all GPUs on one node, so the pod would get stuck in Pending for 600s then fail.

Now check_gpu_availability returns (total, max_per_node) and scheduling decisions use max_per_node. When no single node can fulfill the request, the reservation queues properly instead of creating an unschedulable pod.

Also adds pod affinity (weight=50) to prefer nodes already running gpu-dev pods. This bin-packs smaller reservations onto the same nodes, keeping whole nodes free for large (8-GPU) requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…lots

On-demand ASG launch templates had no capacity_reservation_specification, which defaults to AWS "open" behavior - auto-matching instances to any available targeted CR in the same AZ with the same instance type. This caused the b200-cr2 (on-demand) instance to consume a slot in cr-08e7fee0b8dc3de5e (cr1), preventing the cr1 ASG from launching its 3rd instance.

Now on-demand launch templates explicitly set capacity_reservation_preference = "none".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- cr1: 3→2 instances (use 2 of 3 CR slots, freeing 1 for other team; cr1 already only has 2 running so no instances terminated)
- cr2: removed (kills the cordoned on-demand node that was occupying a CR slot due to the old index-shift bug)

Only the cordoned node gets terminated.
End state: 3 B200 nodes (24 GPUs) = cr0(1) + cr1(2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

wdvr and others added 3 commits February 20, 2026 00:35

wdvr force-pushed the fix/queue-on-fragmented-gpu-availability branch from e212dcd to 7bd2604 Compare February 20, 2026 09:21

wdvr wants to merge 3 commits into main from fix/queue-on-fragmented-gpu-availability
