fix: queue reservations when no single node has enough GPUs + bin-pack pods #35

Open
wdvr wants to merge 3 commits into main from fix/queue-on-fragmented-gpu-availability

Owner

@wdvr wdvr commented Feb 20, 2026

Summary

  • Fix GPU fragmentation scheduling bug: The availability check summed free GPUs across all nodes, so a request for 8 GPUs would see "12 available" (spread across 3 nodes as 4+3+5) and immediately try to create the pod. Because k8s must place all of a pod's GPUs on a single node, the pod would sit in Pending for 600s and then fail. check_gpu_availability now returns (total, max_per_node), and scheduling decisions use max_per_node. When no single node can fulfill the request, the reservation is queued instead of creating an unschedulable pod.

  • Add pod affinity for bin-packing: A soft pod affinity (weight 50) now prefers nodes already running gpu-dev pods. This packs smaller reservations onto the same nodes, keeping whole nodes free for large (8-GPU) requests. The profiling node preference (weight 100) still takes priority because of its higher weight.

  • Better queue messages: When queued due to fragmentation, the message now says "Need 8 B200 GPUs on one node, max 5 available on any single node" instead of the misleading "only 12 available".
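For reference, a soft pod affinity of this kind can be sketched as the `affinity` section of a pod spec, written here as a plain Python dict. This is only an illustration, not the PR's actual code: the `app: gpu-dev` label and the `build_pod_affinity` helper are assumptions, and the profiling node preference (weight 100) is not shown.

```python
# Assumed label carried by gpu-dev pods; the real label is not shown in this PR.
GPU_DEV_POD_LABELS = {"app": "gpu-dev"}


def build_pod_affinity(weight: int = 50) -> dict:
    """Return a pod-spec `affinity` dict with a soft (preferred) pod affinity.

    "Preferred" terms only influence node scoring. A pod can still land on
    an empty node if no node with gpu-dev pods has room.
    """
    return {
        "podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        # Match pods that carry the (assumed) gpu-dev label.
                        "labelSelector": {"matchLabels": GPU_DEV_POD_LABELS},
                        # Co-locate at node granularity.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


affinity = build_pod_affinity()
```

Because the affinity is preferred rather than required, it cannot make a pod unschedulable; it only biases placement toward nodes that are already partially used.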

What was happening

User requests 8x B200 GPUs
→ Lambda checks: 4+3+5 = 12 B200 GPUs available ✓
→ Creates pod requesting 8 GPUs
→ k8s can't schedule: no single node has 8 free
→ Pod stuck Pending for 600s
→ Lambda times out → reservation marked "failed"
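The gap between the two checks can be illustrated with a minimal sketch. The function name and the `(total, max_per_node)` return shape come from this PR; the input format (a mapping of node name to free GPU count) and the node names are assumptions made for the example.

```python
from typing import Mapping, Tuple


def check_gpu_availability(free_gpus_by_node: Mapping[str, int]) -> Tuple[int, int]:
    """Return (total free GPUs, largest free count on any single node).

    The input mapping is a simplification for illustration; the real
    function presumably queries the cluster.
    """
    free = [max(0, n) for n in free_gpus_by_node.values()]
    return sum(free), max(free, default=0)


# The fragmented cluster from the example above: 4 + 3 + 5 free GPUs.
nodes = {"node-a": 4, "node-b": 3, "node-c": 5}
total, max_per_node = check_gpu_availability(nodes)

requested = 8
old_check_passes = requested <= total         # True: 8 <= 12, pod gets created
new_check_passes = requested <= max_per_node  # False: 8 > 5, reservation queues
```

The old check passed because it compared the request against `total`, which is why the pod was created even though no node could host it.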

What happens now

User requests 8x B200 GPUs
→ Lambda checks: max on single node = 5, need 8
→ Queues reservation (position #1)
→ Scheduled Lambda retries when a node frees up
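The new decision can be sketched roughly as follows. The fragmentation message text is taken from this PR; the `Decision` type, the `decide` helper, and the wording of the other two messages are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str   # "schedule" or "queue"
    message: str


def decide(requested: int, gpu_type: str, total: int, max_per_node: int) -> Decision:
    """Pick an action from the (total, max_per_node) availability pair."""
    if requested <= max_per_node:
        # At least one node can host the whole pod.
        return Decision("schedule", f"Scheduling {requested} {gpu_type} GPUs")
    if requested <= total:
        # Enough GPUs exist cluster-wide, but they are fragmented across nodes.
        return Decision(
            "queue",
            f"Need {requested} {gpu_type} GPUs on one node, "
            f"max {max_per_node} available on any single node",
        )
    # Assumed wording for the plain "not enough capacity" case.
    return Decision("queue", f"Need {requested} {gpu_type} GPUs, only {total} available")


decision = decide(requested=8, gpu_type="B200", total=12, max_per_node=5)
```

With the 4+3+5 cluster, this queues the 8-GPU request with the fragmentation message, while a 4-GPU request would still schedule immediately.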

Test plan

  • Verify 8-GPU B200 reservation queues when no single node has 8 free (instead of failing after 600s timeout)
  • Verify small reservations (1-4 GPUs) still schedule immediately when capacity exists
  • Verify bin-packing: new 1-GPU reservation lands on a node that already has pods (not an empty node)
  • Verify queued reservations are picked up when a node fully frees up

🤖 Generated with Claude Code

wdvr and others added 3 commits February 20, 2026 00:35
…k pods

The availability check was summing GPUs across all nodes, so a request
for 8 GPUs would see "12 available" (4+3+5 across 3 nodes) and try to
create the pod. But k8s needs all GPUs on one node, so the pod would
get stuck in Pending for 600s then fail.

Now check_gpu_availability returns (total, max_per_node) and scheduling
decisions use max_per_node. When no single node can fulfill the request,
the reservation queues properly instead of creating an unschedulable pod.

Also adds pod affinity (weight=50) to prefer nodes already running
gpu-dev pods. This bin-packs smaller reservations onto the same nodes,
keeping whole nodes free for large (8-GPU) requests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…lots

On-demand ASG launch templates had no capacity_reservation_specification,
which defaults to AWS "open" behavior - auto-matching instances to any
available targeted CR in the same AZ with the same instance type.

This caused the b200-cr2 (on-demand) instance to consume a slot in
cr-08e7fee0b8dc3de5e (cr1), preventing the cr1 ASG from launching its
3rd instance. Now on-demand launch templates explicitly set
capacity_reservation_preference = "none".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- cr1: 3→2 instances (use 2 of 3 CR slots, freeing 1 for other team;
  cr1 already only has 2 running so no instances terminated)
- cr2: removed (kills the cordoned on-demand node that was occupying
  a CR slot due to the old index-shift bug)

Only the cordoned node gets terminated.
End state: 3 B200 nodes (24 GPUs) = cr0(1) + cr1(2)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wdvr wdvr force-pushed the fix/queue-on-fragmented-gpu-availability branch from e212dcd to 7bd2604 Compare February 20, 2026 09:21