fix: queue reservations when no single node has enough GPUs + bin-pack pods #35
Open
wdvr wants to merge 3 commits into
…k pods
The availability check was summing GPUs across all nodes, so a request for 8 GPUs would see "12 available" (4+3+5 across 3 nodes) and try to create the pod. But k8s needs all GPUs on one node, so the pod would get stuck in Pending for 600s, then fail.
Now check_gpu_availability returns (total, max_per_node) and scheduling decisions use max_per_node. When no single node can fulfill the request, the reservation queues properly instead of creating an unschedulable pod.
Also adds pod affinity (weight=50) to prefer nodes already running gpu-dev pods. This bin-packs smaller reservations onto the same nodes, keeping whole nodes free for large (8-GPU) requests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
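For illustration, the soft pod affinity described in this commit could look roughly like the following when built as a plain Python dict in the Kubernetes `affinity` schema. This is a sketch, not the PR's code: the helper name `bin_pack_affinity` and the `app=gpu-dev` label are assumptions. Only the weight of 50 and the "prefer nodes already running gpu-dev pods" intent come from the commit message.

```python
from typing import Any, Dict


def bin_pack_affinity(
    label_key: str = "app",        # hypothetical label key
    label_value: str = "gpu-dev",  # hypothetical label value
    weight: int = 50,              # weight stated in the commit message
) -> Dict[str, Any]:
    """Build a soft (preferred) pod affinity toward nodes running matching pods."""
    return {
        "podAffinity": {
            # "preferred" makes this a scoring hint, not a hard requirement,
            # so the pod can still land on an empty node if needed.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchLabels": {label_key: label_value}
                        },
                        # Co-locate at node granularity.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


affinity = bin_pack_affinity()
```

Because the term is preferred rather than required, a higher-weight preference (such as the weight-100 profiling node preference mentioned in the summary) can still outweigh it during scoring.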
…lots
On-demand ASG launch templates had no capacity_reservation_specification, which defaults to AWS "open" behavior: auto-matching instances to any available targeted CR in the same AZ with the same instance type. This caused the b200-cr2 (on-demand) instance to consume a slot in cr-08e7fee0b8dc3de5e (cr1), preventing the cr1 ASG from launching its 3rd instance.
Now on-demand launch templates explicitly set capacity_reservation_preference = "none".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- cr1: 3→2 instances (use 2 of 3 CR slots, freeing 1 for other team; cr1 already only has 2 running so no instances terminated)
- cr2: removed (kills the cordoned on-demand node that was occupying a CR slot due to the old index-shift bug)
Only the cordoned node gets terminated. End state: 3 B200 nodes (24 GPUs) = cr0(1) + cr1(2)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e212dcd to 7bd2604
Summary
- Fix GPU fragmentation scheduling bug: The availability check was summing GPUs across all nodes, so requesting 8 GPUs would see "12 available" (spread across 3 nodes as 4+3+5) and immediately try to create the pod. Since k8s requires all GPUs on a single node, the pod would get stuck `Pending` for 600s, then fail. Now `check_gpu_availability` returns `(total, max_per_node)` and scheduling decisions use `max_per_node`. When no single node can fulfill the request, the reservation queues properly instead of creating an unschedulable pod.
- Add pod affinity for bin-packing: Adds a soft pod affinity (weight 50) that prefers nodes already running gpu-dev pods. This packs smaller reservations onto the same nodes, keeping whole nodes free for large (8-GPU) requests. The profiling node preference (weight 100) still takes priority.
- Better queue messages: When queued due to fragmentation, the message now says `"Need 8 B200 GPUs on one node, max 5 available on any single node"` instead of the misleading `"only 12 available"`.
What was happening
What happens now
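A minimal sketch of the new decision, using the 4+3+5 example from the summary. The function names, the dict-of-free-GPUs input, and the node names are assumptions made for illustration; only the `(total, max_per_node)` return shape and the message wording are taken from the PR text.

```python
from typing import Dict, Tuple


def gpu_availability(free_gpus_by_node: Dict[str, int]) -> Tuple[int, int]:
    """Return (total, max_per_node) free GPUs across the given nodes."""
    total = sum(free_gpus_by_node.values())
    # default=0 covers the case where no nodes are reported at all.
    max_per_node = max(free_gpus_by_node.values(), default=0)
    return total, max_per_node


def should_queue(requested: int, max_per_node: int) -> bool:
    """Queue when no single node can host the whole request."""
    return requested > max_per_node


def queue_message(requested: int, gpu_type: str, max_per_node: int) -> str:
    """Explain the per-node shortfall instead of quoting the cluster total."""
    return (
        f"Need {requested} {gpu_type} GPUs on one node, "
        f"max {max_per_node} available on any single node"
    )


# Hypothetical cluster state matching the example in the summary.
free = {"node-a": 4, "node-b": 3, "node-c": 5}
total, max_per_node = gpu_availability(free)

if should_queue(8, max_per_node):
    message = queue_message(8, "B200", max_per_node)
```

Here the cluster total is 12, which would have passed the old summed check, but the largest single node has only 5 free GPUs, so an 8-GPU request is queued rather than turned into a pod that cannot be scheduled.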
Test plan
🤖 Generated with Claude Code