fix: persistent disk cleanup and fallback behavior (⚠️ UNTESTED)#41
Open
wdvr wants to merge 1 commit into
Open
fix: persistent disk cleanup and fallback behavior (⚠️ UNTESTED)#41wdvr wants to merge 1 commit into
wdvr wants to merge 1 commit into
Conversation
⚠️ WARNING: These changes are UNTESTED and need validation before merging. Fix 1: Force-detach EBS volumes before deletion (reservation_expiry) - Problem: Volume deletion fails with "VolumeInUse" if still attached - Solution: Detach volume, wait for "available" state, then delete - Impact: Prevents orphaned volumes when cleanup fails Fix 2: Fail reservation when user explicitly requests persistent disk (reservation_processor) - Problem: If user requested specific disk_name and it fails, silently fell back to EmptyDir - Solution: Fail reservation if disk_name is set OR if error is "disk in use" - Impact: Clear error messages instead of silent EmptyDir fallback Changes: - reservation_expiry/index.py: Add detach logic before delete (15 lines) - reservation_processor/index.py: Check disk_name before fallback (8 lines) Test plan needed: - [ ] Test volume cleanup after expiry with snapshot - [ ] Test explicit disk_name request with unavailable disk - [ ] Test backward compatibility for old-style reservations without disk_name - [ ] Verify no orphaned volumes remain after cleanup failures
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
These changes fix real bugs but have NOT been tested in production. Review carefully and test before merging.
Fix 1: Force-detach EBS volumes before deletion
File:
lambda/reservation_expiry/index.py(+15 lines)Problem: When cleaning up expired reservations with persistent disks, volume deletion fails with
VolumeInUseerror if the volume is still attached to an instance.Solution:
Impact: Prevents orphaned volumes that accumulate costs when cleanup fails.
Fix 2: Don't silently fall back to EmptyDir when user requests persistent disk
File:
lambda/reservation_processor/index.py(±8 lines)Problem: When a user explicitly requests a persistent disk via
disk_nameparameter, if that disk setup fails (e.g., disk doesn't exist, disk in use), the code would silently fall back to EmptyDir instead of failing the reservation.Solution:
disk_nameis set by user → fail the reservation with clear error messageImpact: Users get clear error messages when their disk request fails, instead of silently getting EmptyDir when they expected persistent storage.
Test Plan (REQUIRED before merge)
disk_namefor non-existent disk, verify reservation fails with clear error (not EmptyDir fallback)disk_nameparameter should still fall back to EmptyDir on disk errorsFiles Changed
terraform-gpu-devservers/lambda/reservation_expiry/index.py: Add detach-before-delete logicterraform-gpu-devservers/lambda/reservation_processor/index.py: Check disk_name before fallbackDo NOT merge until tested!