fix: persistent disk cleanup and fallback behavior (⚠️ UNTESTED) by wdvr · Pull Request #41 · wdvr/osdc

wdvr · 2026-03-04T06:59:23Z

⚠️ WARNING: UNTESTED CHANGES

These changes fix real bugs but have NOT been tested in production. Review carefully and test before merging.

Fix 1: Force-detach EBS volumes before deletion

File: lambda/reservation_expiry/index.py (+15 lines)

Problem: When cleaning up an expired reservation that has a persistent disk, volume deletion fails with a VolumeInUse error if the volume is still attached to an instance.

Solution:

Force-detach the volume before deletion
Wait for volume to reach "available" state (up to 2 minutes)
Then delete the volume

Impact: Prevents orphaned volumes that accumulate costs when cleanup fails.
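
For reviewers, the detach, wait, delete sequence described above can be sketched roughly as follows. This is not the code in the diff: the function name, the polling interval, and the injectable `sleep` are assumptions made for illustration. `ec2` is assumed to be a boto3 EC2 client, and only the standard `describe_volumes`, `detach_volume`, and `delete_volume` calls are used.

```python
import time


def detach_and_delete_volume(ec2, volume_id, timeout_s=120, poll_s=5, sleep=time.sleep):
    """Force-detach volume_id if attached, wait for 'available', then delete it.

    ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns True if the volume was deleted, False if it never became
    available within timeout_s (hypothetical helper, not the PR's code).
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]

    if volume["State"] == "in-use":
        # Force=True skips a clean unmount; tolerable here only because the
        # reservation has already expired.
        ec2.detach_volume(VolumeId=volume_id, Force=True)

    waited = 0
    while volume["State"] != "available":
        if waited >= timeout_s:
            # Give up without deleting so the caller can log and retry later.
            return False
        sleep(poll_s)
        waited += poll_s
        volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]

    ec2.delete_volume(VolumeId=volume_id)
    return True
```

Returning False on timeout, rather than raising, lets the expiry handler record the volume ID for a later retry. boto3 also ships a `volume_available` waiter that could replace the manual loop; explicit polling is used here only to keep the sketch easy to exercise with a stub client.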

Fix 2: Don't silently fall back to EmptyDir when user requests persistent disk

File: lambda/reservation_processor/index.py (±8 lines)

Problem: When a user explicitly requested a persistent disk via the disk_name parameter and disk setup failed (e.g., the disk doesn't exist or is in use), the code silently fell back to EmptyDir instead of failing the reservation.

Solution:

If disk_name is set by user → fail the reservation with clear error message
If error is "disk in use" → fail the reservation (regardless of disk_name)
Only fall back to EmptyDir for old-style reservations without explicit disk_name (backward compatibility)

Impact: Users get clear error messages when their disk request fails, instead of silently getting EmptyDir when they expected persistent storage.
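
The three rules above reduce to a small decision function. The sketch below is illustrative only: `DiskFailureAction`, `decide_on_disk_failure`, and the `disk_in_use` flag are hypothetical names, and how the processor actually detects a "disk in use" error is not shown in this description.

```python
from enum import Enum


class DiskFailureAction(Enum):
    FAIL_RESERVATION = "fail_reservation"
    FALLBACK_EMPTYDIR = "fallback_emptydir"


def decide_on_disk_failure(disk_name, disk_in_use):
    """Choose what to do after persistent-disk setup has failed.

    disk_name:   the user-supplied disk_name, or None/"" for old-style
                 reservations that never set it.
    disk_in_use: True if the failure was a "disk in use" error (assumed to
                 be classified by the caller).
    """
    if disk_name:
        # The user asked for a specific disk; never swap in EmptyDir silently.
        return DiskFailureAction.FAIL_RESERVATION
    if disk_in_use:
        # Fails regardless of whether disk_name was set.
        return DiskFailureAction.FAIL_RESERVATION
    # Old-style reservation without disk_name: keep the legacy fallback.
    return DiskFailureAction.FALLBACK_EMPTYDIR
```

Keeping the decision in a pure function like this would make the backward-compatibility case easy to unit test without touching AWS or Kubernetes.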

Test Plan (REQUIRED before merge)

Test volume cleanup: Create reservation with persistent disk, let it expire, verify volume is detached and deleted successfully
Test snapshot flow: Create reservation with persistent disk, let it expire with snapshot requested, verify snapshot completes and volume is cleaned up
Test explicit disk_name: Request reservation with disk_name for non-existent disk, verify reservation fails with clear error (not EmptyDir fallback)
Test disk in use: Request reservation with disk that's already attached, verify reservation fails with clear error
Test backward compatibility: Old-style reservation without disk_name parameter should still fall back to EmptyDir on disk errors other than "disk in use"
Monitor for orphaned volumes: Check AWS console for any EBS volumes not cleaned up properly
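
As an alternative to checking the console by hand, the orphaned-volume check could be scripted along these lines. This is a sketch under assumptions: `find_unattached_volumes` and its `tag_key` parameter are hypothetical, and `ec2` is assumed to be a boto3 EC2 client. An unattached volume is only a candidate for being orphaned, so the output still needs to be compared against active reservations.

```python
def find_unattached_volumes(ec2, tag_key=None):
    """List EBS volumes in the 'available' (unattached) state.

    ec2 is assumed to be a boto3 EC2 client. tag_key optionally narrows the
    search to volumes carrying that tag key; which tag this project puts on
    its volumes is not stated in the PR, so the caller must supply it.
    """
    filters = [{"Name": "status", "Values": ["available"]}]
    if tag_key:
        filters.append({"Name": "tag-key", "Values": [tag_key]})

    found = []
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=filters):
        for vol in page["Volumes"]:
            found.append(
                {
                    "volume_id": vol["VolumeId"],
                    "size_gib": vol["Size"],
                    "created": vol["CreateTime"],
                }
            )
    return found
```

Running this before and after an expiry test and comparing the two results would show whether the cleanup left anything behind.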

Files Changed

terraform-gpu-devservers/lambda/reservation_expiry/index.py: Add detach-before-delete logic
terraform-gpu-devservers/lambda/reservation_processor/index.py: Check disk_name before fallback

Do NOT merge until tested!

⚠️ WARNING: These changes are UNTESTED and need validation before merging.

Fix 1: Force-detach EBS volumes before deletion (reservation_expiry)
- Problem: Volume deletion fails with "VolumeInUse" if still attached
- Solution: Detach volume, wait for "available" state, then delete
- Impact: Prevents orphaned volumes when cleanup fails

Fix 2: Fail reservation when user explicitly requests persistent disk (reservation_processor)
- Problem: If user requested specific disk_name and it fails, silently fell back to EmptyDir
- Solution: Fail reservation if disk_name is set OR if error is "disk in use"
- Impact: Clear error messages instead of silent EmptyDir fallback

Changes:
- reservation_expiry/index.py: Add detach logic before delete (15 lines)
- reservation_processor/index.py: Check disk_name before fallback (8 lines)

Test plan needed:
- [ ] Test volume cleanup after expiry with snapshot
- [ ] Test explicit disk_name request with unavailable disk
- [ ] Test backward compatibility for old-style reservations without disk_name
- [ ] Verify no orphaned volumes remain after cleanup failures

wdvr wants to merge 1 commit into main from fix/disk-cleanup-and-fallback
