fix: persistent disk cleanup and fallback behavior (⚠️ UNTESTED) #41

Open
wdvr wants to merge 1 commit into main from fix/disk-cleanup-and-fallback

Owner

@wdvr wdvr commented Mar 4, 2026

⚠️ WARNING: UNTESTED CHANGES

These changes fix real bugs but have NOT been tested in production. Review carefully and test before merging.


Fix 1: Force-detach EBS volumes before deletion

File: lambda/reservation_expiry/index.py (+15 lines)

Problem: When cleaning up expired reservations that use persistent disks, volume deletion fails with a VolumeInUse error if the volume is still attached to an instance.

Solution:

  • Force-detach the volume before deletion
  • Wait for volume to reach "available" state (up to 2 minutes)
  • Then delete the volume

Impact: Prevents orphaned volumes that accumulate costs when cleanup fails.
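
The detach, wait, delete sequence above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name, the injected `ec2` client, and the `poll_s` and `sleep` parameters are assumptions made for the example. The client is expected to behave like a boto3 EC2 client (`describe_volumes`, `detach_volume` with `Force=True`, `delete_volume`), and the 120-second default mirrors the "up to 2 minutes" wait described above.

```python
import time


def detach_and_delete_volume(ec2, volume_id, timeout_s=120, poll_s=5, sleep=time.sleep):
    """Force-detach a volume if needed, wait for 'available', then delete it.

    `ec2` is assumed to be a boto3 EC2 client (or an object with the same
    describe_volumes / detach_volume / delete_volume methods).
    Returns the number of seconds spent waiting.
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]

    # Only detach when the volume is actually attached.
    if volume["State"] == "in-use":
        ec2.detach_volume(VolumeId=volume_id, Force=True)

    # Poll until the volume reports 'available' or the timeout is reached.
    waited = 0
    while volume["State"] != "available":
        if waited >= timeout_s:
            raise TimeoutError(
                f"Volume {volume_id} still '{volume['State']}' after {waited}s"
            )
        sleep(poll_s)
        waited += poll_s
        volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]

    ec2.delete_volume(VolumeId=volume_id)
    return waited
```

Raising on timeout rather than attempting the delete anyway keeps the failure visible, so the caller can log the volume ID for later cleanup. A force-detach can lose data that the instance has not yet flushed, so it is best reserved for reservations that have already expired.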


Fix 2: Don't silently fall back to EmptyDir when user requests persistent disk

File: lambda/reservation_processor/index.py (±8 lines)

Problem: When a user explicitly requests a persistent disk via the disk_name parameter and disk setup fails (e.g., the disk doesn't exist or is in use), the code silently falls back to EmptyDir instead of failing the reservation.

Solution:

  • If disk_name is set by the user → fail the reservation with a clear error message
  • If error is "disk in use" → fail the reservation (regardless of disk_name)
  • Only fall back to EmptyDir for old-style reservations without explicit disk_name (backward compatibility)

Impact: Users get clear error messages when their disk request fails, instead of silently getting EmptyDir when they expected persistent storage.
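
The three rules above amount to a small decision function. The sketch below is illustrative only: `DiskSetupError`, its `in_use` flag, `ReservationFailed`, and `handle_disk_setup_failure` are hypothetical names, and how the real processor detects a "disk in use" error is not shown in this description.

```python
class DiskSetupError(Exception):
    """Hypothetical error raised when persistent disk setup fails."""

    def __init__(self, message, in_use=False):
        super().__init__(message)
        self.in_use = in_use  # assumed flag for the "disk in use" case


class ReservationFailed(Exception):
    """Hypothetical signal that the reservation should be marked failed."""


def handle_disk_setup_failure(disk_name, error):
    """Decide what to do after persistent disk setup has failed.

    Returns "emptydir" when falling back is acceptable; otherwise raises
    ReservationFailed with a message meant for the user.
    """
    # Rule 1: an explicit disk_name means the user expects persistent storage.
    if disk_name:
        raise ReservationFailed(
            f"Persistent disk '{disk_name}' could not be set up: {error}"
        ) from error

    # Rule 2: a disk that is in use always fails, with or without disk_name.
    if error.in_use:
        raise ReservationFailed(f"Persistent disk is in use: {error}") from error

    # Rule 3: old-style reservations keep the EmptyDir fallback.
    return "emptydir"
```

In this sketch an empty string is treated the same as a missing `disk_name`, so such a request takes the backward-compatible path.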


Test Plan (REQUIRED before merge)

  • Test volume cleanup: Create reservation with persistent disk, let it expire, verify volume is detached and deleted successfully
  • Test snapshot flow: Create reservation with persistent disk, let it expire with snapshot requested, verify snapshot completes and volume is cleaned up
  • Test explicit disk_name: Request reservation with disk_name for non-existent disk, verify reservation fails with clear error (not EmptyDir fallback)
  • Test disk in use: Request reservation with disk that's already attached, verify reservation fails with clear error
  • Test backward compatibility: Old-style reservation without disk_name parameter should still fall back to EmptyDir on disk errors
  • Monitor for orphaned volumes: Check AWS console for any EBS volumes not cleaned up properly
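
For the orphaned-volume check, a script can stand in for manual inspection of the AWS console. This is a rough sketch under assumptions: `find_unattached_volumes` and the optional `tag_key` argument are made up for the example, and `ec2` is expected to behave like a boto3 EC2 client. The `status` and `tag-key` filters are standard `describe_volumes` filters.

```python
def find_unattached_volumes(ec2, tag_key=None):
    """List EBS volumes in the 'available' (unattached) state.

    `ec2` is assumed to be a boto3 EC2 client. `tag_key` optionally narrows
    the search to volumes carrying that tag key.
    Returns a list of (volume_id, size_gib, create_time) tuples.
    """
    filters = [{"Name": "status", "Values": ["available"]}]
    if tag_key:
        filters.append({"Name": "tag-key", "Values": [tag_key]})

    found = []
    # Paginate so accounts with many volumes are fully covered.
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=filters):
        for vol in page["Volumes"]:
            found.append((vol["VolumeId"], vol["Size"], vol["CreateTime"]))
    return found
```

An unattached volume is not necessarily orphaned, since persistent disks that users intend to keep are also unattached between reservations. The result is a list of candidates to cross-check against known disks, not a deletion list.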

Files Changed

  • terraform-gpu-devservers/lambda/reservation_expiry/index.py: Add detach-before-delete logic
  • terraform-gpu-devservers/lambda/reservation_processor/index.py: Check disk_name before fallback

Do NOT merge until tested!

⚠️ WARNING: These changes are UNTESTED and need validation before merging.

Fix 1: Force-detach EBS volumes before deletion (reservation_expiry)
- Problem: Volume deletion fails with "VolumeInUse" if still attached
- Solution: Detach volume, wait for "available" state, then delete
- Impact: Prevents orphaned volumes when cleanup fails

Fix 2: Fail reservation when user explicitly requests persistent disk (reservation_processor)
- Problem: If user requested specific disk_name and it fails, silently fell back to EmptyDir
- Solution: Fail reservation if disk_name is set OR if error is "disk in use"
- Impact: Clear error messages instead of silent EmptyDir fallback

Changes:
- reservation_expiry/index.py: Add detach logic before delete (15 lines)
- reservation_processor/index.py: Check disk_name before fallback (8 lines)

Test plan needed:
- [ ] Test volume cleanup after expiry with snapshot
- [ ] Test explicit disk_name request with unavailable disk
- [ ] Test backward compatibility for old-style reservations without disk_name
- [ ] Verify no orphaned volumes remain after cleanup failures