fix: Lambda timeout safety — deadline awareness and cleanup reordering #53

Open
wdvr wants to merge 1 commit into main from fix/lambda-timeout-improvements

wdvr commented Mar 8, 2026

Summary

The expiry Lambda's cleanup_pod() function had no awareness of its remaining execution time and could get killed by AWS mid-operation, leaving DynamoDB in an inconsistent state (stuck disks, locked reservations).

Issues Fixed

  • Snapshot waiter exceeded Lambda timeout: The waiter was configured to wait up to 30 minutes (MaxAttempts=120 at 15s intervals), while the Lambda timeout is only 15 minutes. Any large-disk snapshot that outlasted the Lambda's remaining time was guaranteed to hit the timeout.
  • Disk state updated too late: mark_disk_not_in_use() only ran after snapshot completion and volume deletion. If the Lambda timed out during the snapshot wait, the disk stayed locked indefinitely.
  • No internal deadline tracking: context.get_remaining_time_in_millis() was available but never used, so there was no graceful degradation when time was running out.
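
To make the first issue concrete, here is the arithmetic as a small Python sketch. The constant and function names are illustrative and not taken from the codebase; only the values (120 attempts, 15s delay, 900s timeout) come from this PR.

```python
# Values quoted in this PR; the names are hypothetical.
WAITER_DELAY_S = 15         # seconds between snapshot status polls
WAITER_MAX_ATTEMPTS = 120   # hardcoded attempt count before this change
LAMBDA_TIMEOUT_S = 900      # Terraform-configured Lambda timeout (15 min)


def waiter_budget_seconds(delay_s: int, max_attempts: int) -> int:
    """Worst-case time a polling waiter can block."""
    return delay_s * max_attempts


budget = waiter_budget_seconds(WAITER_DELAY_S, WAITER_MAX_ATTEMPTS)
overrun = budget - LAMBDA_TIMEOUT_S
print(f"waiter budget: {budget}s, Lambda timeout: {LAMBDA_TIMEOUT_S}s, overrun: {overrun}s")
```

The waiter's worst case is twice the Lambda timeout, so the waiter could never reach its own failure path before AWS killed the function.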

Changes

  • Add deadline tracking: set_lambda_deadline(), time_remaining(), is_deadline_approaching() using Lambda context
  • Reorder cleanup_pod() into 5 phases:
    1. Mark disk not-in-use (fast DynamoDB op — prevents stuck disks)
    2. DNS + ALB cleanup (idempotent)
    3. Snapshot initiation (best-effort, skipped if <2min remaining)
    4. Delete K8s service + pod (frees GPUs)
    5. Best-effort snapshot wait + volume cleanup (deadline-aware)
  • Dynamic snapshot waiter: MaxAttempts calculated from remaining Lambda time instead of hardcoded 30 minutes
  • Deadline checks in main loop: Stop processing reservations if <2 minutes remain, defer to next invocation
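
A minimal sketch of how the deadline helpers and the dynamic waiter configuration could fit together. The three helper names come from this PR, but their signatures, the safety margin, the reserve, and snapshot_waiter_config() are assumptions made for illustration; the actual implementation may differ. context.get_remaining_time_in_millis() is the standard Lambda Python context method.

```python
import time

_deadline = None  # monotonic timestamp after which work should stop


def set_lambda_deadline(context, safety_margin_s=30.0):
    """Record an internal deadline slightly ahead of the real Lambda timeout."""
    global _deadline
    remaining_s = context.get_remaining_time_in_millis() / 1000.0
    _deadline = time.monotonic() + remaining_s - safety_margin_s


def time_remaining():
    """Seconds left before the internal deadline (infinite if none is set)."""
    if _deadline is None:
        return float("inf")
    return max(0.0, _deadline - time.monotonic())


def is_deadline_approaching(threshold_s=120.0):
    """True when fewer than threshold_s seconds remain."""
    return time_remaining() < threshold_s


def snapshot_waiter_config(delay_s=15, reserve_s=60.0, cap_s=1800.0):
    """Build a waiter config sized to the remaining time.

    Hypothetical helper: returns None when there is no time for even one
    poll, so the caller can skip the wait entirely.
    """
    budget_s = min(time_remaining() - reserve_s, cap_s)
    attempts = int(budget_s // delay_s)
    if attempts < 1:
        return None
    return {"Delay": delay_s, "MaxAttempts": attempts}
```

The returned dict has the shape boto3 waiters accept through their WaiterConfig argument. The reserve leaves time after the wait for volume cleanup and final state updates.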

What's NOT changed

  • Terraform timeout (already 900s / 15 min)
  • Memory allocation (already appropriate)
  • SQS visibility timeout (already 1000s)
  • Reservation processor Lambda (different timeout profile)

Test plan

  • Deploy to test environment and verify expiry Lambda runs successfully
  • Verify logs show deadline tracking messages (Lambda deadline set:, Xs remaining)
  • Test with a reservation that has a large persistent disk to confirm snapshot wait respects deadline
  • Verify disk is marked as not-in-use immediately (check DynamoDB state during cleanup)
  • Confirm reservation processing stops gracefully when deadline approaches
  • Monitor for any "stuck disk" incidents after deployment (should be eliminated)

The expiry Lambda had no awareness of its remaining execution time,
leading to stuck disks and inconsistent state when AWS killed the
Lambda mid-operation. Key fixes:

- Add internal deadline tracking using context.get_remaining_time_in_millis()
- Move mark_disk_not_in_use() to the START of cleanup (prevents stuck disks)
- Fix snapshot waiter that could wait 30 min (2x the 15 min Lambda timeout)
- Make snapshot wait deadline-aware (dynamically calculates MaxAttempts)
- Skip slow operations (disk content capture, snapshots) when time is short
- Add deadline check in main expiry loop to stop before timeout
- Reorder cleanup_pod() phases: state updates > DNS > pods > best-effort snapshot
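
The deadline check in the main expiry loop might look roughly like the following. This is a sketch under assumptions: process_expired() and its parameters are hypothetical names, and the clock and cleanup function are injected so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def process_expired(
    reservations: Iterable[str],
    time_remaining: Callable[[], float],
    cleanup: Callable[[str], None],
    min_remaining_s: float = 120.0,
) -> Tuple[List[str], List[str]]:
    """Clean up reservations until the deadline is near.

    Returns (processed, deferred). Deferred reservations are left
    untouched so the next invocation can pick them up.
    """
    pending = list(reservations)
    processed: List[str] = []
    deferred: List[str] = []
    for index, reservation_id in enumerate(pending):
        # Check before starting each cleanup, never in the middle of one.
        if time_remaining() < min_remaining_s:
            deferred = pending[index:]
            break
        cleanup(reservation_id)
        processed.append(reservation_id)
    return processed, deferred
```

Checking before each reservation, rather than inside the cleanup, means a reservation is either fully attempted or not started at all.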