fix: Lambda timeout safety — deadline awareness and cleanup reordering by wdvr · Pull Request #53 · wdvr/osdc

wdvr · 2026-03-08T18:34:01Z

Summary

The expiry Lambda's cleanup_pod() function had no awareness of its remaining execution time and could get killed by AWS mid-operation, leaving DynamoDB in an inconsistent state (stuck disks, locked reservations).

Issues Fixed

Snapshot waiter exceeded Lambda timeout: The waiter was configured to wait up to 30 minutes (MaxAttempts=120 * 15s) while the Lambda timeout is only 15 minutes, so any snapshot that outlasted the remaining Lambda time was guaranteed to end in a timeout. Large disk snapshots hit this regularly.
Disk state updated too late: mark_disk_not_in_use() only ran after snapshot completion + volume deletion. If the Lambda timed out during the snapshot wait, the disk stayed marked in-use in DynamoDB with nothing left to unlock it.
No internal deadline tracking: context.get_remaining_time_in_millis() was available but never used. No graceful degradation when time was running out.
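A minimal sketch of what deadline tracking on top of context.get_remaining_time_in_millis() could look like. The three helper names are taken from the Changes list of this PR; the bodies, the module-level `_deadline` variable, and the 120-second default threshold (chosen to match the "<2 minutes" checks described in this PR) are assumptions for illustration, not the code in the diff.

```python
import time

# Monotonic timestamp (seconds) at which the Lambda is expected to be killed.
# None means no deadline has been recorded yet. (Hypothetical module state.)
_deadline = None


def set_lambda_deadline(context) -> None:
    """Record the absolute deadline from the Lambda context object."""
    global _deadline
    remaining_s = context.get_remaining_time_in_millis() / 1000.0
    _deadline = time.monotonic() + remaining_s


def time_remaining() -> float:
    """Seconds left before the deadline; infinite if none was set."""
    if _deadline is None:
        return float("inf")
    return max(0.0, _deadline - time.monotonic())


def is_deadline_approaching(threshold_s: float = 120.0) -> bool:
    """True when less than threshold_s seconds remain (assumed default: 2 min)."""
    return time_remaining() < threshold_s
```

Calling set_lambda_deadline(context) once at the top of the handler lets every later step ask is_deadline_approaching() without passing the context around. Using a monotonic clock keeps the check immune to wall-clock adjustments.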

Changes

Add deadline tracking: set_lambda_deadline(), time_remaining(), is_deadline_approaching() using Lambda context
Reorder cleanup_pod() into 5 phases:
1. Mark disk not-in-use (fast DynamoDB op — prevents stuck disks)
2. DNS + ALB cleanup (idempotent)
3. Snapshot initiation (best-effort, skipped if <2min remaining)
4. Delete K8s service + pod (frees GPUs)
5. Best-effort snapshot wait + volume cleanup (deadline-aware)
Dynamic snapshot waiter: MaxAttempts calculated from remaining Lambda time instead of hardcoded 30 minutes
Deadline checks in main loop: Stop processing reservations if <2 minutes remain, defer to next invocation
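The last two changes can be sketched as follows. The 15-second delay, the 120-attempt hardcoded value, the 900-second timeout, and the 2-minute cutoff come from this PR description; the function names `waiter_attempts_for` and `process_until_deadline`, the safety margin, and the `time_remaining` callable are hypothetical and only illustrate the arithmetic.

```python
from typing import Callable, List, Sequence

LAMBDA_TIMEOUT_S = 900    # Terraform timeout stated in this PR
WAITER_DELAY_S = 15       # seconds between waiter polls
OLD_MAX_ATTEMPTS = 120    # previous hardcoded value

# The old configuration could wait longer than the Lambda can live.
old_wait_s = OLD_MAX_ATTEMPTS * WAITER_DELAY_S  # 1800 s, i.e. 30 minutes


def waiter_attempts_for(remaining_s: float,
                        delay_s: int = WAITER_DELAY_S,
                        safety_margin_s: float = 120.0) -> int:
    """Number of polls that fit in the remaining time; 0 means skip the wait."""
    budget_s = remaining_s - safety_margin_s  # margin is an assumed value
    if budget_s <= 0:
        return 0
    return int(budget_s // delay_s)


def process_until_deadline(reservations: Sequence[str],
                           cleanup: Callable[[str], None],
                           time_remaining: Callable[[], float],
                           cutoff_s: float = 120.0) -> List[str]:
    """Clean up reservations in order; return those deferred to the next run."""
    for i, reservation in enumerate(reservations):
        if time_remaining() < cutoff_s:
            return list(reservations[i:])
        cleanup(reservation)
    return []
```

If the snapshot wait uses a boto3 waiter (an assumption; the description does not name the client), the computed value would be passed as `WaiterConfig={"Delay": 15, "MaxAttempts": n}`, and the wait would be skipped entirely when `n` is 0.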

What's NOT changed

Terraform timeout (already 900s / 15 min)
Memory allocation (already appropriate)
SQS visibility timeout (already 1000s)
Reservation processor Lambda (different timeout profile)

Test plan

Deploy to test environment and verify expiry Lambda runs successfully
Verify logs show deadline tracking messages ("Lambda deadline set:", "Xs remaining")
Test with a reservation that has a large persistent disk to confirm snapshot wait respects deadline
Verify disk is marked as not-in-use immediately (check DynamoDB state during cleanup)
Confirm reservation processing stops gracefully when deadline approaches
Monitor for any "stuck disk" incidents after deployment (should be eliminated)

The expiry Lambda had no awareness of its remaining execution time, leading to stuck disks and inconsistent state when AWS killed the Lambda mid-operation.

Key fixes:
- Add internal deadline tracking using context.get_remaining_time_in_millis()
- Move mark_disk_not_in_use() to the START of cleanup (prevents stuck disks)
- Fix snapshot waiter that could wait 30 min (2x the 15 min Lambda timeout)
- Make snapshot wait deadline-aware (dynamically calculates MaxAttempts)
- Skip slow operations (disk content capture, snapshots) when time is short
- Add deadline check in main expiry loop to stop before timeout
- Reorder cleanup_pod() phases: state updates > DNS > pods > best-effort snapshot
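The five-phase ordering of cleanup_pod() listed under Changes could be sketched like this. It is a simplified illustration under stated assumptions: `cleanup_pod_sketch`, the step names in the dictionary, and the reuse of the 2-minute threshold for the final phase are all hypothetical, and each step is a plain callable standing in for the real DynamoDB, DNS, snapshot, and Kubernetes operations.

```python
from typing import Callable, Dict, List


def cleanup_pod_sketch(steps: Dict[str, Callable[[], None]],
                       time_remaining: Callable[[], float],
                       min_snapshot_s: float = 120.0) -> List[str]:
    """Run cleanup phases in the reordered sequence; return the steps that ran."""
    ran: List[str] = []

    def run(name: str) -> None:
        steps[name]()
        ran.append(name)

    # Phase 1: fast state update first, so a later timeout cannot leave the disk locked.
    run("mark_disk_not_in_use")
    # Phase 2: idempotent DNS + ALB cleanup.
    run("dns_alb_cleanup")
    # Phase 3: start the snapshot only if enough time is left (best-effort).
    snapshot_started = False
    if time_remaining() >= min_snapshot_s:
        run("start_snapshot")
        snapshot_started = True
    # Phase 4: delete the K8s service and pod to free GPUs.
    run("delete_service_and_pod")
    # Phase 5: wait for the snapshot and clean up the volume only if time allows.
    # Reusing min_snapshot_s here is an assumption; the PR gives no threshold for this phase.
    if snapshot_started and time_remaining() >= min_snapshot_s:
        run("wait_snapshot_and_delete_volume")
    return ran
```

The point of the ordering is that everything after phase 2 can be cut short by the deadline without leaving DynamoDB inconsistent; skipped snapshot work is left for a later invocation.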

wdvr wants to merge 1 commit into main from fix/lambda-timeout-improvements
