fix: Lambda timeout safety — deadline awareness and cleanup reordering#53
Open
wdvr wants to merge 1 commit into
The expiry Lambda had no awareness of its remaining execution time, leading to stuck disks and inconsistent state when AWS killed the Lambda mid-operation.
Key fixes:
- Add internal deadline tracking using context.get_remaining_time_in_millis()
- Move mark_disk_not_in_use() to the START of cleanup (prevents stuck disks)
- Fix snapshot waiter that could wait 30 min (2x the 15 min Lambda timeout)
- Make snapshot wait deadline-aware (dynamically calculates MaxAttempts)
- Skip slow operations (disk content capture, snapshots) when time is short
- Add deadline check in main expiry loop to stop before timeout
- Reorder cleanup_pod() phases: state updates > DNS > pods > best-effort snapshot
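The deadline tracking and the main-loop check from the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the three helper names come from the PR description, but their bodies, the module-level `_deadline` variable, the 30-second buffer, the 60-second threshold, and the `expire_all()` loop are all hypothetical. Only `context.get_remaining_time_in_millis()` is the real Lambda context method.

```python
import time

# Hypothetical module-level state: monotonic timestamp (seconds) after which
# no new work should start. None means no deadline has been recorded yet.
_deadline = None

SAFETY_BUFFER_SECONDS = 30.0  # assumed margin, not a value taken from the PR


def set_lambda_deadline(context, buffer_seconds=SAFETY_BUFFER_SECONDS):
    """Record an internal deadline slightly before the Lambda would be killed."""
    global _deadline
    remaining = context.get_remaining_time_in_millis() / 1000.0
    _deadline = time.monotonic() + max(0.0, remaining - buffer_seconds)
    return _deadline


def time_remaining():
    """Seconds left before the internal deadline (infinite if none was set)."""
    if _deadline is None:
        return float("inf")
    return max(0.0, _deadline - time.monotonic())


def is_deadline_approaching(threshold_seconds=60.0):
    """True when less than threshold_seconds remain before the deadline."""
    return time_remaining() < threshold_seconds


def expire_all(reservations, cleanup, min_seconds_per_item=60.0):
    """Run cleanup on each item, stopping early when time is short.

    Returns (processed, skipped) so the caller can log what is left for the
    next invocation instead of being killed mid-operation.
    """
    processed = []
    for index, reservation in enumerate(reservations):
        if is_deadline_approaching(min_seconds_per_item):
            return processed, list(reservations[index:])
        cleanup(reservation)
        processed.append(reservation)
    return processed, []
```

In a handler, `set_lambda_deadline(context)` would be called once at the top of the invocation, and slow steps such as snapshots would consult `is_deadline_approaching()` before starting.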
Summary
The expiry Lambda's cleanup_pod() function had no awareness of its remaining execution time and could get killed by AWS mid-operation, leaving DynamoDB in an inconsistent state (stuck disks, locked reservations).
Issues Fixed
- Snapshot waiter could wait up to 30 minutes (MaxAttempts=120 * 15s) while the Lambda timeout is only 15 minutes. This guaranteed a timeout for large disk snapshots.
- mark_disk_not_in_use() only ran after snapshot completion + volume deletion. If the Lambda timed out during the 30-min snapshot wait, the disk stayed locked forever.
- context.get_remaining_time_in_millis() was available but never used. There was no graceful degradation when time was running out.
Changes
- Added set_lambda_deadline(), time_remaining(), is_deadline_approaching() using Lambda context
- Reordered cleanup_pod() into 5 phases
- MaxAttempts calculated from remaining Lambda time instead of hardcoded 30 minutes
What's NOT changed
Test plan
Lambda deadline set:,Xs remaining)
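The waiter arithmetic described in the summary can be checked in isolation. The sketch below is a minimal illustration, assuming a hypothetical helper `snapshot_wait_attempts()` and a hypothetical 60-second reserve for the steps that follow the snapshot. The 15-second delay, the 120-attempt cap, and the 15-minute timeout are the figures quoted in this PR.

```python
WAITER_DELAY_SECONDS = 15          # from the PR: 15s between waiter polls
HARDCODED_MAX_ATTEMPTS = 120       # from the PR: 120 * 15s = 30 minutes
LAMBDA_TIMEOUT_SECONDS = 15 * 60   # from the PR: 15-minute Lambda timeout


def snapshot_wait_attempts(remaining_seconds,
                           delay=WAITER_DELAY_SECONDS,
                           reserve_seconds=60,
                           cap=HARDCODED_MAX_ATTEMPTS):
    """Return how many waiter polls fit in the remaining Lambda time.

    reserve_seconds is an assumed margin kept back for the work that follows
    the snapshot. A result of 0 means the snapshot wait should be skipped.
    """
    usable = remaining_seconds - reserve_seconds
    if usable < delay:
        return 0
    return min(cap, int(usable // delay))


# The old fixed wait was twice the Lambda timeout.
old_wait_seconds = HARDCODED_MAX_ATTEMPTS * WAITER_DELAY_SECONDS
print(old_wait_seconds, LAMBDA_TIMEOUT_SECONDS)
```

With boto3, the result would typically be passed as `WaiterConfig={"Delay": 15, "MaxAttempts": attempts}` to the EC2 `snapshot_completed` waiter. Whether this PR wires it up exactly that way is not shown here.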