
perf: parallelize EBS disk creation with other setup work#56

Draft
wdvr wants to merge 1 commit into main from perf/parallel-disk-creation

wdvr (Owner) commented Mar 9, 2026

Summary

  • Split create_disk_from_snapshot_or_empty into two phases: start_disk_creation() (non-blocking) and wait_for_disk_ready() (blocking)
  • The create_volume API returns immediately with a volume_id, so GitHub key fetching and EFS setup now run while AWS creates the volume in the background
  • Reduces reservation time by ~10-15s (the combined time of GitHub keys + EFS that previously ran sequentially before disk creation)
  • Reduced SSH daemon poll interval from 10s to 3s with 60 retries (same 3-min max, but detects SSH readiness ~7-27s faster since the image has openssh pre-installed)
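
The poll-interval change in the last bullet can be illustrated with a small sketch. This is not the code from this PR: `wait_for_ssh`, its parameters, and the plain TCP probe are assumptions made for illustration. Only the 3s interval and the 60 retries come from the description.

```python
import socket
import time


def wait_for_ssh(host, port=22, interval_s=3.0, max_retries=60,
                 probe=None, sleep=time.sleep):
    """Poll until SSH looks reachable; return attempts used, or None on timeout.

    Hypothetical helper: the real readiness check in this repo may differ
    (for example, it may exec into the pod instead of opening a socket).
    """
    def tcp_probe():
        # Assumption: "ready" means the port accepts a TCP connection.
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            return False

    probe = probe or tcp_probe
    for attempt in range(1, max_retries + 1):
        if probe():
            return attempt
        if attempt < max_retries:
            # 60 attempts spaced 3s apart keeps the total wait near 3 minutes.
            sleep(interval_s)
    return None
```

With the defaults, the worst case stays at roughly 3 minutes, while a daemon that is already up is noticed on the next 3s tick rather than the next 10s tick.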

Bug fixes included

  • Orphan volume cleanup: If allocation fails after volume creation (e.g., pod scheduling fails), the orphaned EBS volume is now deleted in the outer except block. Previously these would leak.
  • Early volume_id storage: ebs_volume_id is now written to DynamoDB immediately after create_volume returns, so cancel/cleanup can always find the volume even if the reservation fails mid-way.
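
A minimal sketch of how the two fixes fit together, assuming the steps are passed in as callables. The function name `reserve_with_cleanup` and all of its parameters are hypothetical; the actual handler in this PR talks to EC2 and DynamoDB directly.

```python
def reserve_with_cleanup(reservation_id, start_disk_creation, store_volume_id,
                         create_pod, delete_volume):
    """Hypothetical outer handler: delete the volume if a later step fails."""
    volume_id = None
    try:
        volume_id = start_disk_creation()
        # Persist the id before anything else can fail, so cancel/cleanup
        # can still find the volume if this function never finishes.
        store_volume_id(reservation_id, volume_id)
        create_pod(volume_id)
        return volume_id
    except Exception:
        if volume_id is not None:
            try:
                delete_volume(volume_id)
            except Exception:
                # Best effort only: the stored id lets a later cleanup retry.
                pass
        raise
```

Re-raising after the cleanup keeps the original failure visible to the caller, and swallowing a failed delete avoids masking it with a secondary error.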

New flow

PHASE 1: start_disk_creation() → returns volume_id (volume still 'creating')
         ↓ store volume_id in DynamoDB immediately
PHASE 2: github_keys_fetch (2s)    ← runs while volume is being created
         efs_setup (5-10s)          ← runs while volume is being created
PHASE 3: wait_for_disk_ready()     ← likely already done, 0-30s remaining
         pod_create → ssh_check

Test plan

  • Run a reservation with --trace and compare disk_create_start→disk_create_end vs disk_wait_start→disk_create_end (the gap shows time saved)
  • Cancel a reservation mid-creation — verify orphaned volume gets cleaned up
  • Run reservation without persistent disk — verify no regression
  • Run reservation where disk creation fails — verify fallback to emptyDir still works
  • Verify SSH detection is faster with 3s poll (check ssh_ready_check_start→ssh_ready_check_end in trace)
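
For the first check, the comparison reduces to simple arithmetic on the trace timestamps. The sketch below assumes the trace can be read as a mapping from event name to a timestamp in seconds. That format and the helper name `disk_overlap_saved` are assumptions, since the actual `--trace` output is not shown here.

```python
def disk_overlap_saved(trace):
    """Seconds of disk creation hidden behind other setup work.

    `trace` is assumed to map event names to timestamps in seconds.
    """
    total = trace["disk_create_end"] - trace["disk_create_start"]
    blocked = trace["disk_create_end"] - trace["disk_wait_start"]
    # The difference equals disk_wait_start - disk_create_start.
    return total - blocked
```

For example, a trace with `disk_create_start` at 0.0, `disk_wait_start` at 9.5, and `disk_create_end` at 12.0 gives 9.5 seconds saved, with only 2.5 seconds spent actually blocked on the volume.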

Split disk creation into start_disk_creation() (non-blocking) and
wait_for_disk_ready() (blocking). The create_volume API call returns
immediately while AWS creates the volume in the background. We now
do GitHub key fetching and EFS setup during that time, then wait
for the volume only right before pod creation.

Also:
- Store ebs_volume_id in DynamoDB immediately after create_volume
  so cancel/cleanup can always find and clean up orphaned volumes
- Add orphan volume cleanup in the outer except block when
  allocation fails after volume creation
- Reduce SSH daemon poll interval from 10s to 3s (60 retries)
  since the default image has openssh-server pre-installed