perf: parallelize EBS disk creation with other setup work by wdvr · Pull Request #56 · wdvr/osdc

wdvr · 2026-03-09T08:17:30Z

Summary

Split create_disk_from_snapshot_or_empty into two phases: start_disk_creation() (non-blocking) and wait_for_disk_ready() (blocking)
create_volume API returns immediately with a volume_id — we now do GitHub key fetching + EFS setup while AWS creates the volume in the background
Reduces reservation time by ~10-15s (the combined time of GitHub keys + EFS that previously ran sequentially before disk creation)
Reduced SSH daemon poll interval from 10s to 3s (same 3-min max, but detects SSH readiness ~7-27s faster since the image has openssh pre-installed)
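For illustration, the shorter poll loop could look roughly like the sketch below. It assumes readiness is probed with a plain TCP connect to port 22, which this PR does not confirm. The name `wait_for_ssh` and the `connect_timeout` parameter are hypothetical. With a 3s interval, 60 attempts give the same 3-minute budget as before.

```python
import socket
import time


def wait_for_ssh(host, port=22, interval=3.0, max_retries=60,
                 connect_timeout=1.0,
                 connect=socket.create_connection, sleep=time.sleep):
    """Poll until a TCP connection to the SSH port succeeds.

    Hypothetical helper: returns the number of attempts used, or raises
    TimeoutError once max_retries attempts have failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            conn = connect((host, port), timeout=connect_timeout)
        except OSError:
            # Port is not accepting connections yet; wait one interval
            # before the next attempt (no sleep after the final attempt).
            if attempt < max_retries:
                sleep(interval)
            continue
        conn.close()
        return attempt
    raise TimeoutError(
        f"SSH not reachable on {host}:{port} after {max_retries} attempts"
    )
```

The `connect` and `sleep` parameters exist only so the loop can be exercised without a network or real delays.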

Bug fixes included

Orphan volume cleanup: If allocation fails after volume creation (e.g., pod scheduling fails), the orphaned EBS volume is now deleted in the outer except block. Previously these would leak.
Early volume_id storage: ebs_volume_id is now written to DynamoDB immediately after create_volume returns, so cancel/cleanup can always find the volume even if the reservation fails mid-way.
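The two fixes can be sketched together as follows. This is a minimal illustration, not the code in this PR. It assumes `ec2` is a boto3 EC2 client and `table` a boto3 DynamoDB `Table` resource. The function name `reserve_with_cleanup`, the `reservation_id` key name and the `allocate` callback are hypothetical.

```python
import logging

log = logging.getLogger(__name__)


def reserve_with_cleanup(ec2, table, reservation_id, create_kwargs, allocate):
    """Create a volume, record its ID early, delete it if allocation fails."""
    volume_id = None
    try:
        volume_id = ec2.create_volume(**create_kwargs)["VolumeId"]
        # Store the ID immediately so cancel/cleanup can find the volume
        # even if a later step fails. The key name is an assumption.
        table.update_item(
            Key={"reservation_id": reservation_id},
            UpdateExpression="SET ebs_volume_id = :v",
            ExpressionAttributeValues={":v": volume_id},
        )
        allocate(volume_id)  # e.g. pod scheduling; may raise
        return volume_id
    except Exception:
        if volume_id is not None:
            # Best-effort orphan cleanup. AWS may reject the delete while the
            # volume is still being created, so real code may need to retry.
            try:
                ec2.delete_volume(VolumeId=volume_id)
            except Exception:
                log.exception("failed to delete orphaned volume %s", volume_id)
        raise
```

Re-raising after cleanup keeps the original failure visible to the caller. A failed delete is only logged, since the stored `ebs_volume_id` still lets a later cleanup pass find the volume.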

New flow

PHASE 1: start_disk_creation() → returns volume_id (volume still 'creating')
         ↓ store volume_id in DynamoDB immediately
PHASE 2: github_keys_fetch (2s)    ← runs while volume is being created
         efs_setup (5-10s)          ← runs while volume is being created
PHASE 3: wait_for_disk_ready()     ← likely already done, 0-30s remaining
         pod_create → ssh_check
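The three phases above can be sketched as below. This is an illustration under stated assumptions, not the actual implementation. `ec2` is assumed to be a boto3 EC2 client (`create_volume` and `describe_volumes` are real client calls). The signatures of `start_disk_creation` and `wait_for_disk_ready`, the timeout values, and the `run_reservation` wrapper with its callbacks are hypothetical.

```python
import time


def start_disk_creation(ec2, availability_zone, size_gib, snapshot_id=None):
    """PHASE 1: request the volume and return its ID without waiting."""
    params = {"AvailabilityZone": availability_zone}
    if snapshot_id:
        params["SnapshotId"] = snapshot_id  # size defaults to the snapshot's
    else:
        params["Size"] = size_gib
    return ec2.create_volume(**params)["VolumeId"]


def wait_for_disk_ready(ec2, volume_id, timeout=120.0, interval=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """PHASE 3: block until the volume reports the 'available' state."""
    deadline = clock() + timeout
    while True:
        volumes = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"]
        state = volumes[0]["State"]
        if state == "available":
            return
        if state == "error":
            raise RuntimeError(f"volume {volume_id} entered the error state")
        if clock() >= deadline:
            raise TimeoutError(f"volume {volume_id} still {state!r}")
        sleep(interval)


def run_reservation(ec2, availability_zone, size_gib,
                    fetch_github_keys, setup_efs, create_pod):
    """Overlap the AWS-side volume creation with the other setup steps."""
    volume_id = start_disk_creation(ec2, availability_zone, size_gib)
    keys = fetch_github_keys()   # PHASE 2: runs while the volume is creating
    efs = setup_efs()            # PHASE 2: runs while the volume is creating
    wait_for_disk_ready(ec2, volume_id)
    return create_pod(volume_id, keys, efs)
```

No threads are needed here: the overlap comes from AWS creating the volume on its side while the caller does the other setup. boto3 also provides a `volume_available` waiter that could replace the hand-written loop.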

Test plan

Run a reservation with --trace and compare disk_create_start→disk_create_end vs disk_wait_start→disk_create_end (the gap shows time saved)
Cancel a reservation mid-creation — verify orphaned volume gets cleaned up
Run reservation without persistent disk — verify no regression
Run reservation where disk creation fails — verify fallback to emptyDir still works
Verify SSH detection is faster with 3s poll (check ssh_ready_check_start→ssh_ready_check_end in trace)
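The comparison in the first test step reduces to simple arithmetic on trace timestamps. The sketch below assumes the trace can be read as a mapping from event name to a timestamp in seconds. That format is hypothetical, since the output of --trace is not shown here, and the numbers in the example are made up.

```python
def disk_time_saved(trace):
    """Return (total_create_time, blocking_wait, saved) in seconds.

    `trace` maps event names to timestamps; this format is an assumption.
    """
    total = trace["disk_create_end"] - trace["disk_create_start"]
    blocking = trace["disk_create_end"] - trace["disk_wait_start"]
    # The gap between the two spans is the work that overlapped with
    # volume creation instead of running before it.
    return total, blocking, total - blocking


# Made-up timestamps, for illustration only.
example = {
    "disk_create_start": 0.0,
    "disk_wait_start": 9.0,
    "disk_create_end": 21.0,
}
total, blocking, saved = disk_time_saved(example)
```

With these values the volume takes 21s end to end, but the reservation blocks on it for only 12s, so 9s of setup overlapped with disk creation.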

Split disk creation into start_disk_creation() (non-blocking) and wait_for_disk_ready() (blocking). The create_volume API call returns immediately while AWS creates the volume in the background. We now do GitHub key fetching and EFS setup during that time, then wait for the volume only right before pod creation.

Also:
- Store ebs_volume_id in DynamoDB immediately after create_volume so cancel/cleanup can always find and clean up orphaned volumes
- Add orphan volume cleanup in the outer except block when allocation fails after volume creation
- Reduce SSH daemon poll interval from 10s to 3s (60 retries) since default image has openssh-server pre-installed

wdvr wants to merge 1 commit into main from perf/parallel-disk-creation
