fix(timing): warmup pass before timing loop to amortise torch.compile JIT#69

Closed
ivanbasov wants to merge 6 commits into main from worktree-timing

Conversation

@ivanbasov
Collaborator

Summary

  • Adds a single warmup forward pass through pipeline_module before the timing loop in compute_logical_error_rate (see the sketch after this list)
  • Triggers torch.compile's lazy compilation so the JIT cost does not inflate the first-batch timing measurement
  • Guard: the warmup only runs when trt_context is None and _applied_compile is set (i.e. the torch-only path with compile enabled)
  • Calls CUDA sync after the warmup pass on GPU devices
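
A minimal sketch of the warmup block, assuming the names from this description (pipeline_module, trt_context, _applied_compile); the helper name, batch argument, and device handling here are illustrative, not the actual code:

```python
import torch

def _maybe_warmup(pipeline_module, sample_batch, trt_context, applied_compile, device):
    """One-off warmup so torch.compile's lazy JIT cost is paid before timing starts."""
    if trt_context is not None or not applied_compile:
        return  # TRT path, or compile disabled: no JIT cost to amortise
    with torch.no_grad():
        pipeline_module(sample_batch)  # first call triggers lazy compilation
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for compile/kernels to finish before the timer starts
```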

Motivation

Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost, skewing Phase Timing numbers — especially at low sample counts (PREDECODER_INFERENCE_NUM_SAMPLES=1):

Model forward (first batch):
  • Before: ~887 ms
  • After: ~1 ms
With large sample counts the JIT cost gets amortised naturally, but at small counts it dominates and makes Phase Timing numbers misleading. Proposed by Igor Almeida Baratta; approved by Ben Howe.
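
Concretely, with a steady-state forward time t ≈ 1 ms and a one-off JIT cost c ≈ 887 ms, the mean per-batch time over N batches is t + c/N: roughly 888 ms at N = 1 but under 2 ms at N = 1000.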

Test plan

  • Existing unit tests pass (test_inference_latency_timing.py, test_tensorrt_fallback.py)
  • Run with PREDECODER_INFERENCE_NUM_SAMPLES=1 and confirm the first-batch model-forward time matches the steady-state time
  • Run with TRT enabled and confirm the warmup block is skipped
  • CI green

🤖 Generated with Claude Code

ivanbasov and others added 6 commits March 30, 2026 11:54
…fault

torch.compile=on combined with spawned DataLoader workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the "Train all orientations" step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… JIT

Without this, the first batch in the timing loop bears the full
torch.compile lazy-compilation cost (~887 ms vs ~1 ms steady-state),
skewing Phase Timing numbers — especially at low sample counts like
PREDECODER_INFERENCE_NUM_SAMPLES=1.  The warmup only runs when
torch.compile is active and TRT is not in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts the warmup block into a named helper so it can be tested in
isolation.  Five tests cover: fires when compile is active (CPU), skipped
when compile is off, skipped when TRT context is present, CUDA sync called
on GPU device, CUDA sync not called on CPU device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
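
A pytest-style sketch of one of those five cases, reusing the hypothetical _maybe_warmup helper from the summary sketch above (the real test names and signatures in test_inference_latency_timing.py may differ):

```python
from unittest.mock import MagicMock

import torch

def test_warmup_skipped_when_trt_context_present():
    """The warmup must not fire on the TensorRT path."""
    module = MagicMock()
    batch = torch.zeros(1, 8)
    # trt_context is non-None, so the helper should return without touching the module
    _maybe_warmup(module, batch, trt_context=object(), applied_compile=True,
                  device=torch.device("cpu"))
    module.assert_not_called()
```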
@ivanbasov ivanbasov closed this Apr 20, 2026
@bmhowe23 bmhowe23 deleted the worktree-timing branch May 5, 2026 17:19
