fix(timing): warmup pass before timing loop to amortise torch.compile JIT#69

Closed
ivanbasov wants to merge 6 commits into main from worktree-timing

Conversation

@ivanbasov
Collaborator

Summary

  • Adds a single warmup forward pass through pipeline_module before the timing loop in compute_logical_error_rate (see the sketch after this list)
  • Triggers torch.compile's lazy compilation so the JIT cost does not inflate the first-batch timing measurement
  • Guard: the warmup only runs when trt_context is None and _applied_compile is set (i.e. the torch-only path with compile enabled)
  • Calls CUDA sync after the warmup pass on GPU devices
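
A minimal sketch of the warmup block, assuming the names from this description (pipeline_module, trt_context, _applied_compile); the helper name, batch argument, and device handling here are illustrative, not the actual code:

```python
import torch

def _maybe_warmup(pipeline_module, sample_batch, trt_context, applied_compile, device):
    """One-off warmup so torch.compile's lazy JIT cost is paid before timing starts."""
    if trt_context is not None or not applied_compile:
        return  # TRT path, or compile disabled: no JIT cost to amortise
    with torch.no_grad():
        pipeline_module(sample_batch)  # first call triggers lazy compilation
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for compile/kernels to finish before the timer starts
```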

Motivation

Without this, the first batch in the timing loop bears the full torch.compile lazy-compilation cost, skewing Phase Timing numbers — especially at low sample counts (PREDECODER_INFERENCE_NUM_SAMPLES=1):

Model forward (first batch):
  • Before: ~887 ms
  • After: ~1 ms
With large sample counts the JIT cost gets amortised naturally, but at small counts it dominates and makes Phase Timing numbers misleading. Proposed by Igor Almeida Baratta; approved by Ben Howe.
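
Concretely, with a steady-state forward time t ≈ 1 ms and a one-off JIT cost c ≈ 887 ms, the mean per-batch time over N batches is t + c/N: roughly 888 ms at N = 1 but under 2 ms at N = 1000.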

Test plan

  • Existing unit tests pass (test_inference_latency_timing.py, test_tensorrt_fallback.py)
  • Run with PREDECODER_INFERENCE_NUM_SAMPLES=1 and confirm the first-batch model-forward time matches the steady-state time
  • Run with TRT enabled and confirm the warmup block is skipped
  • CI green

🤖 Generated with Claude Code

ivanbasov and others added 6 commits March 30, 2026 11:54
…fault

torch.compile=on combined with spawned DataLoader workers during LER
validation causes a segfault (20 leaked semaphores, core dumped).
Set PREDECODER_TORCH_COMPILE=0 for the "Train all orientations" step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… JIT

Without this, the first batch in the timing loop bears the full
torch.compile lazy-compilation cost (~887 ms vs ~1 ms steady-state),
skewing Phase Timing numbers — especially at low sample counts like
PREDECODER_INFERENCE_NUM_SAMPLES=1.  The warmup only runs when
torch.compile is active and TRT is not in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extracts the warmup block into a named helper so it can be tested in
isolation.  Five tests cover: fires when compile is active (CPU), skipped
when compile is off, skipped when TRT context is present, CUDA sync called
on GPU device, CUDA sync not called on CPU device.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
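
A pytest-style sketch of one of those five cases, reusing the hypothetical _maybe_warmup helper from the summary sketch above (the real test names and signatures in test_inference_latency_timing.py may differ):

```python
from unittest.mock import MagicMock

import torch

def test_warmup_skipped_when_trt_context_present():
    """The warmup must not fire on the TensorRT path."""
    module = MagicMock()
    batch = torch.zeros(1, 8)
    # trt_context is non-None, so the helper should return without touching the module
    _maybe_warmup(module, batch, trt_context=object(), applied_compile=True,
                  device=torch.device("cpu"))
    module.assert_not_called()
```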
@ivanbasov ivanbasov closed this Apr 20, 2026
@bmhowe23 bmhowe23 deleted the worktree-timing branch May 5, 2026 17:19
