From 7f0f6c860d48edaf9db43ddf010e80f0bac75447 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Mon, 30 Mar 2026 11:54:58 -0700
Subject: [PATCH 1/3] fix(ci): disable torch.compile in orientation training
 to prevent segfault

Enabling torch.compile together with spawn-based DataLoader workers
during LER validation causes a segfault (20 leaked semaphores, core
dumped). Set PREDECODER_TORCH_COMPILE=0 for the "Train all
orientations" step.

Co-Authored-By: Claude Sonnet 4.6
---
 .github/workflows/long-running-tests.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/long-running-tests.yml b/.github/workflows/long-running-tests.yml
index f536780..3c9e268 100644
--- a/.github/workflows/long-running-tests.yml
+++ b/.github/workflows/long-running-tests.yml
@@ -184,6 +184,7 @@ jobs:
           PREDECODER_VAL_SAMPLES: "4096"
           PREDECODER_TEST_SAMPLES: "4096"
           PREDECODER_TRAIN_EPOCHS: "1"
+          PREDECODER_TORCH_COMPILE: "0"
 
       - name: Multi-orientation inference (O1–O4) with LER output check
         shell: bash

From 9d3fa086d9768091054aabe95606fea3424002f6 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Mon, 30 Mar 2026 11:57:04 -0700
Subject: [PATCH 2/3] Revert "fix(ci): disable torch.compile in orientation
 training to prevent segfault"

This reverts commit 7f0f6c860d48edaf9db43ddf010e80f0bac75447.
---
 .github/workflows/long-running-tests.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.github/workflows/long-running-tests.yml b/.github/workflows/long-running-tests.yml
index 3c9e268..f536780 100644
--- a/.github/workflows/long-running-tests.yml
+++ b/.github/workflows/long-running-tests.yml
@@ -184,7 +184,6 @@ jobs:
           PREDECODER_VAL_SAMPLES: "4096"
           PREDECODER_TEST_SAMPLES: "4096"
           PREDECODER_TRAIN_EPOCHS: "1"
-          PREDECODER_TORCH_COMPILE: "0"
 
       - name: Multi-orientation inference (O1–O4) with LER output check
         shell: bash

From 4fe899d8a6c030d01b75f48da6eeafe624520e37 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Tue, 21 Apr 2026 11:06:56 -0700
Subject: [PATCH 3/3] docs: add training recommendations for epochs, shots,
 and noise upscaling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Advise ≥100 epochs for models 1, 4, 5 (uncorrelated matching) in both
  README and TRAINING env-var table
- Document 67 M shots/epoch with 8 GPUs as the recommended sample count
- Add concise algorithm summary to the noise upscaling section
  (p_max → rescale to 0.6%)

Co-Authored-By: Claude Sonnet 4.6
---
 README.md   | 7 +++++++
 TRAINING.md | 4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a0b398b..4ef6bfc 100644
--- a/README.md
+++ b/README.md
@@ -269,6 +269,11 @@ Each `model_id` has a fixed receptive field \(R\):
 - **model 4**: \(R=13\)
 - **model 5**: \(R=13\)
 
+#### Training recommendations
+
+- **Models 1, 4, 5 (uncorrelated matching):** Train for at least **100 epochs**. Fewer epochs will yield under-trained models.
+- **Shots per epoch:** Use **67 million** shots per epoch when training with 8 GPUs (`PREDECODER_TRAIN_SAMPLES=67108864`). Using fewer shots per epoch produces worse results.
+
 #### Distance / rounds semantics
 
 - Top-level `distance` / `n_rounds` are the **evaluation targets** (what you care about in inference).
@@ -396,6 +401,8 @@ The five grouped totals are:
 - If `max_group >= 6e-3`: parameters are **not** modified (the training log emits a warning in case this indicates a configuration error).
 - Non-surface-code types (`code_type != "surface_code"`) are never upscaled.
 
+**Algorithm in brief:** The pipeline computes `p_max = max(P_prep, P_meas, P_idle_cnot, P_idle_spam, P_cnot)` over the full 25-parameter noise vector and rescales the entire vector by `0.006 / p_max`, raising `p_max` to **0.6%** (6 × 10⁻³). The original noise model is preserved unchanged for evaluation.
+
 We have found that training on denser syndromes and then evaluating on sparser data produces better results than training directly on sparse data.
 
 #### Skipping noise upscaling
diff --git a/TRAINING.md b/TRAINING.md
index d6ae31f..d3d71d7 100644
--- a/TRAINING.md
+++ b/TRAINING.md
@@ -141,8 +141,8 @@ export CONFIG_NAME=config_qec_decoder_r13_fp8
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. |
-| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. |
+| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. For models 1, 4, 5 (uncorrelated matching), use at least **100** epochs; fewer epochs will yield under-trained models. |
+| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. For best results with 8 GPUs, use **67 million** shots per epoch (`67108864`); fewer shots per epoch will produce worse results. |
 | `PREDECODER_LR_MILESTONES` | config-defined | Comma-separated LR schedule milestone fractions (e.g. `0.25,0.5,1.0`). |
 | `PREDECODER_TIMING_RUN` | unset | Set `1` for timing/benchmarking mode (disables some overhead). |
 | `PREDECODER_TORCH_COMPILE` | `0` when run via `sbatch_train.sh`, otherwise unset | `0` to disable `torch.compile`, `1` to enable. |
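
For reference, a minimal Python sketch of the noise-upscaling rule summarized in PATCH 3/3. The 0.6% target, the five grouped totals, and the `max_group >= 6e-3` guard come from the README text in the patch; the function name `upscale_noise`, the dict layout, and the assumption that the grouped totals appear as keys of the parameter mapping are illustrative, not the repository's actual API.

```python
# Illustrative sketch only -- upscale_noise, the dict layout, and the
# assumption that grouped totals are keys of the mapping are hypothetical.

P_MAX_TARGET = 6e-3  # 0.6%, the documented upscaling target


def upscale_noise(params: dict[str, float]) -> dict[str, float]:
    """Rescale a noise vector so its largest grouped total reaches 0.6%.

    Mirrors the documented guard: if max_group >= 6e-3, the parameters
    are returned unmodified.
    """
    grouped_totals = ["P_prep", "P_meas", "P_idle_cnot", "P_idle_spam", "P_cnot"]
    p_max = max(params[g] for g in grouped_totals)
    if p_max >= P_MAX_TARGET:
        return params  # not modified (the training log warns in this case)
    scale = P_MAX_TARGET / p_max  # i.e. 0.006 / p_max
    # Rescale the entire 25-parameter vector; the original noise model
    # is kept unchanged for evaluation.
    return {k: v * scale for k, v in params.items()}
```

Training would then run on the upscaled (denser) noise while evaluation uses the preserved original model, matching the documented observation that training on denser syndromes and evaluating on sparser data outperforms training directly on sparse data.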