From 7f0f6c860d48edaf9db43ddf010e80f0bac75447 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Mon, 30 Mar 2026 11:54:58 -0700
Subject: [PATCH 1/3] fix(ci): disable torch.compile in orientation training
 to prevent segfault

Enabling torch.compile together with spawn-based DataLoader workers
during LER validation causes a segfault (20 leaked semaphores, core
dumped). Set PREDECODER_TORCH_COMPILE=0 for the "Train all
orientations" step.

Co-Authored-By: Claude Sonnet 4.6
---
 .github/workflows/long-running-tests.yml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/workflows/long-running-tests.yml b/.github/workflows/long-running-tests.yml
index f536780..3c9e268 100644
--- a/.github/workflows/long-running-tests.yml
+++ b/.github/workflows/long-running-tests.yml
@@ -184,6 +184,7 @@ jobs:
           PREDECODER_VAL_SAMPLES: "4096"
           PREDECODER_TEST_SAMPLES: "4096"
           PREDECODER_TRAIN_EPOCHS: "1"
+          PREDECODER_TORCH_COMPILE: "0"
 
       - name: Multi-orientation inference (O1–O4) with LER output check
         shell: bash

From 9d3fa086d9768091054aabe95606fea3424002f6 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Mon, 30 Mar 2026 11:57:04 -0700
Subject: [PATCH 2/3] Revert "fix(ci): disable torch.compile in orientation
 training to prevent segfault"

This reverts commit 7f0f6c860d48edaf9db43ddf010e80f0bac75447.
---
 .github/workflows/long-running-tests.yml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/.github/workflows/long-running-tests.yml b/.github/workflows/long-running-tests.yml
index 3c9e268..f536780 100644
--- a/.github/workflows/long-running-tests.yml
+++ b/.github/workflows/long-running-tests.yml
@@ -184,7 +184,6 @@ jobs:
           PREDECODER_VAL_SAMPLES: "4096"
           PREDECODER_TEST_SAMPLES: "4096"
           PREDECODER_TRAIN_EPOCHS: "1"
-          PREDECODER_TORCH_COMPILE: "0"
 
       - name: Multi-orientation inference (O1–O4) with LER output check
         shell: bash

From 4fe899d8a6c030d01b75f48da6eeafe624520e37 Mon Sep 17 00:00:00 2001
From: Ivan Basov
Date: Tue, 21 Apr 2026 11:06:56 -0700
Subject: [PATCH 3/3] docs: add training recommendations for epochs, shots,
 and noise upscaling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Advise ≥100 epochs for models 1, 4, 5 (uncorrelated matching) in both
  README and TRAINING env-var table
- Document 67 M shots/epoch with 8 GPUs as the recommended sample count
- Add concise algorithm summary to the noise upscaling section
  (p_max → rescale to 0.6%)

Co-Authored-By: Claude Sonnet 4.6
---
 README.md   | 7 +++++++
 TRAINING.md | 4 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index a0b398b..4ef6bfc 100644
--- a/README.md
+++ b/README.md
@@ -269,6 +269,11 @@ Each `model_id` has a fixed receptive field \(R\):
 - **model 4**: \(R=13\)
 - **model 5**: \(R=13\)
 
+#### Training recommendations
+
+- **Models 1, 4, 5 (uncorrelated matching):** Train for at least **100 epochs**. Fewer epochs will yield under-trained models.
+- **Shots per epoch:** Use **67 million** shots per epoch when training with 8 GPUs (`PREDECODER_TRAIN_SAMPLES=67108864`). Using fewer shots per epoch produces worse results.
+
 #### Distance / rounds semantics
 
 - Top-level `distance` / `n_rounds` are the **evaluation targets** (what you care about in inference).
@@ -396,6 +401,8 @@ The five grouped totals are:
 - If `max_group >= 6e-3`: parameters are **not** modified (the training log emits a warning in case this indicates a configuration error).
 - Non-surface-code types (`code_type != "surface_code"`) are never upscaled.
 
+**Algorithm in brief:** The pipeline computes `p_max = max(P_prep, P_meas, P_idle_cnot, P_idle_spam, P_cnot)` over the full 25-parameter noise vector and rescales the entire vector by `0.006 / p_max`, raising `p_max` to **0.6%** (6 × 10⁻³). The original noise model is preserved unchanged for evaluation.
+
 We have found that training on denser syndromes and then evaluating on sparser data produces better results than training directly on sparse data.
 
 #### Skipping noise upscaling
diff --git a/TRAINING.md b/TRAINING.md
index d6ae31f..d3d71d7 100644
--- a/TRAINING.md
+++ b/TRAINING.md
@@ -141,8 +141,8 @@ export CONFIG_NAME=config_qec_decoder_r13_fp8
 
 | Variable | Default | Description |
 |----------|---------|-------------|
-| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. |
-| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. |
+| `PREDECODER_TRAIN_EPOCHS` | `100` | Total number of training epochs. For models 1, 4, 5 (uncorrelated matching), use at least **100** epochs; fewer epochs will yield under-trained models. |
+| `PREDECODER_TRAIN_SAMPLES` | config-defined | Samples per epoch. Bypasses auto-scaling when set explicitly. For best results with 8 GPUs, use **67 million** shots per epoch (`67108864`); fewer shots per epoch will produce worse results. |
 | `PREDECODER_LR_MILESTONES` | config-defined | Comma-separated LR schedule milestone fractions (e.g. `0.25,0.5,1.0`). |
 | `PREDECODER_TIMING_RUN` | unset | Set `1` for timing/benchmarking mode (disables some overhead). |
 | `PREDECODER_TORCH_COMPILE` | `0` when run via `sbatch_train.sh`, otherwise unset | `0` to disable `torch.compile`, `1` to enable. |
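
For reference, a minimal Python sketch of the noise-upscaling rule summarized in PATCH 3/3. The 0.6% target, the five grouped totals, and the `max_group >= 6e-3` guard come from the README text in the patch; the function name `upscale_noise`, the dict layout, and the assumption that the grouped totals appear as keys of the parameter mapping are illustrative, not the repository's actual API.

```python
# Illustrative sketch only -- upscale_noise, the dict layout, and the
# assumption that grouped totals are keys of the mapping are hypothetical.

P_MAX_TARGET = 6e-3  # 0.6%, the documented upscaling target


def upscale_noise(params: dict[str, float]) -> dict[str, float]:
    """Rescale a noise vector so its largest grouped total reaches 0.6%.

    Mirrors the documented guard: if max_group >= 6e-3, the parameters
    are returned unmodified.
    """
    grouped_totals = ["P_prep", "P_meas", "P_idle_cnot", "P_idle_spam", "P_cnot"]
    p_max = max(params[g] for g in grouped_totals)
    if p_max >= P_MAX_TARGET:
        return params  # not modified (the training log warns in this case)
    scale = P_MAX_TARGET / p_max  # i.e. 0.006 / p_max
    # Rescale the entire 25-parameter vector; the original noise model
    # is kept unchanged for evaluation.
    return {k: v * scale for k, v in params.items()}
```

Training would then run on the upscaled (denser) noise while evaluation uses the preserved original model, matching the documented observation that training on denser syndromes and evaluating on sparser data outperforms training directly on sparse data.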