Speed up spacelike HE with optional parallel execution #81
Merged
Conversation
Adds a new `data.use_parallel_spacelike` flag (default False) that runs the spacelike homological-equivalence pass via a 2-partition of the stabilizer-overlap graph. The two color classes are reduced independently inside a torch.compile-friendly inner loop, cutting compiled-graph crossings per spacelike pass on GPU.

Algorithm in code/qec/surface_code/homological_equivalence_torch.py: builds a 2-coloring of the stabilizer-overlap graph at cache time, pre-packs compile inputs into cache.parallel_partition_packed, and adds a parallel weight-reduction plus weight-2 fix-equivalence path. The 2-coloring assumes the overlap graph is bipartite (which holds for the rotated single-basis surface code); non-bipartite inputs are rejected by _build_spacelike_partition with a named diagnostic so misuses fail loudly.

Wiring through memory_circuit_torch.py, generator_torch.py, and training/train.py, plus a False default in workflows/config_validator.py, keeps existing configs unchanged. Tests added in a follow-up commit.

Signed-off-by: kvmto <kmato@nvidia.com>
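The 2-coloring step described above can be sketched as a BFS bipartiteness check. This is an illustrative reconstruction, not the PR's actual `_build_spacelike_partition`; the function name, edge-list input, and diagnostic string are assumptions, but the contract matches the commit: return a valid 2-coloring or fail loudly on a non-bipartite overlap graph.

```python
from collections import deque

def build_spacelike_partition(num_stabilizers, overlap_edges):
    """2-color the stabilizer-overlap graph via BFS.

    Returns color[i] in {0, 1} for each stabilizer. Raises ValueError
    with a named diagnostic if the graph contains an odd cycle
    (i.e. is not bipartite), so misuses fail loudly.
    """
    adjacency = [[] for _ in range(num_stabilizers)]
    for u, v in overlap_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    color = [-1] * num_stabilizers
    for start in range(num_stabilizers):
        if color[start] != -1:
            continue  # already colored via an earlier component
        color[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if color[neighbor] == -1:
                    color[neighbor] = 1 - color[node]
                    queue.append(neighbor)
                elif color[neighbor] == color[node]:
                    # Hypothetical diagnostic name, mirroring the PR's
                    # "named diagnostic" on non-bipartite inputs.
                    raise ValueError(
                        "non_bipartite_overlap_graph: odd cycle through "
                        f"stabilizers {node} and {neighbor}"
                    )
    return color
```

For the rotated single-basis surface code the overlap graph is bipartite by construction, so in practice the error branch only fires on misconfigured or unsupported codes.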
Adds correctness coverage for the data.use_parallel_spacelike code path introduced in the previous commit.

code/tests/mid/test_homological_equivalence.py: bipartite partition validity across supported distances; named-failure diagnostic on non-bipartite overlap graphs; parallel weight-2 fix-equivalence moves errors off boundary stabilizers exactly like the sequential path; the parallel path is idempotent and matches the sequential path on weight-2-only inputs; cache.parallel_partition_packed is populated at cache-build time with correct padding for empty colors; and the production hot path reads the pre-packed view rather than re-packing on every call.

code/tests/test_gpu.py: CUDA test that the compiled parallel spacelike path produces bit-identical output to the eager parallel path, locking in the pack-once cache field and the float-only chunk convergence check.

All new tests gate on the same fixtures and surface-code distances as the existing HE tests; no new external dependencies, no new test infrastructure.

Signed-off-by: kvmto <kmato@nvidia.com>
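The padding behavior the cache-build tests pin down can be illustrated with a small sketch. The function name `pack_partition` and the `-1` sentinel are assumptions for illustration; the point is the one the commit makes: both color classes are padded to a common width at cache time so downstream tensor shapes stay static, which is what a torch.compile-friendly inner loop needs.

```python
def pack_partition(colors, pad_value=-1):
    """Pack two color classes into equal-length index lists.

    Padding with a sentinel (here -1) keeps the packed shape static
    even when one color class is smaller -- or empty -- so a compiled
    inner loop sees the same tensor shape on every call.
    """
    class0 = [i for i, c in enumerate(colors) if c == 0]
    class1 = [i for i, c in enumerate(colors) if c == 1]
    # width >= 1 guarantees a non-degenerate shape for empty colors.
    width = max(len(class0), len(class1), 1)
    class0 += [pad_value] * (width - len(class0))
    class1 += [pad_value] * (width - len(class1))
    return [class0, class1]
```

A hot path that reads this pre-packed view once, rather than re-packing per call, avoids both the Python-side list work and the shape churn that would trigger torch.compile recompilation.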
Adds user-facing documentation and discoverability for the data.use_parallel_spacelike flag introduced in the previous two commits.

README.md: new 'HE acceleration (advanced): parallel spacelike' subsection at the end of 'Configuration and advanced usage', covering how to enable it (YAML + CLI), three pros (GPU speedup on the rotated single-basis surface code, canonical equivalence to the sequential path with test coverage, composes with use_weight2), and four caveats (rotated single-basis only, with a bipartite overlap graph; use_compile=True required for the speedup; slightly higher cache-build cost and memory; GPU-targeted).

conf/config_public.yaml: one-off surfacing of use_parallel_spacelike inside the data block, with a comment explaining that the other HE knobs intentionally remain in internal defaults and pointing to the README section.

conf/config_pre_decoder_memory_surface_model_1_d9.yaml: lists the flag alongside the existing HE knobs (timelike_he, num_he_cycles, use_weight2, max_passes_*) so the advanced config exposes the full HE surface.

No code or test changes in this commit.

Signed-off-by: kvmto <kmato@nvidia.com>
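The YAML enable recipe documented in that README subsection might look like the following excerpt. Only the two field names and their association with the `data` block come from this PR; the surrounding structure and comments are illustrative.

```yaml
# Illustrative user-config excerpt; field names from this PR,
# surrounding structure assumed.
data:
  use_compile: true             # required for the parallel-spacelike speedup
  use_parallel_spacelike: true  # opt-in; default is false
```

The equivalent CLI override, quoted later in this thread, is `EXTRA_PARAMS="data.use_compile=True data.use_parallel_spacelike=True"`.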
Adds a new step to the existing multi-gpu-tests job that forces data.use_compile=True + data.use_parallel_spacelike=True, exercising the parallel + compiled spacelike HE path end-to-end under DDP on 2 GPUs. The existing default-config step is left untouched so coverage of the default path does not regress.

Failure modes specific to the new combination (per-rank device pinning of the partition, torch.compile cache contention across ranks, deadlocks during the compiled inner loop) would surface as a training crash here, so this step closes the multi-GPU coverage gap that the unit tests alone do not exercise.

Bumps the multi-gpu-tests timeout-minutes from 20 to 35 to accommodate the second smoke step. The job runs only on push to main (if: github.ref == 'refs/heads/main'), matching the existing step's gating; PR builds are unaffected.

Signed-off-by: kvmto <kmato@nvidia.com>
…se_compile
The docs(he) commit surfaced `data.use_parallel_spacelike` in
`conf/config_public.yaml` and documented an `EXTRA_PARAMS=
"data.use_compile=True data.use_parallel_spacelike=True"` enable
recipe in README.md, but `validate_public_config` still only allowed
`data.{code_rotation,noise_model}`. Result: loading the shipped
`config_public.yaml` would raise `ValueError: Config field
'data.use_parallel_spacelike' is not supported in the public release`,
the documented CLI recipe would fail the same way for both keys, and
the existing `test_inference_public_model` plus the new multi-GPU CI
smoke step would crash on first run.
Fix: extend `allowed_data_keys` in
`code/workflows/config_validator.py` to include `use_compile` and
`use_parallel_spacelike`. Both default to `False` in the hidden
defaults / `getattr(..., False)` call sites, so opt-in behaviour is
unchanged; only the validator gate is relaxed. Add a focused type
check so non-boolean inputs (e.g. `data.use_parallel_spacelike: "yes"`
from a YAML edit) fail loudly instead of silently flowing through
`bool(...)` casts as truthy.
Tests: `code/tests/test_public_config.py` gets four new cases pinning
the contract -- accept + reject-non-bool for each of the two flags.
The existing 19 test_public_config cases continue to pass.
Signed-off-by: kvmto <kmato@nvidia.com>
Explain the invariant-preserving behavior without implying bit-identical sequential output, and surface use_compile beside the public acceleration flag. Signed-off-by: kvmto <kmato@nvidia.com>
ivanbasov
reviewed
May 11, 2026
Collaborator
left a comment
Local review of the public-repo PR. The pack-once regression from the private PR is properly fixed (verified via test_compiled_parallel_reads_pre_packed_partition_off_cache — that test's docstring even cites the prior bug, which is the right shape). Inline comments below cover five items I'd want addressed; none block merge in my view.
ivanbasov
approved these changes
May 11, 2026
Clarify public flag validation, non-bit-identical HE outputs, and torch.compile cold-start behavior before merge. Signed-off-by: kvmto <kmato@nvidia.com>
bmhowe23
approved these changes
May 12, 2026
Summary
Adds optional `data.use_parallel_spacelike` support for Torch surface-code HE. When enabled with `data.use_compile=True`, spacelike HE uses a validated 2-partition of the stabilizer-overlap graph so independent stabilizers can run in parallel through a compile-friendly path. This threads the option through training, `QCDataGeneratorTorch`, `MemoryCircuitTorch`, public config validation, docs, tests, and GPU CI.
Performance
Internal d=13/r=13 benchmark on 4x B200 with identical 25p noise model:
Testing