
Speed up Stim Sampling with Faster Ref Sample #1036

Merged
Strilanc merged 1 commit into quantumlib:main from qec-pconner:speedUpSamplingWithFasterRefSample on Feb 4, 2026

Conversation

@qec-pconner (Contributor) commented Feb 3, 2026

This PR speeds up `stim sample` by switching the reference sample calculation from the `TableauSimulator` to the `ReferenceSampleTree`. Calculating the reference sample takes a large portion of the total time for larger codes.

Testing of performance for larger codes (distance 25 at 1M rounds) was done by building stim with `bazel build :stim`, then running the following CLI command:
`time bazel-bin/stim --gen surface_code --task rotated_memory_x --distance 25 --rounds 1000000 --after_clifford_depolarization 0.001 | bazel-bin/stim sample --shots 10 --out_format=r8 > ./debug.r8`
Metrics given are based on my machine (Linux), but all metrics should be considered relative to each other.

The time taken for generating the circuit is considered trivial (< 0.1s).

Before this change, this sample took ~7m 23s.
With this change, this sample took ~2m 12s, a ~3.4x speedup (about as fast as not calculating a reference sample at all).

I also looked into `FrameSimulator`'s logic to look for more speedup opportunities.
The only real opportunity seen is to use multi-threading with worker threads.
In particular, any of the overloads for `simd_bits_range_ref::for_each_word()` could likely benefit from being done in parallel across multiple worker threads.
Async file IO (either using native `<aio.h>`/`OVERLAPPED`/etc, or hand-rolling queued writes where `putc()` is called from another thread) could also possibly help to bring down total sample duration (see the sketch below).
However, any multi-threaded work can be handled/discussed in another PR.
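
As a concrete illustration of the queued-write idea (not part of this PR; all names here are hypothetical), a minimal sketch where the simulator thread hands output chunks to a dedicated writer thread. A real version would likely bound the queue so the producer can't run arbitrarily far ahead of the disk:

```cpp
#include <condition_variable>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative sketch only: the simulator thread enqueues output chunks and
// a dedicated worker thread drains them to a FILE*.
class AsyncChunkWriter {
  public:
    explicit AsyncChunkWriter(FILE *out) : out_(out), worker_([this] { drain(); }) {}

    ~AsyncChunkWriter() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();  // flush remaining chunks before destruction completes
    }

    // Called from the simulator thread; only blocks while moving the chunk in.
    void enqueue(std::vector<std::uint8_t> chunk) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(chunk));
        }
        cv_.notify_one();
    }

  private:
    void drain() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (true) {
            cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
            if (queue_.empty()) {
                return;  // done_ must be true here
            }
            std::vector<std::uint8_t> chunk = std::move(queue_.front());
            queue_.pop();
            lock.unlock();  // write without holding the lock
            std::fwrite(chunk.data(), 1, chunk.size(), out_);
            lock.lock();
        }
    }

    FILE *out_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::vector<std::uint8_t>> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after other members are ready
};
```

Hooking this into the writer (e.g., buffering what would have been `putc()` calls into chunks) is the part that would need care, and is left out here.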

Changes:

* Added an overload for `ReferenceSampleTree::decompress_into()` that works with `simd_bits` (see the sketch after this list).
  * Uses the `vector` overload (instead of using `operator[]` on the tree directly in the loop) as it is roughly the same speed when built normally, but much faster in debug builds (from what I saw).
* Updated `stim::command_sample()` to use `ReferenceSampleTree` instead of `TableauSimulator` for calculating the reference sample.
  * The output sample is still fully expanded out into a flat `simd_bits` for use with the compare / file-writing logic.
* Added a `--skip_loop_folding` CLI flag to disable `ReferenceSampleTree`, falling back to `TableauSimulator`.
  * Updated `command_sample_help()` to document the new flag.
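
A minimal sketch of what the new overload amounts to, reusing the existing `std::vector<bool>` overload as described above (the helper name and exact signature here are illustrative, not the PR's actual code):

```cpp
#include "stim.h"

// Illustrative sketch only: decompress the tree's reference sample into a
// simd_bits buffer by reusing the existing std::vector<bool> overload and
// copying the bits across. Assumes stim::ReferenceSampleTree exposes the
// vector-based decompress_into() described above.
template <size_t W>
void decompress_tree_into_simd_bits(const stim::ReferenceSampleTree &tree, stim::simd_bits<W> &out) {
    std::vector<bool> bits;
    tree.decompress_into(bits);  // vector overload: same speed in release, much faster in debug
    for (size_t k = 0; k < bits.size(); k++) {
        out[k] = bits[k];
    }
}
```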

@Strilanc (Collaborator) commented Feb 3, 2026

> In particular, any of the overloads for `simd_bits_range_ref::for_each_word()` could likely benefit from being done in parallel across multiple worker threads.

The number of words is of order 5 to 20. It's not nearly large enough for threads to work.

> With this change, this sample took ~2m 12s, a ~3.4x speedup (about as fast as not calculating a reference sample at all).

This seems surprisingly slow to me. I was expecting like a 100x speedup. Where is it spending its time?

@qec-pconner (Contributor Author) commented Feb 3, 2026

> The number of words is of order 5 to 20. It's not nearly large enough for threads to work.

Some hand-wavy estimation (commenting out / hacking the work being done, to just do less of it) shows that it might save ~15% (~20s out of the ~2m). I didn't actually try to see how it'd perform with actual multi-threading.

> This seems surprisingly slow to me. I was expecting like a 100x speedup. Where is it spending its time?

`rerun_frame_sim_while_streaming_measurements_to_disk()` calls `circuit.for_each_operation()`.
For a d25 x 1M round circuit, that's just going to take a while (it has to iterate over every single instruction to do the sim).
The only way to make that loop go any faster is to make the loop contents faster. The only two things in the loop are:

  • `sim.do_gate(op);`
  • `sim.m_record.intermediate_write_unwritten_results_to(writer, reference_sample);`

So, either the sim needs to go faster or writing results to disk needs to go faster.
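
For reference, that loop has roughly this shape (paraphrased from the above, not the exact stim source):

```cpp
// Paraphrased shape of the streaming-sample hot loop: every instruction in
// the (1M-round) circuit goes through do_gate, and accumulated measurement
// results are periodically flushed to the writer.
circuit.for_each_operation([&](const stim::CircuitInstruction &op) {
    sim.do_gate(op);
    sim.m_record.intermediate_write_unwritten_results_to(writer, reference_sample);
});
```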

Going through the big ol' switch/case statement in `do_gate()`, I tried commenting out each case to see how much of a maximum speedup I could get (if that respective operation took an ideal 0ns).
I saw that the biggest wins were for measurements (like `MRZ`) and conditional gates (like `ZCX`), both of which make use of `for_each_word()`.

Going into `intermediate_write_unwritten_results_to()`, I messed around with `MeasureRecordWriterFormatR8`'s `write_bytes()`, `write_bit()`, and `write_end()`.
Instead of `putc()`, I tried different syscalls and I tried batching writes; none of those made a big enough speedup to be worth a PR (some helped a little).
Obviously, if I just comment out the `putc()`, there's a huge speedup -- so, if that work can be moved out to another thread, I'd reckon it'd help. Just pushing out the data would likely still be a big bottleneck (even if deferred via another thread, because the presumably finite queue would eventually fill up), but it'd probably be beneficial.
But, again, not something that I'm going to figure out right now.

@qec-pconner (Contributor Author) commented Feb 3, 2026

> The number of words is of order 5 to 20. It's not nearly large enough for threads to work.

It's also worth mentioning that it's not just `for_each_word()`'s word count, but also the number of targets that are operated upon at once.
So, multi-threading could be used to perform all sub-operations (on each target / target pair) in parallel.
You'll end up seeing on the order of d^2 to 2·d^2 targets on each instruction (for d=25, that's roughly 625 to 1250 targets) -- which, for large-distance codes, is definitely something worth parallelizing (yes, much more so than `for_each_word()` by itself, but why not both? ~~besides the simpler impl :x~~).

@Strilanc (Collaborator) commented Feb 3, 2026

> `rerun_frame_sim_while_streaming_measurements_to_disk()` calls `circuit.for_each_operation()`.
> For a d25 x 1M round circuit, that's just going to take a while (it has to iterate over every single instruction to do the sim).

Computing the reference sample shouldn't be calling the frame simulator at all.

As an example, the pybind reference sample method looks like this:

    c.def(
        "reference_sample",
        [](const Circuit &self, bool bit_packed) {
            auto ref = TableauSimulator<MAX_BITWORD_WIDTH>::reference_sample_circuit(self);
            simd_bits_range_ref<MAX_BITWORD_WIDTH> reference_sample(ref.ptr_simd, ref.num_simd_words);
            size_t num_measure = self.count_measurements();
            return simd_bits_to_numpy(reference_sample, num_measure, bit_packed);
        },
        ...

Replace `TableauSimulator<MAX_BITWORD_WIDTH>::reference_sample_circuit(self)` with the tree stuff and there's no frame simulator involved.
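
Something like this hedged sketch, assuming the tree keeps the vector-based decompression described in the PR (names may not match the source exactly):

```cpp
// Hypothetical swap: build the reference sample via the tree instead of the
// tableau simulator, then expand it into flat bits. Assumes
// from_circuit_reference_sample() and the std::vector<bool>
// decompress_into() overload exist under these names.
auto tree = stim::ReferenceSampleTree::from_circuit_reference_sample(self);
std::vector<bool> bits;
tree.decompress_into(bits);  // folded loops expanded into a flat sample
```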

@Strilanc (Collaborator) commented Feb 3, 2026

Are you running the frame simulator because of sweep Paulis?

In that case what we need to do is ignore them when performing m2d, but precompute, for each sweep bit, which detectors it flips. That can also be done with tortoise-and-hare stuff.

I don't think there's anything in the initialization that needs to be taking minutes. It should be taking milliseconds.

@qec-pconner (Contributor Author) commented Feb 3, 2026

> Computing the reference sample shouldn't be calling the frame simulator at all.

> ... there's no frame simulator involved.

> Are you running the frame simulator because of sweep Paulis?

What?
As described, I'm running `stim sample ...` on the CLI.
My whole goal here is to make sampling run faster (take less time) for any circuit, but especially large circuits (high distance, high round count).

Calling `stim sample ...` internally runs `stim::command_sample()` in `src/stim/cmd/command_sample.cc` (where all the logic for sampling lives).

`stim::command_sample()` will first generate a reference sample ... for the explicit purpose of being passed to a `FrameSimulator` instance (when it calls `sample_batch_measurements_writing_results_to_disk()`).

There's no sweeps. There's no m2d.
It's just sampling data, which takes non-zero time -- scaling with instruction count (and target count per instruction), both of which scale with distance and/or round count.
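
So the flow is roughly as follows (paraphrased; `compute_reference_sample` is a stand-in and the real argument lists differ):

```cpp
// Paraphrased structure of stim::command_sample(): first compute a reference
// sample (TableauSimulator before this PR, ReferenceSampleTree after), then
// hand it to the frame simulator, which streams shots to disk.
simd_bits<MAX_BITWORD_WIDTH> reference_sample = compute_reference_sample(circuit);  // hypothetical helper
sample_batch_measurements_writing_results_to_disk(
    circuit, reference_sample, num_shots, out_file, out_format, rng);
```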

@Strilanc (Collaborator) commented Feb 3, 2026

Ah, okay. If you actually do want the sample (...which the title of the bug does suggest that you do...) then yes, the frame sim is needed.

@qec-pconner (Contributor Author) commented Feb 3, 2026

> Ah, okay. If you actually do want the sample (...which the title of the bug does suggest that you do...) then yes, the frame sim is needed.

Agreed.

@qec-pconner force-pushed the speedUpSamplingWithFasterRefSample branch 3 times, most recently from c4129dc to a8c7f06 on February 4, 2026 00:39
@qec-pconner force-pushed the speedUpSamplingWithFasterRefSample branch from a8c7f06 to d1d107a on February 4, 2026 01:13
@qec-pconner force-pushed the speedUpSamplingWithFasterRefSample branch from d1d107a to ad6d2f6 on February 4, 2026 01:15
@qec-pconner (Contributor Author) commented Feb 4, 2026

Thanks.
Let me know if you want me to look into speeding up `FrameSimulator` with multi-threading. I have an impl in mind and it'd probably reduce the duration by a lot. (It could be used in any context, not just sampling.)

@Strilanc (Collaborator) commented Feb 4, 2026

No, don't look into multi-threading. I looked into it in the past and it's really hard to get benefits within shots (except for transposing the tableau during measurements in the tableau simulator). Across shots, of course, you can have a simulator per thread when doing bulk sampling.

@Strilanc Strilanc merged commit 3a83081 into quantumlib:main Feb 4, 2026
58 checks passed