Speed up Stim Sampling with Faster Ref Sample #1036
Conversation
The number of words is of order 5 to 20. It's not nearly large enough for threads to work.
This seems surprisingly slow to me. I was expecting like a 100x speedup. Where is it spending its time?
Some hand-wavy estimations (commenting out / hacking work being done, to just do less of the work) show that it might save like ~15% (~20s out of the ~2m). I didn't actually try to see how it'd perform with actual multi-threading.
So, either the sim needs to go faster or writing results to disk needs to go faster. Going through the big ol' switch/case statement in
It's also worth mentioning that it's not just
Computing the reference sample shouldn't be calling the frame simulator at all. As an example, the pybind reference sample method looks like this: Replace
Are you running the frame simulator because of sweep Paulis? In that case what we need to do is ignore them when performing m2d, but have computed for each sweep bit which detectors it flips. That can also be done by tortoise-and-hare stuff. I don't think there's anything in the initialization that needs to be taking minutes. It should be taking milliseconds.
What? Calling
There's no sweeps. There's no m2d.
Ah, okay, if you actually do want the sample (...which the title of the bug does suggest you do...) then yes, the frame sim is needed.
Agreed.
Force-pushed from c4129dc to a8c7f06 (compare)
Force-pushed from a8c7f06 to d1d107a (compare)
This PR speeds up `stim sample` by switching the reference sample calculation from the `TableauSimulator` to the `ReferenceSampleTree`. Calculating the reference sample takes a large portion of the time for larger codes.

Testing of performance for larger codes (distance 25 at 1M rounds) was done by building stim with `bazel build :stim`, then running the following CLI command:

```
time bazel-bin/stim --gen surface_code --task rotated_memory_x --distance 25 --rounds 1000000 --after_clifford_depolarization 0.001 | bazel-bin/stim sample --shots 10 --out_format=r8 > ./debug.r8
```

Metrics given are based on my machine (linux), but all metrics should be considered relative to each other. The time taken for generating the circuit is trivial (< 0.1s). Before this change, this sample took ~7m 23s. With this change, it took ~2m 12s, a ~3.4x speedup (about as fast as not calculating a reference sample at all).

I also looked into `FrameSimulator`'s logic for more speedup opportunities. The only real opportunity I saw is multi-threading with worker threads. In particular, any of the overloads of `simd_bits_range_ref::for_each_word()` could likely benefit from being run in parallel across multiple worker threads. Async file IO (either using native `<aio.h>`/`OVERLAPPED`/etc, or hand-rolling queued writes where `putc()` is called from another thread) could also help bring down total sample duration. However, any multi-threaded work can be handled/discussed in another PR.

Changes:

* Added an overload for `ReferenceSampleTree::decompress_into()` that works with `simd_bits`.
  * Uses the `vector` overload (instead of using `operator[]` on the tree directly in the loop) as it is roughly the same speed when built normally, but much faster in debug builds (from what I saw).
* Updated `stim::command_sample()` to use `ReferenceSampleTree` instead of `TableauSimulator` for calculating the reference sample.
  * The output sample is still fully expanded into a flat `simd_bits` for use with the compare / file writing logic.
* Added a `--skip_loop_folding` CLI flag to disable `ReferenceSampleTree`, falling back to `TableauSimulator`.
* Updated `command_sample_help()` to document this new flag.
Force-pushed from d1d107a to ad6d2f6 (compare)
Thanks.
No, don't look into multi threading. I looked into it in the past and it's really hard to get benefits within shots (except for transposing the tableau during measurements in the tableau simulator). And of course having a simulator per thread when doing bulk sampling.