Skip to content

add citor runtime#14

Merged
tzcnt merged 2 commits into
tzcnt:mainfrom
Lallapallooza:add-citor
May 20, 2026
Merged

add citor runtime#14
tzcnt merged 2 commits into
tzcnt:mainfrom
Lallapallooza:add-citor

Conversation

@Lallapallooza
Copy link
Copy Markdown
Contributor

Adds the four canonical benchmarks (fib, skynet, nqueens, matmul) wired against citor v0.4.5 via CPM. Mirrors libfork's source shape: same N, same recursion structure, same validators, same warmup + timing scope.

Linux + Windows presets only, citor does not currently support ARM or macOS.

Adds the four canonical benchmarks (fib, skynet, nqueens, matmul)
wired against citor v0.4.5 via CPM. Mirrors libfork's source shape:
same N, same recursion structure, same validators, same warmup +
timing scope.

Linux + Windows presets only; citor does not currently support ARM
or macOS.
@Lallapallooza
Copy link
Copy Markdown
Contributor Author

Lallapallooza commented May 19, 2026

Tried on a Ryzen 9 9950X3D, clang 21 -O3 -march=znver5 -mavx2 -mfma, libtcmalloc_minimal:

Runtime citor libfork TooManyCooks tbb
Mean Ratio to Best
(lower is better)
1.05x 1.91x 2.25x 3.02x
skynet 66404 us 97793 us 124709 us 303372 us
nqueens 222053 us 183640 us 189241 us 288862 us
fib(39) 127825 us 136704 us 214037 us 631889 us
matmul(2048) 61837 us 252671 us 272875 us 61529 us
Runtime TooManyCooks
Mean Ratio to Best
(lower is better)
1.00x
io_socket_st 245070 us
Runtime TooManyCooks_mt TooManyCooks_st_asio
Mean Ratio to Best
(lower is better)
1.00x 1.11x
channel 384155 us 427709 us

Peak Memory Usage (Max RSS)

Runtime libfork TooManyCooks TooManyCooks_st_asio TooManyCooks_mt tbb citor
skynet 16.87 MB 16.87 MB N/A N/A 16.87 MB 16.87 MB
nqueens 16.87 MB 16.87 MB N/A N/A 16.87 MB 16.87 MB
fib(39) 16.87 MB 16.87 MB N/A N/A 16.87 MB 16.87 MB
matmul(2048) 56.12 MB 56.27 MB N/A N/A 56.38 MB 56.23 MB
io_socket_st N/A 16.87 MB N/A N/A N/A N/A
channel N/A N/A 16.87 MB 16.87 MB N/A N/A

@Lallapallooza
Copy link
Copy Markdown
Contributor Author

@tzcnt please review

@Lallapallooza
Copy link
Copy Markdown
Contributor Author

Note: citor's default affinity caps at physical cores, so on an SMT box ThreadPool(32) uses 16 workers. Affinity::PerCpuSmtPair opts into all 32. Doesn't affect SMT-off hardware, and the ranking holds either way.

@tzcnt
Copy link
Copy Markdown
Owner

tzcnt commented May 20, 2026

So the benchmarks sweep from 1 to 32 threads on the 5950X for example. We'd want to set Affinity::PerCpuSmtPair only when threads == 32 then? Is there a simple way to detect this programmatically? For example with TMC you can query the physical core count

auto puCount = topo.pu_count();
and then check if the requested threads is equal to that. I know you have some hardware query mechanism inside citor but not sure if it's publicly exposed.

Actually since you don't support ARM at the moment I think we can just check if thread_count == std::thread::hardware_concurrency() (assuming that x86 machines all have SMT)

@tzcnt
Copy link
Copy Markdown
Owner

tzcnt commented May 20, 2026

Overall the performance is quite impressive. I'll do a full run on all my hardware and update the results ASAP. One thing to note, it hits a stack overflow when running fib 45, due to the recursive inline call to the last element of each group. I think you could work around this by tracking the inline call depth and forcing the stack to unwind when it passes a certain threshold by posting the last element to the local deque instead of calling it directly.

@Lallapallooza
Copy link
Copy Markdown
Contributor Author

Pushed the affinity fix you suggested. Each bench picks PerCpuSmtPair when thread_count == std::thread::hardware_concurrency() and PerCpu below that, so the top of the sweep uses every logical CPU. No-op on SMT-off hosts.

On fib(45): confirmed, real citor bug, and you diagnosed it correctly, a worker in the fork-join drain runs a stolen task that forks and re-enters the drain, so deep trees descend without unwinding. As a temporary workaround you can bump citor's worker stack with -DCITOR_WORKER_STACK_KIB=65536 (unused pages stay unmapped, so RSS is unaffected).

I'll land the proper depth-cap fix in a separate PR.

Select PerCpuSmtPair when the requested thread count equals
hardware_concurrency() so the full-thread sweep point uses every
logical CPU; citor's default PerCpu caps workers at the physical-core
count.

Set CITOR_WORKER_STACK_KIB=65536 so deep recursive fork-join does not
overflow the worker stack on high core counts.
@Lallapallooza
Copy link
Copy Markdown
Contributor Author

I've set CITOR_WORKER_STACK_KIB=65536 in the bench's citor CMakeLists, so deep recursion works out of the box here. The proper depth-cap fix in citor itself I'll land separately. Thanks for the finding!

@tzcnt tzcnt merged commit 66ab963 into tzcnt:main May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants