add citor runtime by Lallapallooza · Pull Request #14 · tzcnt/runtime-benchmarks

Lallapallooza · 2026-05-19T20:17:24Z

Adds the four canonical benchmarks (fib, skynet, nqueens, matmul) wired against citor v0.4.5 via CPM. Mirrors libfork's source shape: same N, same recursion structure, same validators, same warmup + timing scope.

Linux + Windows presets only, citor does not currently support ARM or macOS.

Adds the four canonical benchmarks (fib, skynet, nqueens, matmul) wired against citor v0.4.5 via CPM. Mirrors libfork's source shape: same N, same recursion structure, same validators, same warmup + timing scope. Linux + Windows presets only; citor does not currently support ARM or macOS.

Lallapallooza · 2026-05-19T20:20:25Z

Tried on a Ryzen 9 9950X3D, clang 21 -O3 -march=znver5 -mavx2 -mfma, libtcmalloc_minimal:

Runtime	citor	libfork	TooManyCooks	tbb
Mean Ratio to Best (lower is better)	1.05x	1.91x	2.25x	3.02x
skynet	66404 us	97793 us	124709 us	303372 us
nqueens	222053 us	183640 us	189241 us	288862 us
fib(39)	127825 us	136704 us	214037 us	631889 us
matmul(2048)	61837 us	252671 us	272875 us	61529 us

Runtime	TooManyCooks
Mean Ratio to Best (lower is better)	1.00x
io_socket_st	245070 us

Runtime	TooManyCooks_mt	TooManyCooks_st_asio
Mean Ratio to Best (lower is better)	1.00x	1.11x
channel	384155 us	427709 us

Peak Memory Usage (Max RSS)

Runtime	libfork	TooManyCooks	TooManyCooks_st_asio	TooManyCooks_mt	tbb	citor
skynet	16.87 MB	16.87 MB	N/A	N/A	16.87 MB	16.87 MB
nqueens	16.87 MB	16.87 MB	N/A	N/A	16.87 MB	16.87 MB
fib(39)	16.87 MB	16.87 MB	N/A	N/A	16.87 MB	16.87 MB
matmul(2048)	56.12 MB	56.27 MB	N/A	N/A	56.38 MB	56.23 MB
io_socket_st	N/A	16.87 MB	N/A	N/A	N/A	N/A
channel	N/A	N/A	16.87 MB	16.87 MB	N/A	N/A

Lallapallooza · 2026-05-19T20:20:37Z

@tzcnt please review

Lallapallooza · 2026-05-20T00:04:49Z

Note: citor's default affinity caps at physical cores, so on an SMT box ThreadPool(32) uses 16 workers. Affinity::PerCpuSmtPair opts into all 32. Doesn't affect SMT-off hardware, and the ranking holds either way.

tzcnt · 2026-05-20T00:34:55Z

So the benchmarks sweep from 1 to 32 threads on the 5950X for example. We'd want to set Affinity::PerCpuSmtPair only when threads == 32 then? Is there a simple way to detect this programmatically? For example with TMC you can query the physical core count

runtime-benchmarks/cpp/TooManyCooks/threads_sweep.cpp

Line 33 in ed983af

auto puCount = topo.pu_count();

and then check if the requested threads is equal to that. I know you have some hardware query mechanism inside citor but not sure if it's publicly exposed.

Actually since you don't support ARM at the moment I think we can just check if thread_count == std::thread::hardware_concurrency() (assuming that x86 machines all have SMT)

tzcnt · 2026-05-20T00:38:55Z

Overall the performance is quite impressive. I'll do a full run on all my hardware and update the results ASAP. One thing to note, it hits a stack overflow when running fib 45, due to the recursive inline call to the last element of each group. I think you could work around this by tracking the inline call depth and forcing the stack to unwind when it passes a certain threshold by posting the last element to the local deque instead of calling it directly.

Lallapallooza · 2026-05-20T01:56:25Z

Pushed the affinity fix you suggested. Each bench picks PerCpuSmtPair when thread_count == std::thread::hardware_concurrency() and PerCpu below that, so the top of the sweep uses every logical CPU. No-op on SMT-off hosts.

On fib(45): confirmed, real citor bug, and you diagnosed it correctly, a worker in the fork-join drain runs a stolen task that forks and re-enters the drain, so deep trees descend without unwinding. As a temporary workaround you can bump citor's worker stack with -DCITOR_WORKER_STACK_KIB=65536 (unused pages stay unmapped, so RSS is unaffected).

I'll land the proper depth-cap fix in a separate PR.

Select PerCpuSmtPair when the requested thread count equals hardware_concurrency() so the full-thread sweep point uses every logical CPU; citor's default PerCpu caps workers at the physical-core count. Set CITOR_WORKER_STACK_KIB=65536 so deep recursive fork-join does not overflow the worker stack on high core counts.

Lallapallooza · 2026-05-20T02:01:20Z

I've set CITOR_WORKER_STACK_KIB=65536 in the bench's citor CMakeLists, so deep recursion works out of the box here. The proper depth-cap fix in citor itself I'll land separately. Thanks for the finding!

Lallapallooza force-pushed the add-citor branch from e7b2013 to c479789 Compare May 20, 2026 01:59

tzcnt merged commit 66ab963 into tzcnt:main May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add citor runtime#14

add citor runtime#14
tzcnt merged 2 commits into
tzcnt:mainfrom
Lallapallooza:add-citor

Lallapallooza commented May 19, 2026

Uh oh!

Lallapallooza commented May 19, 2026 •

edited

Loading

Uh oh!

Lallapallooza commented May 19, 2026

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

tzcnt commented May 20, 2026

Uh oh!

tzcnt commented May 20, 2026 •

edited

Loading

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Lallapallooza commented May 19, 2026

Uh oh!

Lallapallooza commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Peak Memory Usage (Max RSS)

Uh oh!

Lallapallooza commented May 19, 2026

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

tzcnt commented May 20, 2026

Uh oh!

tzcnt commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

Lallapallooza commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Lallapallooza commented May 19, 2026 •

edited

Loading

tzcnt commented May 20, 2026 •

edited

Loading