
On-device rand! #80

Open
fjbarter wants to merge 10 commits into JuliaGPU:main from fjbarter:rand_fill

Conversation

@fjbarter commented Mar 16, 2026

This PR introduces rand! support in AcceleratedKernels for in-place random filling of CPU and GPU arrays. The new API fills existing arrays directly (rand!) and supports deterministic seeded generation through CounterRNG, which holds an offset that is advanced by length(x) on each call, producing streaming RNG behaviour.

Supported scalar element types are:

  • UInt8, UInt16, UInt32, UInt64
  • Int8, Int16, Int32, Int64
  • Float16, Float32, Float64

Integer outputs are produced from random bit patterns, and float outputs are produced in [0, 1) using mantissa-based construction.
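
As a rough sketch of the mantissa-based construction (the PR's exact bit layout may differ, and `bits_to_float32` is a hypothetical name), a UInt32 can be mapped to a Float32 in [0, 1) by placing its top 23 bits into the mantissa of a value in [1, 2) and subtracting 1:

```julia
# Sketch only: mantissa-based UInt32 -> Float32 in [0, 1).
# Assumption: the PR may use a different but equivalent bit layout.
function bits_to_float32(u::UInt32)
    mant = u >> 9                              # keep the top 23 bits as the mantissa
    bits = mant | reinterpret(UInt32, 1.0f0)   # exponent bits of 1.0f0 => value in [1, 2)
    return reinterpret(Float32, bits) - 1.0f0  # shift down to [0, 1)
end
```

The same trick applies to Float64 (52 mantissa bits) and Float16 (10 mantissa bits).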

Three counter-based algorithms are included: SplitMix64, Philox, and Threefry. Philox is the default because it has a stronger statistical pedigree while maintaining comparable throughput (390 GB/s versus CUDA.rand!'s 382 GB/s on an RTX 5060).
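
For reference, the core of a counter-based generator like Philox2x32 is a short keyed mixing round. A minimal sketch, using the standard constants from the Philox paper rather than anything copied from this PR's source:

```julia
# Sketch of Philox2x32 (Salmon et al. constants); not the PR's actual code.
const PHILOX_M = 0xD256D193                # Philox-2x32 multiplier
const PHILOX_W = 0x9E3779B9                # Weyl constant for the key schedule

function philox2x32(ctr::NTuple{2,UInt32}, key::UInt32; rounds::Int=10)
    x0, x1 = ctr
    for _ in 1:rounds
        prod = UInt64(PHILOX_M) * UInt64(x0)   # widening multiply
        hi = (prod >> 32) % UInt32
        lo = prod % UInt32
        x0, x1 = hi ⊻ key ⊻ x1, lo             # mix high word with key and other lane
        key += PHILOX_W                        # bump the key each round
    end
    return (x0, x1)
end
```

Each round is bijective in the counter for a fixed key, which is what makes distinct counters yield distinct outputs.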

Implementation-wise, generation is built around raw Unsigned outputs (UInt32 / UInt64) and a shared conversion path to the requested element type. rand! uses a single backend-agnostic parallel fill path with a stable linear counter mapping (UInt64(i-1)), so results are reproducible for a fixed seed, algorithm, shape, and eltype.
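
The fill path described above can be pictured as a stateless per-element computation. A hypothetical serial sketch (the real kernel is parallel, and the way seed and counter are combined here is illustrative, reusing SplitMix64's finalizer):

```julia
# SplitMix64 finalizer: mixes a 64-bit input into a well-distributed output.
function splitmix64(z::UInt64)
    z += 0x9E3779B97F4A7C15
    z = (z ⊻ (z >> 30)) * 0xBF58476D1CE4E5B9
    z = (z ⊻ (z >> 27)) * 0x94D049BB133111EB
    return z ⊻ (z >> 31)
end

# Hypothetical fill: element i depends only on (seed, offset + i - 1),
# so results are reproducible for a fixed seed, shape, and eltype.
function counter_fill!(x::AbstractVector{UInt64}, seed::UInt64, offset::UInt64)
    for i in eachindex(x)                   # launched as a parallel kernel in practice
        ctr = offset + UInt64(i - 1)        # stable linear counter mapping
        x[i] = splitmix64(seed ⊻ splitmix64(ctr))
    end
    return x
end
```

Because each element is a pure function of its linear index, the result is independent of how the iteration space is partitioned across threads.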

The PR also adds convenience construction (CounterRNG() auto-seeded once) and convenience filling (rand!(x) creates a fresh RNG per call), plus docs and a full rand test suite structured around constructors, bit conversion, scalar generation, and array fill behavior.

Tested locally with CUDA and oneAPI.

src/rand/rand.jl Outdated


function CounterRNG(; alg::CounterRNGAlgorithm=Philox())
    CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
end
Suggested change
CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
CounterRNG(Random.rand(UInt64); alg)

fjbarter (Author) replied:

Yes, agreed on this - I had originally written the default_rng call to be explicit about the RNG used, but that is moot anyway. This is now Base.rand(UInt64)
(yes, Base, not Random...)

@vchuravy (Member) commented:

Foremost, thank you, this looks like quite some work! One of the things that makes this a bit more challenging is that it implements random number generation "from scratch"; perhaps you could reuse some of the work in
Random123.jl and RandomNumbers.jl? (I think this is what CUDA.jl uses, see JuliaGPU/CUDA.jl#882.)

Is there a particular reason why you chose to implement this in AcceleratedKernels.jl and not GPUArrays.jl? (Apologies if there was a prior conversation that I missed.)

To make sure I understand this PR correctly, it currently uses a random seed drawn per AK.rand!(A), but repeated calls using the same CounterRNG are reproducible.

rng = CounterRNG()
AK.rand!(rng, A)
AK.rand!(rng, B)
A == B

@maleadt might have some comments as well.

…en AK.rand!(x) is called. This gives stream-like behaviour without any on-device state but still faster than CURAND. Add RNGTest in prototyping for BigCrush and SmallCrush (Philox, SM64, and Threefry all now confirmed to pass in the AK implementation)
@fjbarter (Author) commented:

> Foremost, thank you, this looks like quite some work! One of the things that makes this a bit more challenging is that it implements random number generation "from scratch"; perhaps you could reuse some of the work in Random123.jl and RandomNumbers.jl? (I think this is what CUDA.jl uses, see JuliaGPU/CUDA.jl#882.)
>
> Is there a particular reason why you chose to implement this in AcceleratedKernels.jl and not GPUArrays.jl? (Apologies if there was a prior conversation that I missed.)
>
> To make sure I understand this PR correctly, it currently uses a random seed drawn per AK.rand!(A), but repeated calls using the same CounterRNG are reproducible.
>
>     rng = CounterRNG()
>     AK.rand!(rng, A)
>     AK.rand!(rng, B)
>     A == B
>
> @maleadt might have some comments as well.

Thanks very much for the time and review @vchuravy :)

I shall answer in order!

On Random123.jl: I considered reusing/introducing it as a dep, but I wanted direct control over the algorithms, round counts, and lane widths, and I was not happy with the structure required to use the existing Philox2x / Philox4x structs in Random123.jl. It could have been done, but in practice inlining the relevant Philox/Threefry code in AK required little code, and it let the implementation fit very naturally into the rand_uint -> rand_scalar pipeline for producing any requested scalar type.
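
As a toy illustration of such a rand_uint -> rand_scalar pipeline (hypothetical signatures, not the PR's actual code): one raw bit source, one conversion method per requested element type.

```julia
# Hypothetical conversion path from raw 64-bit output to the requested eltype.
rand_scalar(::Type{T}, bits::UInt64) where {T<:Integer} = bits % T  # keep the low bits

# Floats reuse a mantissa-based [0, 1) construction.
rand_scalar(::Type{Float64}, bits::UInt64) =
    reinterpret(Float64, (bits >> 12) | reinterpret(UInt64, 1.0)) - 1.0    # top 52 bits
rand_scalar(::Type{Float32}, bits::UInt64) =
    reinterpret(Float32, UInt32(bits >> 41) | reinterpret(UInt32, 1.0f0)) - 1.0f0  # top 23 bits
```

Dispatching on the target type keeps the bit generation itself eltype-agnostic.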

No previous discussion about rand! in AK vs GPUArrays AFAIK, but for me it makes far more sense to include it in AK. It's intended to be backend-agnostic and user-facing (rather than an array interface primitive), which I feel fits quite nicely with the rest of AK.

And yes! When you replied to this PR, that was exactly the functionality. However, I've now developed it a bit further so that AK.rand! behaves like a streaming RNG, via a tiny piece of host-side mutable state which holds an offset used to advance the 'location' of the stream in the RNG (explained slightly better in the docs...). So effectively, calling AK.rand! on a 100-element vector twice gives the same 200 numbers as calling it once on a 200-element vector.
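
A self-contained toy illustrating that streaming contract (the names and the mixing function are hypothetical, not the PR's actual CounterRNG):

```julia
# Toy host-side streaming RNG: the only mutable state is an offset that
# advances by length(x) on each fill, so two 100-element fills reproduce
# one 200-element fill exactly.
mutable struct ToyCounterRNG
    seed::UInt64
    offset::UInt64
end
ToyCounterRNG(seed::Integer) = ToyCounterRNG(UInt64(seed), UInt64(0))

function mix(z::UInt64)                      # SplitMix64-style finalizer
    z += 0x9E3779B97F4A7C15
    z = (z ⊻ (z >> 30)) * 0xBF58476D1CE4E5B9
    z = (z ⊻ (z >> 27)) * 0x94D049BB133111EB
    return z ⊻ (z >> 31)
end

function toy_rand!(rng::ToyCounterRNG, x::AbstractVector{UInt64})
    for i in eachindex(x)
        x[i] = mix(rng.seed ⊻ mix(rng.offset + UInt64(i - 1)))
    end
    rng.offset += length(x)                  # advance the stream past this fill
    return x
end
```

Since each element depends only on seed and stream position, the device kernel stays stateless while the host-side offset provides the streaming behaviour.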

I don't feel that there is a perfect way to do this, and am very keen to hopefully hear @maleadt 's thoughts on it, but I think that is the best way to implement an RNG that has no persistent state on-device. My understanding is that in CUDA.jl the decision was made to have a persistent advancing warp-level state with two counters: one per-warp and one per-thread, and that this decision was made to avoid the excessive register pressure that would come with persistent thread-level state. I do think this makes sense if one insists on maintaining rng state on-device, but I don't think that is necessary! And I personally think it behaves slightly strangely, as I believe the determinism is dependent on how the array is parallelised? With my implementation the random elements depend purely on their array index.

I think with the current implementation (persistent host-side offset), we get the best of both worlds, i.e. streaming behaviour that most users would expect, and near-optimal performance (memory-bound in both my CUDA and oneAPI tests). I have confirmed 16 registers/thread with Nsight Compute (compared with 36 for CURAND, which spills), nowhere near spilling.

I have added a BigCrush test (see src/prototype/RNGTest/), which SplitMix64, Philox, and Threefry all pass. Honestly, because of Philox's speed and (now proven) statistical thoroughness, I don't see much point in including SM64 or Threefry (4x slower), but they are nice for completeness.

Very open to any feedback or discussion! Particularly relating to opinions on how best to handle the pseudo-streaming behaviour, stateless vs slightly host-stateful etc - clearly we want to have 100% determinism where required but without sacrificing convenience.

Thanks again

Benchmarks

(n.b. CUDA.rand! with a DenseCuArray{Float32} will forward to CURAND):

CUDA.rand! benchmark (CuArray{Float32}, in-place)

BenchmarkTools.Trial: 4338 samples with 1 evaluation per sample.
 Range (min … max):  1.077 ms … 25.152 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.117 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.145 ms ± 597.561 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 592 bytes, allocs estimate: 37.

AK.rand! Philox benchmark (GPU, CuArray{Float32}) - 7 Philox2x32 rounds (the bare minimum to pass BigCrush), although 10 rounds is only a tiny bit slower

BenchmarkTools.Trial: 4731 samples with 1 evaluation per sample.
 Range (min … max):  988.900 μs … 23.319 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.031 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.051 ms ± 477.906 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 1.75 KiB, allocs estimate: 54.
