
On-device rand! #80

Open
fjbarter wants to merge 10 commits into JuliaGPU:main from fjbarter:rand_fill

Conversation

@fjbarter commented Mar 16, 2026

This PR introduces rand! support in AcceleratedKernels for in-place random filling of CPU and GPU arrays. The new API fills existing arrays directly (rand!) and supports deterministic seeded generation through CounterRNG, which holds an offset that is advanced by length(x) on each call, producing streaming RNG behaviour.

Supported scalar element types are:

  • UInt8, UInt16, UInt32, UInt64
  • Int8, Int16, Int32, Int64
  • Float16, Float32, Float64

Integer outputs are produced from random bit patterns, and float outputs are produced in [0, 1) using mantissa-based construction.
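
As a rough sketch of the mantissa-based construction (the PR's exact bit layout may differ, and `bits_to_float32` is a hypothetical name), a UInt32 can be mapped to a Float32 in [0, 1) by placing its top 23 bits into the mantissa of a value in [1, 2) and subtracting 1:

```julia
# Sketch only: mantissa-based UInt32 -> Float32 in [0, 1).
# Assumption: the PR may use a different but equivalent bit layout.
function bits_to_float32(u::UInt32)
    mant = u >> 9                              # keep the top 23 bits as the mantissa
    bits = mant | reinterpret(UInt32, 1.0f0)   # exponent bits of 1.0f0 => value in [1, 2)
    return reinterpret(Float32, bits) - 1.0f0  # shift down to [0, 1)
end
```

The same trick applies to Float64 (52 mantissa bits) and Float16 (10 mantissa bits).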

Three counter-based algorithms are included: SplitMix64, Philox, and Threefry. Philox is the default because it has a stronger statistical pedigree while maintaining comparable throughput (390 GB/s versus CUDA.rand!'s 382 GB/s on an RTX 5060).
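
For reference, the core of a counter-based generator like Philox2x32 is a short keyed mixing round. A minimal sketch, using the standard constants from the Philox paper rather than anything copied from this PR's source:

```julia
# Sketch of Philox2x32 (Salmon et al. constants); not the PR's actual code.
const PHILOX_M = 0xD256D193                # Philox-2x32 multiplier
const PHILOX_W = 0x9E3779B9                # Weyl constant for the key schedule

function philox2x32(ctr::NTuple{2,UInt32}, key::UInt32; rounds::Int=10)
    x0, x1 = ctr
    for _ in 1:rounds
        prod = UInt64(PHILOX_M) * UInt64(x0)   # widening multiply
        hi = (prod >> 32) % UInt32
        lo = prod % UInt32
        x0, x1 = hi ⊻ key ⊻ x1, lo             # mix high word with key and other lane
        key += PHILOX_W                        # bump the key each round
    end
    return (x0, x1)
end
```

Each round is bijective in the counter for a fixed key, which is what makes distinct counters yield distinct outputs.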

Implementation-wise, generation is built around raw Unsigned outputs (UInt32 / UInt64) and a shared conversion path to the requested element type. rand! uses a single backend-agnostic parallel fill path with a stable linear counter mapping (UInt64(i-1)), so results are reproducible for a fixed seed, algorithm, shape, and eltype.
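
The fill path described above can be pictured as a stateless per-element computation. A hypothetical serial sketch (the real kernel is parallel, and the way seed and counter are combined here is illustrative, reusing SplitMix64's finalizer):

```julia
# SplitMix64 finalizer: mixes a 64-bit input into a well-distributed output.
function splitmix64(z::UInt64)
    z += 0x9E3779B97F4A7C15
    z = (z ⊻ (z >> 30)) * 0xBF58476D1CE4E5B9
    z = (z ⊻ (z >> 27)) * 0x94D049BB133111EB
    return z ⊻ (z >> 31)
end

# Hypothetical fill: element i depends only on (seed, offset + i - 1),
# so results are reproducible for a fixed seed, shape, and eltype.
function counter_fill!(x::AbstractVector{UInt64}, seed::UInt64, offset::UInt64)
    for i in eachindex(x)                   # launched as a parallel kernel in practice
        ctr = offset + UInt64(i - 1)        # stable linear counter mapping
        x[i] = splitmix64(seed ⊻ splitmix64(ctr))
    end
    return x
end
```

Because each element is a pure function of its linear index, the result is independent of how the iteration space is partitioned across threads.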

The PR also adds convenience construction (CounterRNG() auto-seeded once) and convenience filling (rand!(x) creates a fresh RNG per call), plus docs and a full rand test suite structured around constructors, bit conversion, scalar generation, and array fill behavior.

Tested locally with CUDA and oneAPI.

src/rand/rand.jl Outdated


function CounterRNG(; alg::CounterRNGAlgorithm=Philox())
    CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
end
Suggested change
CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
CounterRNG(Random.rand(UInt64); alg)

fjbarter (Author) replied:

Yes, agreed on this - I had originally written the default_rng call to be explicit about the RNG used, but that is moot anyway. This is now Base.rand(UInt64)
(yes, Base, not Random...)

@vchuravy (Member) commented:

Foremost, thank you, this looks like quite some work! One of the things that makes this a bit more challenging is that it implements random number generation "from scratch"; perhaps you could reuse some of the work in
Random123.jl and RandomNumbers.jl? (I think this is what CUDA.jl uses, see JuliaGPU/CUDA.jl#882.)

Is there a particular reason why you chose to implement this in AcceleratedKernels.jl and not GPUArrays.jl? (Apologies if there was a prior conversation that I missed.)

To make sure I understand this PR correctly, it currently uses a random seed drawn per AK.rand!(A), but repeated calls using the same CounterRNG are reproducible.

rng = CounterRNG()
AK.rand!(rng, A)
AK.rand!(rng, B)
A == B

@maleadt might have some comments as well.

…en AK.rand!(x) is called. This gives stream-like behaviour without any on-device state but still faster than CURAND. Add RNGTest in prototyping for BigCrush and SmallCrush (Philox, SM64, and Threefry all now confirmed to pass in the AK implementation)
@fjbarter (Author) commented:

> Foremost, thank you, this looks like quite some work! One of the things that makes this a bit more challenging is that it implements random number generation "from scratch"; perhaps you could reuse some of the work in Random123.jl and RandomNumbers.jl? (I think this is what CUDA.jl uses, see JuliaGPU/CUDA.jl#882.)
>
> Is there a particular reason why you chose to implement this in AcceleratedKernels.jl and not GPUArrays.jl? (Apologies if there was a prior conversation that I missed.)
>
> To make sure I understand this PR correctly, it currently uses a random seed drawn per AK.rand!(A), but repeated calls using the same CounterRNG are reproducible.
>
>     rng = CounterRNG()
>     AK.rand!(rng, A)
>     AK.rand!(rng, B)
>     A == B
>
> @maleadt might have some comments as well.

Thanks very much for the time and review @vchuravy :)

I shall answer in order!

On Random123.jl: I considered reusing/introducing it as a dep, but I wanted direct control over the algorithms, round counts, and lane widths, and I was not happy with the structure required to use the existing Philox2x / Philox4x structs in Random123.jl. It could have been done, but in practice inlining the relevant Philox/Threefry code in AK required little code, and it let the implementation fit very naturally into the rand_uint -> rand_scalar pipeline for producing any requested scalar type.
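
As a toy illustration of such a rand_uint -> rand_scalar pipeline (hypothetical signatures, not the PR's actual code): one raw bit source, one conversion method per requested element type.

```julia
# Hypothetical conversion path from raw 64-bit output to the requested eltype.
rand_scalar(::Type{T}, bits::UInt64) where {T<:Integer} = bits % T  # keep the low bits

# Floats reuse a mantissa-based [0, 1) construction.
rand_scalar(::Type{Float64}, bits::UInt64) =
    reinterpret(Float64, (bits >> 12) | reinterpret(UInt64, 1.0)) - 1.0    # top 52 bits
rand_scalar(::Type{Float32}, bits::UInt64) =
    reinterpret(Float32, UInt32(bits >> 41) | reinterpret(UInt32, 1.0f0)) - 1.0f0  # top 23 bits
```

Dispatching on the target type keeps the bit generation itself eltype-agnostic.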

No previous discussion about rand! in AK vs GPUArrays AFAIK, but for me it makes far more sense to include it in AK. It's intended to be backend-agnostic and user-facing (rather than an array interface primitive), which I feel fits quite nicely with the rest of AK.

And yes! When you replied to this PR, that was exactly the functionality. However, I've now developed it a bit further so that AK.rand! behaves like a streaming RNG, via a tiny piece of host-side mutable state which holds an offset used to advance the 'location' of the stream in the RNG (explained slightly better in the docs...). So effectively, calling AK.rand! on a 100-element vector twice gives the same 200 numbers as calling it once on a 200-element vector.
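
A self-contained toy illustrating that streaming contract (the names and the mixing function are hypothetical, not the PR's actual CounterRNG):

```julia
# Toy host-side streaming RNG: the only mutable state is an offset that
# advances by length(x) on each fill, so two 100-element fills reproduce
# one 200-element fill exactly.
mutable struct ToyCounterRNG
    seed::UInt64
    offset::UInt64
end
ToyCounterRNG(seed::Integer) = ToyCounterRNG(UInt64(seed), UInt64(0))

function mix(z::UInt64)                      # SplitMix64-style finalizer
    z += 0x9E3779B97F4A7C15
    z = (z ⊻ (z >> 30)) * 0xBF58476D1CE4E5B9
    z = (z ⊻ (z >> 27)) * 0x94D049BB133111EB
    return z ⊻ (z >> 31)
end

function toy_rand!(rng::ToyCounterRNG, x::AbstractVector{UInt64})
    for i in eachindex(x)
        x[i] = mix(rng.seed ⊻ mix(rng.offset + UInt64(i - 1)))
    end
    rng.offset += length(x)                  # advance the stream past this fill
    return x
end
```

Since each element depends only on seed and stream position, the device kernel stays stateless while the host-side offset provides the streaming behaviour.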

I don't feel that there is a perfect way to do this, and am very keen to hopefully hear @maleadt 's thoughts on it, but I think that is the best way to implement an RNG that has no persistent state on-device. My understanding is that in CUDA.jl the decision was made to have a persistent advancing warp-level state with two counters: one per-warp and one per-thread, and that this decision was made to avoid the excessive register pressure that would come with persistent thread-level state. I do think this makes sense if one insists on maintaining rng state on-device, but I don't think that is necessary! And I personally think it behaves slightly strangely, as I believe the determinism is dependent on how the array is parallelised? With my implementation the random elements depend purely on their array index.

I think with the current implementation (persistent host-side offset), we get the best of both worlds, i.e. streaming behaviour that most users would expect, and near-optimal performance (memory-bound in both my CUDA and oneAPI tests). I have confirmed 16 registers/thread with Nsight Compute (compared with 36 for CURAND, which spills), nowhere near spilling.

I have added a BigCrush test (see src/prototype/RNGTest/), which SplitMix64, Philox, and Threefry all pass. Honestly, because of Philox's speed and (now proven) statistical thoroughness, I don't see much point in including SM64 or Threefry (4x slower), but they are nice for completeness.

Very open to any feedback or discussion! Particularly relating to opinions on how best to handle the pseudo-streaming behaviour, stateless vs slightly host-stateful etc - clearly we want to have 100% determinism where required but without sacrificing convenience.

Thanks again

Benchmarks

(n.b. CUDA.rand! with a DenseCuArray{Float32} will forward to CURAND):

CUDA.rand! benchmark (CuArray{Float32}, in-place)

BenchmarkTools.Trial: 4338 samples with 1 evaluation per sample.
 Range (min … max):  1.077 ms … 25.152 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.117 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.145 ms ± 597.561 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 592 bytes, allocs estimate: 37.

AK.rand! Philox benchmark (GPU, CuArray{Float32}) - 7 Philox2x32 rounds (the bare minimum to pass BigCrush), although 10 rounds is only a tiny bit slower

BenchmarkTools.Trial: 4731 samples with 1 evaluation per sample.
 Range (min … max):  988.900 μs … 23.319 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):       1.031 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):     1.051 ms ± 477.906 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 1.75 KiB, allocs estimate: 54.
