Conversation
…for fresh fills with simple calls. Determinism obviously remains if an RNG is passed
…mittent test failure. Fence Float64 tests to CPU-only to avoid breaking CI on Metal and oneAPI
src/rand/rand.jl
```julia
function CounterRNG(; alg::CounterRNGAlgorithm=Philox())
    CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
end
```
```diff
-    CounterRNG(Random.rand(Random.default_rng(), UInt64); alg)
+    CounterRNG(Random.rand(UInt64); alg)
```
Yes, agreed on this. I had originally written the `default_rng` call to be explicit about the RNG used, but that is moot anyway. This is now `Base.rand(UInt64)` (yes, `Base`, not `Random`...).
Foremost, thank you, this looks like quite some work! One of the things that makes this a bit more challenging is that it implements random number generation "from scratch"; perhaps you could reuse some of the work in Random123.jl?

Is there a particular reason why you chose to implement this in AcceleratedKernels.jl and not GPUArrays.jl? (Apologies if there was a prior conversation that I missed.)

To make sure I understand this PR correctly, it currently uses a random seed drawn per call?

@maleadt might have some comments as well.
…en AK.rand!(x) is called. This gives stream-like behaviour without any on-device state, while still being faster than CURAND. Add RNGTest in prototyping for BigCrush and SmallCrush (Philox, SM64, and Threefry all now confirmed to pass in the AK implementation)
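For context on the Philox algorithm these commits reference, its core is a keyed permutation of a counter block. A standalone Philox4x32-10 sketch, assuming the standard Random123 constants, might look like this (names and structure are illustrative, not AcceleratedKernels.jl's actual internals):

```julia
# Hedged sketch of a Philox4x32-10 keyed counter permutation, using the
# conventional Random123 multipliers and Weyl key increments.
const PHILOX_M0 = 0xD2511F53   # round multipliers
const PHILOX_M1 = 0xCD9E8D57
const PHILOX_W0 = 0x9E3779B9   # Weyl key increments
const PHILOX_W1 = 0xBB67AE85

# Split a 32x32 -> 64-bit product into (high, low) 32-bit words.
mulhilo(a::UInt32, b::UInt32) = (p = UInt64(a) * UInt64(b); (UInt32(p >> 32), p % UInt32))

# Ten rounds over a 4-word counter with a 2-word key; returns 4 random words.
function philox4x32(ctr::NTuple{4,UInt32}, key::NTuple{2,UInt32})
    c0, c1, c2, c3 = ctr
    k0, k1 = key
    for _ in 1:10
        hi0, lo0 = mulhilo(PHILOX_M0, c0)
        hi1, lo1 = mulhilo(PHILOX_M1, c2)
        c0, c1, c2, c3 = hi1 ⊻ c1 ⊻ k0, lo1, hi0 ⊻ c3 ⊻ k1, lo0
        k0 += PHILOX_W0   # bump the key between rounds (wraps on UInt32)
        k1 += PHILOX_W1
    end
    return (c0, c1, c2, c3)
end
```

Because the output is a pure function of (counter, key), the same counter always yields the same words, which is what makes the statelessness discussed below possible.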
Thanks very much for the time and review @vchuravy :) I shall answer in order!

On Random123.jl: I considered reusing it / introducing it as a dep, but I wanted direct control over the algorithms, round counts, and lane widths, and I was not happy with the structure required to use the existing Philox2x / Philox4x structs in Random123.jl. It could have been done, but in practice inlining the relevant Philox/Threefry code in AK was not much code, and it let the implementation fit very naturally into the rand_uint -> rand_scalar pipeline for producing any requested scalar type.

There was no previous discussion about rand! in AK vs GPUArrays AFAIK, but for me it makes far more sense to include it in AK. It's intended to be backend-agnostic and user-facing (rather than an array interface primitive), which I feel fits quite nicely with the rest of AK.

And yes! When you replied to this PR, that was exactly the functionality. However, I've now developed that a bit further: CounterRNG keeps a persistent host-side offset that is advanced on each call. I don't feel that there is a perfect way to do this, and am very keen to hear @maleadt's thoughts on it, but I think that is the best way to implement an RNG that has no persistent state on-device.

My understanding is that in CUDA.jl the decision was made to have a persistent advancing warp-level state with two counters, one per-warp and one per-thread, and that this decision was made to avoid the excessive register pressure that would come with persistent thread-level state. I do think this makes sense if one insists on maintaining RNG state on-device, but I don't think that is necessary! And I personally think it behaves slightly strangely, as I believe the determinism is dependent on how the array is parallelised. With my implementation the random elements depend purely on their array index. I think with the current implementation (persistent host-side offset) we get the best of both worlds: streaming behaviour that most users would expect, and near-optimal performance (memory-bound on both my CUDA and oneAPI tests). I have confirmed 16 registers/thread with Nsight Compute (compared with 36 for CURAND, which spills); nowhere near spilling.

I have added a BigCrush test (see src/prototype/RNGTest/) which SplitMix64, Philox, and Threefry all pass. Honestly, because of Philox's speed and (now proven) thoroughness, I don't see much point in including SM64 or Threefry (4x slower), but they are nice for completeness.

Very open to any feedback or discussion! Particularly relating to opinions on how best to handle the pseudo-streaming behaviour, stateless vs slightly host-stateful, etc. Clearly we want to have 100% determinism where required, but without sacrificing convenience. Thanks again!

Benchmarks (n.b. CUDA.rand! with a DenseCuArray{Float32} will forward to CURAND):
```
BenchmarkTools.Trial: 4338 samples with 1 evaluation per sample.
 Range (min … max):  1.077 ms … 25.152 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.117 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.145 ms ± 597.561 μs ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 592 bytes, allocs estimate: 37.
```

```
BenchmarkTools.Trial: 4731 samples with 1 evaluation per sample.
 Range (min … max):  988.900 μs … 23.319 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.031 ms                ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.051 ms ± 477.906 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 1.75 KiB, allocs estimate: 54.
```
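The index-based counter scheme and host-side offset described above can be sketched as follows, using SplitMix64 as the per-counter mixing function. Helper names like `counter_fill!` are hypothetical, not AK's actual API; the point is that every element is a pure function of (seed, global counter), so results do not depend on how the array is parallelised:

```julia
# SplitMix64: one mixing pass over a 64-bit input (UInt64 arithmetic wraps).
function splitmix64(x::UInt64)
    x += 0x9e3779b97f4a7c15
    x = (x ⊻ (x >> 30)) * 0xbf58476d1ce4e5b9
    x = (x ⊻ (x >> 27)) * 0x94d049bb133111eb
    return x ⊻ (x >> 31)
end

# Element i of a fill starting at host-side `offset` depends only on
# (seed, offset + i - 1), never on thread/block assignment.
rand_at(seed::UInt64, counter::UInt64) = splitmix64(seed + counter)

# Returns the advanced offset; advancing by length(x) after each call gives
# the stream-like behaviour without any on-device state.
function counter_fill!(x::AbstractVector{UInt64}, seed::UInt64, offset::UInt64)
    for i in eachindex(x)   # on GPU this loop would be the parallel kernel
        x[i] = rand_at(seed, offset + UInt64(i - 1))
    end
    return offset + UInt64(length(x))
end
```

Filling two halves of an array in separate calls (with the offset advanced in between) then reproduces a single full-array fill exactly, which is the determinism property argued for above.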
This PR introduces `rand!` support in AcceleratedKernels for in-place random filling on CPU and GPU arrays. The new API fills existing arrays directly (`rand!`) and supports deterministic seeded generation through `CounterRNG`, which holds an offset that is advanced by `length(x)` on each call to produce streaming RNG behaviour.

Supported scalar element types are:

- `UInt8`, `UInt16`, `UInt32`, `UInt64`
- `Int8`, `Int16`, `Int32`, `Int64`
- `Float16`, `Float32`, `Float64`

Integer outputs are produced from random bit patterns, and float outputs are produced in `[0, 1)` using mantissa-based construction.

Three counter-based algorithms are included: `SplitMix64`, `Philox`, and `Threefry`. `Philox` is the default because it has a stronger statistical pedigree while maintaining comparable throughput (390 GB/s compared with `CUDA.rand!`'s 382 GB/s on an RTX 5060).

Implementation-wise, generation is built around raw unsigned outputs (`UInt32`/`UInt64`) and a shared conversion path to the requested element type. `rand!` uses a single backend-agnostic parallel fill path with a stable linear counter mapping (`UInt64(i - 1)`), so results are reproducible for fixed seed, algorithm, shape, and eltype.

The PR also adds convenience construction (`CounterRNG()` auto-seeded once) and convenience filling (`rand!(x)` creates a fresh RNG per call), plus docs and a full rand test suite structured around constructors, bit conversion, scalar generation, and array fill behaviour.

Tested locally with CUDA and oneAPI.
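As an illustration of the mantissa-based `[0, 1)` construction mentioned above, one common form keeps the top 24 (resp. 53) random bits and scales them into the unit interval (a hedged sketch; `uint_to_float` is a hypothetical name, not the package's API):

```julia
# Keep only as many random bits as the float's significand can represent,
# then scale; the result lies in [0, 1) and can never round up to 1.0.
uint_to_float(u::UInt32) = Float32(u >> 8) * Float32(exp2(-24))   # 24 bits
uint_to_float(u::UInt64) = Float64(u >> 11) * exp2(-53)           # 53 bits
```

Dropping the low bits before converting means every representable output is exact, so the conversion path stays deterministic across backends.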