Releases: JuliaGPU/AcceleratedKernels.jl
v0.4.3
AcceleratedKernels v0.4.3
- Made ScanPrefixes the default accumulate / cumsum / cumprod algorithm. It is almost always faster on real-world data than DecoupledLookback, and doesn't depend on cross-block communication (even though theoretically DecoupledLookback has better asymptotic scalability).
- Prepared AcceleratedKernels for the future PoCL backend becoming the KernelAbstractions CPU default backend; the Threads-based algorithms will remain the defaults until PoCL ones become faster.
- A lot of housekeeping.
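The ScanPrefixes strategy can be illustrated on the CPU: each block is scanned independently while recording its total, the block totals are themselves scanned, and each block is then offset by the exclusive prefix of those totals — no cross-block communication is needed during the per-block passes. A minimal single-threaded Julia sketch (the function name and `block_size` default are illustrative, not the package's internals):

```julia
# Two-pass "scan prefixes" inclusive cumsum sketch.
function scan_prefixes(v::Vector{T}; block_size::Int = 4) where {T}
    out = similar(v)
    nblocks = cld(length(v), block_size)
    totals = Vector{T}(undef, nblocks)
    for b in 1:nblocks                     # pass 1: scan each block independently
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, length(v))
        acc = zero(T)
        for i in lo:hi
            acc += v[i]
            out[i] = acc
        end
        totals[b] = acc                    # record each block's total
    end
    for b in 2:nblocks                     # sequential scan of the block totals
        totals[b] += totals[b - 1]
    end
    for b in 2:nblocks                     # pass 2: add each block's exclusive prefix
        offset = totals[b - 1]
        for i in ((b - 1) * block_size + 1):min(b * block_size, length(v))
            out[i] += offset
        end
    end
    return out
end
```

In the real GPU kernels the per-block loops run in parallel; only the short scan over block totals is inherently serial, which is what decoupled lookback tries (and here is not needed) to overlap.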
Merged pull requests:
- Typo in `accumulate` benchmarks (#42) (@christiangnrd)
- Use UnsafeAtomics to fix race in accumulate (#44) (@vchuravy)
- Stop relying on backend type to determine algorithm used (#45) (@christiangnrd)
- Test both 1d `accumulate` algorithms when supported (#49) (@christiangnrd)
- `neutral_element` fixes (#52) (@christiangnrd)
- Deduplicate `reduce_group` (#55) (@christiangnrd)
- Tweak backend selection (#56) (@christiangnrd)
- Vc/accumulate alg: made ScanPrefixes the default accumulate algorithm; added atomic orderings to DecoupledLookback. (#57) (@anicusan)
Closed issues:
- Port over GPUArrays `neutral_element` fixes (#51)
v0.4.2
AcceleratedKernels v0.4.2
Changes
- Change the default accumulate algorithm to `ScanPrefixes`
- Fix a logic bug in `accumulate_nd`
Merged pull requests:
- Fix for `accumulate` by block (#47) (@christiangnrd)
- Switch default algorithm for accumulate to ScanPrefixes (#48) (@vchuravy)
Closed issues:
- Wrong `cumsum` results on ROCBackend (#41)
v0.4.1
AcceleratedKernels v0.4.1
Merged pull requests:
- Address #37 (#38) (@christiangnrd)
- [NFC] Reduce kwarg duplication (#39) (@christiangnrd)
Closed issues:
- `accumulate` algorithm selection for Metal implementation being overridden (#37)
v0.4.0
AcceleratedKernels v0.4.0
- Added multithreaded versions of all algorithms: `sort` (a parallel sample sort deferring to the Julia Base sort for independent slices), `mapreduce` and `accumulate` (including N-dimensional reductions). `sort` scales quite well for a problem with such heavy data dependencies - depending on the problem, we get even 75% strong scaling (e.g. `sortperm` on UInt32). `mapreduce` has almost perfect scaling. `accumulate` on a single thread is not better than the Base one, especially in N-dimensional cases; this is mainly due to calculating index offsets for each outer element - something to improve. It scales well though, and becomes faster at 3-4 threads.
- Removed the Polyester and OhMyThreads dependencies - now AcceleratedKernels does not bring in any backend stack, only the backend-agnostic KernelAbstractions, GPUArraysCore and ArgCheck. It is quite minimal in its dependencies, and this should help its potential use within GPUArrays.
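The near-perfect `mapreduce` scaling comes from a classic pattern: split the input into one contiguous chunk per task, reduce each chunk sequentially, then combine the per-chunk partials. A sketch of that pattern in plain Julia (the name `tmapreduce` and the `ntasks` keyword are illustrative, not the package's API; `init` is assumed to be a neutral element of `op`):

```julia
using Base.Threads

# Chunked multithreaded mapreduce: one contiguous chunk per task,
# sequential Base mapreduce within each chunk, then a final combine.
function tmapreduce(f, op, v::AbstractVector; init, ntasks::Int = Threads.nthreads())
    chunks = Iterators.partition(eachindex(v), cld(length(v), ntasks))
    tasks = [Threads.@spawn mapreduce(f, op, view(v, c); init) for c in chunks]
    return reduce(op, fetch.(tasks); init)   # combine per-chunk partials
end
```

Contiguous chunks keep memory access streaming within each task, which is why this shape scales so close to linearly.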
Breaking changes
- Removed the `scheduler` keyword for the multithreaded backend, which came from Polyester and OhMyThreads. To update code, simply remove the `scheduler` kwarg; the Base Threads will be used (which are extremely fast - with the same performance as OhMyThreads or Polyester, but more composable in layered multithreaded code)
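The composability point can be seen with nested tasks: Base's depth-first scheduler lets parallel code call other parallel code without oversubscribing threads, which a static per-call thread partitioning would not. An illustrative sketch, not package code:

```julia
using Base.Threads

# Inner parallel sum: split into two halves, one half in a spawned task.
function tsum(v)
    h = length(v) ÷ 2
    left = Threads.@spawn sum(view(v, 1:h))
    right = sum(view(v, (h + 1):length(v)))
    return fetch(left) + right
end

# Outer level: one task per column, each itself calling the parallel tsum.
# The cooperative scheduler interleaves both levels on the same thread pool.
nested_sum(m::Matrix) =
    sum(fetch.([Threads.@spawn tsum(view(m, :, j)) for j in axes(m, 2)]))
```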
Notes
- For the GPU backend, there are no breaking changes.
Merged pull requests:
- Fix `accumulate!` default argument bug (#31) (@christiangnrd)
- Remove GPUArrays dependency (#32) (@christiangnrd)
- Multithreaded CPU sample sort (#34) (@anicusan)
- [NFC] Split up tests into different files (#35) (@christiangnrd)
- [Metal] Use safe `block_size` in `accumulate` (#36) (@christiangnrd)
- Added multithreaded implementations of 1D and ND accumulate, mapreduce. Removed Polyester and OhMyThreads dependencies. (#40) (@anicusan)
v0.3.3
v0.3.2
AcceleratedKernels v0.3.2
Merged pull requests:
v0.3.1
v0.3.0
AcceleratedKernels v0.3.0
Breaking changes
- Respecting the `init` value for `mapreduce` and `accumulate` when it is not the binary operator `op`'s zero required introducing `neutral=GPUArrays.neutral_element(f, T)` as a keyword argument, which may break some code with custom `f` functions, hence the version increment.
- Since we were doing breaking changes anyway, we changed the `any` and `all` function signatures to the consistent `alg` specification.
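The reason `init` cannot double as the accumulator seed in a parallel reduction: each chunk seeds its own accumulator, so a non-neutral `init` would be folded in once per chunk rather than once overall. Seeding every chunk with the neutral element and applying `init` exactly once fixes this. A CPU sketch (the name `chunked_reduce` and its keywords are illustrative, not the package's API):

```julia
# Seed each chunk with `neutral`; apply the user's `init` a single time.
function chunked_reduce(op, v; init, neutral, nchunks::Int = 4)
    partials = [reduce(op, c; init = neutral)
                for c in Iterators.partition(v, cld(length(v), nchunks))]
    return op(init, reduce(op, partials; init = neutral))
end
```

Had the chunks been seeded with a non-neutral `init` instead, the result would contain `init` combined in `nchunks` times.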
Merged pull requests:
- CompatHelper: bump compat for oneAPI in [weakdeps] to 2, (keep existing compat) (#18) (@anicusan)
- mark values used in localmem initialization as uniform (#19) (@vchuravy)
- Added `neutral` as an additional parameter to mapreduce and accumulate (#20) (@anicusan)
Closed issues:
v0.2.2
AcceleratedKernels v0.2.2
- Added N-dimensional accumulate! implementation
- Added second 1-dimensional accumulate! algorithm which does not need stronger device-wide synchronisation guarantees (which, notably, Apple Metal does not offer, and so decoupled-lookback cannot work on this platform).
- Added extension system with different defaults for accumulate on Metal and any/all on oneAPI. Now all corner cases are tested and work.
- Added higher-order arithmetic functions: `sum`, `prod`, `minimum`, `maximum`, `count`, `cumsum`, `cumprod`
- Added one final `backend::Backend` argument to all functions to allow dispatch on them even when the input array is not transferred to the given backend (e.g. allowing ranges on GPUs).
There are no breaking changes - the new interfaces are a strict superset of previous ones.
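The trailing-backend-argument pattern works through multiple dispatch: the implementation is selected from an explicit backend value, so even inputs that carry no device information (such as ranges) can target a particular backend. A generic sketch in which `MyBackend`, `MyCPU`, `MyGPU` and `ksum` are illustrative stand-ins, not the package's types:

```julia
# Dispatch on an explicit trailing backend argument.
abstract type MyBackend end
struct MyCPU <: MyBackend end
struct MyGPU <: MyBackend end

ksum(v, ::MyCPU) = sum(v)          # serial CPU path
ksum(v, ::MyGPU) = sum(v)          # in a real library: launch a device kernel
ksum(v) = ksum(v, MyCPU())         # default backend when the argument is omitted
```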
Merged pull requests:
- Explicitly-defined backends and possible extensions with different defaults per platform (#14) (@anicusan)
- Added new `ScanPrefix` accumulate algorithm (#15) (@anicusan)
Closed issues:
- `accumulate` on Metal sometimes fails due to weaker `@synchronize` guarantees than on other platforms (#10)
v0.2.1
AcceleratedKernels v0.2.1
Merged pull requests:
- Add Buildkite CI for CUDA (#9) (@jpsamaroo)
- Added foreach + tests. Started updating indices within kernels to use local types without Int64 promotions - about 25% faster in sort, for example. Set default `block_size` to 256 (#11) (@anicusan)
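The index-width point is that mixing a 32-bit index with 64-bit literals promotes every index operation to `Int64`; keeping the arithmetic in `Int32` avoids that widening (which matters for register pressure in kernels). A CPU-side illustrative sketch, not a kernel:

```julia
# Keep index arithmetic in Int32 throughout; `% Int32` is a
# truncating conversion, and Int32 + Int32 stays Int32 in Julia.
function sum_int32_indexed(v::Vector{Float32})
    acc = 0.0f0
    i = Int32(1)
    n = length(v) % Int32
    while i <= n
        @inbounds acc += v[i]
        i += Int32(1)       # no promotion to Int64
    end
    return acc
end
```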
Closed issues:
- Support for a `:serial` scheduler (#7)