Releases: JuliaGPU/AcceleratedKernels.jl
v0.4.3
AcceleratedKernels v0.4.3
- Made ScanPrefixes the default accumulate / cumsum / cumprod algorithm. It is almost always faster on real-world data than DecoupledLookback, and doesn't depend on cross-block communication (even though theoretically DecoupledLookback has better asymptotic scalability).
- Prepared AcceleratedKernels for the future PoCL backend becoming the KernelAbstractions CPU default backend; the Threads-based algorithms will remain the defaults until PoCL ones become faster.
- A lot of housekeeping.
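The ScanPrefixes strategy can be illustrated on the CPU: each block is scanned independently while recording its total, the block totals are themselves scanned, and each block is then offset by the exclusive prefix of those totals — no cross-block communication is needed during the per-block passes. A minimal single-threaded Julia sketch (the function name and `block_size` default are illustrative, not the package's internals):

```julia
# Two-pass "scan prefixes" inclusive cumsum sketch.
function scan_prefixes(v::Vector{T}; block_size::Int = 4) where {T}
    out = similar(v)
    nblocks = cld(length(v), block_size)
    totals = Vector{T}(undef, nblocks)
    for b in 1:nblocks                     # pass 1: scan each block independently
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, length(v))
        acc = zero(T)
        for i in lo:hi
            acc += v[i]
            out[i] = acc
        end
        totals[b] = acc                    # record each block's total
    end
    for b in 2:nblocks                     # sequential scan of the block totals
        totals[b] += totals[b - 1]
    end
    for b in 2:nblocks                     # pass 2: add each block's exclusive prefix
        offset = totals[b - 1]
        for i in ((b - 1) * block_size + 1):min(b * block_size, length(v))
            out[i] += offset
        end
    end
    return out
end
```

In the real GPU kernels the per-block loops run in parallel; only the short scan over block totals is inherently serial, which is what decoupled lookback tries (and here is not needed) to overlap.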
Merged pull requests:
- Typo in `accumulate` benchmarks (#42) (@christiangnrd)
- Use UnsafeAtomics to fix race in accumulate (#44) (@vchuravy)
- Stop relying on backend type to determine algorithm used (#45) (@christiangnrd)
- Test both 1d `accumulate` algorithms when supported (#49) (@christiangnrd)
- `neutral_element` fixes (#52) (@christiangnrd)
- Deduplicate `reduce_group` (#55) (@christiangnrd)
- Tweak backend selection (#56) (@christiangnrd)
- Vc/accumulate alg: made ScanPrefixes the default accumulate algorithm; added atomic orderings to DecoupledLookback. (#57) (@anicusan)
Closed issues:
- Port over GPUArrays `neutral_element` fixes (#51)
v0.4.2
AcceleratedKernels v0.4.2
Changes
- Change the default accumulate algorithm to `ScanPrefixes`
- Fix a logic bug in `accumulate_nd`
Merged pull requests:
- Fix for `accumulate` by block (#47) (@christiangnrd)
- Switch default algorithm for accumulate to ScanPrefixes (#48) (@vchuravy)
Closed issues:
- Wrong `cumsum` results on ROCBackend (#41)
v0.4.1
AcceleratedKernels v0.4.1
Merged pull requests:
- Address #37 (#38) (@christiangnrd)
- [NFC] Reduce kwarg duplication (#39) (@christiangnrd)
Closed issues:
- `accumulate` algorithm selection for Metal implementation being overridden (#37)
v0.4.0
AcceleratedKernels v0.4.0
- Added multithreaded versions of all algorithms: `sort` (a parallel sample sort deferring to the Julia Base sort for independent slices), `mapreduce` and `accumulate` (including N-dimensional reductions). `sort` scales quite well for a problem with such heavy data dependencies - depending on the problem, we get even 75% strong scaling (e.g. `sortperm` on UInt32). `mapreduce` has almost perfect scaling. `accumulate` on a single thread is not better than the Base one, especially in N-dimensional cases; this is mainly due to calculating index offsets for each outer element - something to improve. It scales well though, and becomes faster at 3-4 threads.
- Removed the Polyester and OhMyThreads dependencies - now AcceleratedKernels does not bring in any backend stack, only the backend-agnostic KernelAbstractions, GPUArraysCore and ArgCheck. It is quite minimal in its dependencies, and this should help its potential use within GPUArrays.
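The near-perfect `mapreduce` scaling comes from a classic pattern: split the input into one contiguous chunk per task, reduce each chunk sequentially, then combine the per-chunk partials. A sketch of that pattern in plain Julia (the name `tmapreduce` and the `ntasks` keyword are illustrative, not the package's API; `init` is assumed to be a neutral element of `op`):

```julia
using Base.Threads

# Chunked multithreaded mapreduce: one contiguous chunk per task,
# sequential Base mapreduce within each chunk, then a final combine.
function tmapreduce(f, op, v::AbstractVector; init, ntasks::Int = Threads.nthreads())
    chunks = Iterators.partition(eachindex(v), cld(length(v), ntasks))
    tasks = [Threads.@spawn mapreduce(f, op, view(v, c); init) for c in chunks]
    return reduce(op, fetch.(tasks); init)   # combine per-chunk partials
end
```

Contiguous chunks keep memory access streaming within each task, which is why this shape scales so close to linearly.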
Breaking changes
- Removed the `scheduler` keyword for the multithreaded backend, which came from Polyester and OhMyThreads. To update code, simply remove the `scheduler` kwarg; the Base Threads will be used (which are extremely fast - with the same performance as OhMyThreads or Polyester, but more composable in layered multithreaded code)
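The composability point can be seen with nested tasks: Base's depth-first scheduler lets parallel code call other parallel code without oversubscribing threads, which a static per-call thread partitioning would not. An illustrative sketch, not package code:

```julia
using Base.Threads

# Inner parallel sum: split into two halves, one half in a spawned task.
function tsum(v)
    h = length(v) ÷ 2
    left = Threads.@spawn sum(view(v, 1:h))
    right = sum(view(v, (h + 1):length(v)))
    return fetch(left) + right
end

# Outer level: one task per column, each itself calling the parallel tsum.
# The cooperative scheduler interleaves both levels on the same thread pool.
nested_sum(m::Matrix) =
    sum(fetch.([Threads.@spawn tsum(view(m, :, j)) for j in axes(m, 2)]))
```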
Notes
- For the GPU backend, there are no breaking changes.
Merged pull requests:
- Fix `accumulate!` default argument bug (#31) (@christiangnrd)
- Remove GPUArrays dependency (#32) (@christiangnrd)
- Multithreaded CPU sample sort (#34) (@anicusan)
- [NFC] Split up tests into different files (#35) (@christiangnrd)
- [Metal] Use safe `block_size` in `accumulate` (#36) (@christiangnrd)
- Added multithreaded implementations of 1D and ND accumulate, mapreduce. Removed Polyester and OhMyThreads dependencies. (#40) (@anicusan)
v0.3.3
v0.3.2
AcceleratedKernels v0.3.2
Merged pull requests:
v0.3.1
v0.3.0
AcceleratedKernels v0.3.0
Breaking changes
- Respecting the `init` value for `mapreduce` and `accumulate` when it is not the binary operator `op`'s zero required introducing `neutral=GPUArrays.neutral_element(f, T)` as a keyword argument, which may break some code with custom `f` functions, hence the version increment.
- Since we were doing breaking changes anyway, we changed the `any` and `all` function signatures to the consistent `alg` specification.
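The reason `init` cannot double as the accumulator seed in a parallel reduction: each chunk seeds its own accumulator, so a non-neutral `init` would be folded in once per chunk rather than once overall. Seeding every chunk with the neutral element and applying `init` exactly once fixes this. A CPU sketch (the name `chunked_reduce` and its keywords are illustrative, not the package's API):

```julia
# Seed each chunk with `neutral`; apply the user's `init` a single time.
function chunked_reduce(op, v; init, neutral, nchunks::Int = 4)
    partials = [reduce(op, c; init = neutral)
                for c in Iterators.partition(v, cld(length(v), nchunks))]
    return op(init, reduce(op, partials; init = neutral))
end
```

Had the chunks been seeded with a non-neutral `init` instead, the result would contain `init` combined in `nchunks` times.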
Merged pull requests:
- CompatHelper: bump compat for oneAPI in [weakdeps] to 2, (keep existing compat) (#18) (@anicusan)
- mark values used in localmem initialization as uniform (#19) (@vchuravy)
- Added `neutral` as an additional parameter to mapreduce and accumulate (#20) (@anicusan)
Closed issues:
v0.2.2
AcceleratedKernels v0.2.2
- Added N-dimensional accumulate! implementation
- Added second 1-dimensional accumulate! algorithm which does not need stronger device-wide synchronisation guarantees (which, notably, Apple Metal does not offer, and so decoupled-lookback cannot work on this platform).
- Added extension system with different defaults for accumulate on Metal and any/all on oneAPI. Now all corner cases are tested and work.
- Added higher-order arithmetic functions: `sum`, `prod`, `minimum`, `maximum`, `count`, `cumsum`, `cumprod`
- Added one final `backend::Backend` argument to all functions to allow dispatch on them even when the input array is not transferred to the given backend (e.g. allowing ranges on GPUs).
There are no breaking changes - the new interfaces are a strict superset of previous ones.
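The trailing-backend-argument pattern works through multiple dispatch: the implementation is selected from an explicit backend value, so even inputs that carry no device information (such as ranges) can target a particular backend. A generic sketch in which `MyBackend`, `MyCPU`, `MyGPU` and `ksum` are illustrative stand-ins, not the package's types:

```julia
# Dispatch on an explicit trailing backend argument.
abstract type MyBackend end
struct MyCPU <: MyBackend end
struct MyGPU <: MyBackend end

ksum(v, ::MyCPU) = sum(v)          # serial CPU path
ksum(v, ::MyGPU) = sum(v)          # in a real library: launch a device kernel
ksum(v) = ksum(v, MyCPU())         # default backend when the argument is omitted
```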
Merged pull requests:
- Explicitly-defined backends and possible extensions with different defaults per platform (#14) (@anicusan)
- Added new `ScanPrefix` accumulate algorithm (#15) (@anicusan)
Closed issues:
- `accumulate` on Metal sometimes fails due to weaker `@synchronize` guarantees than on other platforms (#10)
v0.2.1
AcceleratedKernels v0.2.1
Merged pull requests:
- Add Buildkite CI for CUDA (#9) (@jpsamaroo)
- Added foreach + tests. Started updating indices within kernels to use local types without Int64 promotions - about 25% faster in sort, for example. Set default `block_size` to 256 (#11) (@anicusan)
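The index-width point is that mixing a 32-bit index with 64-bit literals promotes every index operation to `Int64`; keeping the arithmetic in `Int32` avoids that widening (which matters for register pressure in kernels). A CPU-side illustrative sketch, not a kernel:

```julia
# Keep index arithmetic in Int32 throughout; `% Int32` is a
# truncating conversion, and Int32 + Int32 stays Int32 in Julia.
function sum_int32_indexed(v::Vector{Float32})
    acc = 0.0f0
    i = Int32(1)
    n = length(v) % Int32
    while i <= n
        @inbounds acc += v[i]
        i += Int32(1)       # no promotion to Int64
    end
    return acc
end
```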
Closed issues:
- Support for a `:serial` scheduler (#7)