
Releases: JuliaGPU/AcceleratedKernels.jl

v0.4.3

23 Jul 19:26
c146374


AcceleratedKernels v0.4.3

Diff since v0.4.2

  • Made ScanPrefixes the default accumulate / cumsum / cumprod algorithm. It is almost always faster on real-world data than DecoupledLookback and does not depend on cross-block communication (even though DecoupledLookback has better asymptotic scalability in theory).
  • Prepared AcceleratedKernels for the future PoCL backend becoming the KernelAbstractions CPU default; the Threads-based algorithms will remain the defaults until the PoCL ones become faster.
  • A lot of housekeeping.
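The appeal of a prefix-scan without cross-block communication can be seen in a rough Base-Julia sketch of the general block-scan idea: each block is scanned independently, then a scan over the per-block totals supplies each block's offset. This is an illustration of the technique only, not AcceleratedKernels' actual kernel code; `blockwise_scan` and its `block_size` keyword are hypothetical names.

```julia
# Block-wise inclusive scan: phases 1 and 3 are embarrassingly parallel
# across blocks; no block ever waits on another block's running prefix.
function blockwise_scan(op, v::AbstractVector; block_size::Int=4)
    out = similar(v)
    n = length(v)
    nblocks = cld(n, block_size)
    # Phase 1: independent per-block inclusive scans.
    for b in 1:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        out[lo:hi] .= accumulate(op, view(v, lo:hi))
    end
    # Phase 2: scan the per-block totals to get each block's offset.
    totals = [out[min(b * block_size, n)] for b in 1:nblocks]
    offsets = accumulate(op, totals)
    # Phase 3: fold the previous blocks' offset into each block.
    for b in 2:nblocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        out[lo:hi] .= op.(offsets[b - 1], view(out, lo:hi))
    end
    out
end

blockwise_scan(+, collect(1:10)) == cumsum(1:10)  # true
```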

Merged pull requests:

Closed issues:

  • Port over GPUArrays neutral_element fixes (#51)

v0.4.2

01 Jul 12:18
06c2594


AcceleratedKernels v0.4.2

Diff since v0.4.1

Changes

  • Change the default accumulate algorithm to ScanPrefixes
  • Fix a logic bug in accumulate_nd

Merged pull requests:

Closed issues:

  • Wrong cumsum results on ROCBackend (#41)

v0.4.1

27 May 15:55
6e38e60


AcceleratedKernels v0.4.1

Diff since v0.4.0

Merged pull requests:

Closed issues:

  • accumulate algorithm selection for Metal implementation being overridden (#37)

v0.4.0

25 May 20:37
14de3f2


AcceleratedKernels v0.4.0

Diff since v0.3.3

  • Added multithreaded versions of all algorithms: sort (a parallel sample sort deferring to the Julia Base sort for independent slices), mapreduce and accumulate (including N-dimensional reductions).
    • sort scales quite well for a problem with such heavy data dependencies - depending on the problem, we see up to 75% strong scaling (e.g. sortperm on UInt32).
    • mapreduce has almost perfect scaling.
    • accumulate on a single thread is no faster than the Base version, especially in N-dimensional cases, mainly due to computing index offsets for each outer element - an area to improve. It scales well, though, and overtakes Base at 3-4 threads.
  • Removed the Polyester and OhMyThreads dependencies - AcceleratedKernels now brings in no backend stack, only the backend-agnostic KernelAbstractions, GPUArraysCore and ArgCheck. Its minimal dependency footprint should help its potential use within GPUArrays.
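The near-perfect mapreduce scaling mentioned above comes from the classic chunked pattern: each task reduces a contiguous chunk, then the partial results are combined. A minimal Base-only sketch of that pattern (assuming an associative op; `threaded_mapreduce` is a hypothetical name, not AK's implementation):

```julia
# Chunked parallel map-reduce: one task per chunk, serial final combine.
function threaded_mapreduce(f, op, v::AbstractVector)
    chunk = cld(length(v), Threads.nthreads())
    tasks = [Threads.@spawn mapreduce(f, op, view(v, idx))
             for idx in Iterators.partition(eachindex(v), chunk)]
    # Combine the per-chunk partial results.
    reduce(op, fetch.(tasks))
end

threaded_mapreduce(abs2, +, collect(1:100)) == sum(abs2, 1:100)  # true
```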

Breaking changes

  • Removed the scheduler keyword for the multithreaded backend, which came from Polyester and OhMyThreads. To update code, simply remove the scheduler kwarg; the Base Threads will then be used (which are extremely fast - matching OhMyThreads and Polyester in performance while being more composable in layered multithreaded code).
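The composability point is that Base's task-based scheduler handles nested parallelism gracefully: tasks spawned inside already-parallel code are scheduled depth-first rather than serialised. A small Base-only illustration (hypothetical function, not AK code):

```julia
# Nested task parallelism with Base Threads: outer tasks per column,
# inner tasks splitting each column - both levels compose.
function nested_sums(mat::AbstractMatrix)
    outer = map(axes(mat, 2)) do j
        Threads.@spawn begin
            col = view(mat, :, j)
            mid = length(col) ÷ 2
            # Inner task: reduce the first half while we do the second.
            a = Threads.@spawn sum(view(col, 1:mid))
            b = sum(view(col, mid+1:length(col)))
            fetch(a) + b
        end
    end
    sum(fetch.(outer))
end

nested_sums(reshape(collect(1.0:12.0), 3, 4)) == 78.0  # true
```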

Notes

  • For the GPU backend, there are no breaking changes.

Merged pull requests:

v0.3.3

29 Mar 18:33
0b99fbf


AcceleratedKernels v0.3.3

Diff since v0.3.2

Merged pull requests:

Closed issues:

  • Reduce slower with KernelAbstractions 0.9.34 (#25)
  • mapreduce broken for multi-thread CPU mapreduce with non-zero init (#28)
  • searchsortedfirst! should accept AbstractRange as x (#29)
  • indice in _forindices_global! should be wrapped by Val. (#30)

v0.3.2

09 Mar 21:20
e3a2eb1


AcceleratedKernels v0.3.2

Diff since v0.3.1

Merged pull requests:

  • Added unsafe_indices to kernels, with Local/Group indices changes where needed (#26) (@anicusan)

v0.3.1

20 Feb 01:26
c01e7c2


AcceleratedKernels v0.3.1

Diff since v0.3.0

Merged pull requests:

Closed issues:

  • Conflicting dwarf version errors (#13)
  • accumulate does not respect init as an initial value and treats it as op zero (#16)

v0.3.0

05 Feb 02:36
6661069


AcceleratedKernels v0.3.0

Diff since v0.2.2

Breaking changes

  • Respecting the init value for mapreduce and accumulate when it is not the neutral element of the binary operator op required introducing neutral=GPUArrays.neutral_element(f, T) as a keyword argument, which may break some code with custom f functions - hence the version increment.
  • Since we were making breaking changes anyway, the any and all function signatures were also changed to the consistent alg specification.
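Why a separate neutral is needed: in a parallel reduction each block is seeded before reducing its chunk, so a non-neutral init would be folded in once per block instead of exactly once. A Base-only arithmetic sketch of the problem (illustrative values, not AK's kernels):

```julia
# Reduce v = [1, 2, 3, 4] with op = + across two "blocks".
v    = [1, 2, 3, 4]
init = 10  # NOT the neutral element of + (which is 0)

# Seeding every block with init counts it once per block: wrong.
wrong = (init + sum(v[1:2])) + (init + sum(v[3:4]))   # 30

# Seed blocks with the neutral element, apply init once at the end.
right = init + (0 + sum(v[1:2])) + (0 + sum(v[3:4]))  # 20
```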

Merged pull requests:

  • CompatHelper: bump compat for oneAPI in [weakdeps] to 2, (keep existing compat) (#18) (@anicusan)
  • mark values used in localmem initialization as uniform (#19) (@vchuravy)
  • Added neutral as an additional parameter to mapreduce and accumulate (#20) (@anicusan)

Closed issues:

  • Support for dims kwarg (#6)
  • Compat Helper PRs not getting created (#17)

v0.2.2

30 Dec 06:24
69462f3


AcceleratedKernels v0.2.2

Diff since v0.2.1

  • Added an N-dimensional accumulate! implementation
  • Added a second 1-dimensional accumulate! algorithm which does not need stronger device-wide synchronisation guarantees (which, notably, Apple Metal does not offer, so decoupled lookback cannot work on that platform).
    • Added an extension system with different defaults for accumulate on Metal and any/all on oneAPI. All corner cases are now tested and working.
  • Added higher-level arithmetic functions: sum, prod, minimum, maximum, count, cumsum, cumprod
  • Added a final backend::Backend argument to all functions to allow dispatching on a backend even when the input array has not been transferred to it (e.g. allowing ranges on GPUs).
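Semantically, these higher-level functions are thin specialisations of mapreduce and accumulate. The Base-Julia equivalences below sketch that relationship (this shows the semantics only, not how AK wires them up internally):

```julia
# Each convenience function pins the map function and/or operator.
v = [2, 3, 4]

@assert sum(v)          == mapreduce(identity, +, v)
@assert prod(v)         == mapreduce(identity, *, v)
@assert minimum(v)      == mapreduce(identity, min, v)
@assert maximum(v)      == mapreduce(identity, max, v)
@assert count(isodd, v) == mapreduce(x -> isodd(x) ? 1 : 0, +, v)
@assert cumsum(v)       == accumulate(+, v)
@assert cumprod(v)      == accumulate(*, v)
```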

There are no breaking changes - the new interfaces are a strict superset of previous ones.

Merged pull requests:

  • Explicitly-defined backends and possible extensions with different defaults per platform (#14) (@anicusan)
  • Added new ScanPrefix accumulate algorithm (#15) (@anicusan)

Closed issues:

  • accumulate on Metal sometimes fails due to weaker @synchronize guarantees than on other platforms (#10)

v0.2.1

01 Dec 04:24
99247e6


AcceleratedKernels v0.2.1

Diff since v0.2.0

Merged pull requests:

  • Add Buildkite CI for CUDA (#9) (@jpsamaroo)
  • added foreach + tests. Started updating indices within kernels to use… local types without int64 promotions - about 25% faster in sort for example. Set default block_size to 256 (#11) (@anicusan)

Closed issues:

  • Support for a :serial scheduler (#7)