Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions quest/include/environment.h
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,14 @@ int isQuESTEnvInit();
QuESTEnv getQuESTEnv();


/** @notyetdoced
* GPU thread per block control
* This is somehow probably the best pre-existing place for this. It only really applies to GPU, because for
* OpenMP the user can just export OMP_NUM_THREADS or call omp_set_num_threads.
*/
int getQuESTGpuThreadsPerBlock();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include Num somewhere, e.g.

  • getNumQuESTGpuThreadsPerBlock
  • getQuESTNumGpuThreadsPerBlock

I've so far tried to avoid abbreviating where feasible.

void setQuESTGpuThreadsPerBlock(const int NEW_TPB);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same Num) consideration as for getQuESTGpuThreadsPerBlock)



// end de-mangler
#ifdef __cplusplus
Expand Down
11 changes: 11 additions & 0 deletions quest/src/api/environment.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -509,5 +509,16 @@ void getEnvironmentString(char str[200]) {
}


int getQuESTGpuThreadsPerBlock() {
QuESTEnv env = getQuESTEnv();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note getQuESTEnv() is an API function with its own validation, and shouldn't be called internally like this since if its validation throws, it will claim the user called getQuESTEnv(), when they actually called getQuESTGpuThreadsPerBlock().

Should therefore first call

    validate_envIsInit(__func__);

and subsequently use globalEnvPtr directly.

(I see getEnvironmentString() calls getQuESTEnv() for some reason, when it too should just use globalEnvPtr)

return env.isGpuAccelerated? gpu_getNumThreadsPerBlock() : 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think this is a pitfall. If setQuESTGpuThreadsPerBlock() is permitted in non-GPU mode (and I really believe it should for healthy platform agnosticism), then the user would always get back 0 in lieu of what they had just passed to set. Maybe we should just always return gpu_getNumThreadsPerBlock().

The situation is slightly different to the GPU cache (fetchable by getGpuCacheSize() and clearable via clearGpuCache()), because that offers no setter. Users can always safely call both in non-GPU accelerated mode, and the former will return 0 (which is always correct).

}

void setQuESTGpuThreadsPerBlock(const int NEW_TPB) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NEW_TPB -> numThreadsPerBlock or numTPB, etc

// just rely on the internal function to throw an error if there's no GPU support compiled
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: validate this is a factor of 32 (and is positive, etc etc)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc to user: HIP warpsize is 64!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a better alternative: add gpu_isHipCompiled() in gpu_config.cpp, right under gpu_isCuQuantumCompiled(), as:

bool gpu_isHipCompiled() {
    return (bool) (COMPILE_CUDA && defined(__HIP__));
}

Then we can validate explicitly that when GPU-accelerated and we're on HIP, arg must be a multiple of 64, else of 32. This means 32 is required even when not GPU-accelerated; so we make that error message:

The number of threads per block must be a multiple of 32 (or on AMD GPUs, 64)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also call

validate_envIsInit(__func__);

gpu_setNumThreadsPerBlock(NEW_TPB);
return;
}

// end de-mangler
}
4 changes: 1 addition & 3 deletions quest/src/cpu/cpu_config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -79,9 +79,7 @@ int cpu_getAvailableNumThreads() {
#if COMPILE_OPENMP
int n = -1;

#pragma omp parallel shared(n)
#pragma omp single
n = omp_get_num_threads();
n = omp_get_max_threads();
Comment thread
TysonRayJones marked this conversation as resolved.

return n;
#else
Expand Down
19 changes: 19 additions & 0 deletions quest/src/gpu/gpu_config.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
#include "quest/src/gpu/cuda_to_hip.hpp"
#endif

int numThreadsPerBlock = 128;
Copy link
Copy Markdown
Member

@TysonRayJones TysonRayJones May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should give this a global_ prefix, like here (like I failed to do for hasGpuBeenBound, oops!)

Further, given it's not accessed anywhere outside gpu_(g|s)etNumThreadsPerBlock(), I would move this definition to the ENVIRONMENT MANAGEMENT section, just before gpu_getNumThreadsPerBlock().



/*
Expand Down Expand Up @@ -330,6 +331,24 @@ qindex gpu_getMaxNumConcurrentThreads() {
* ENVIRONMENT MANAGEMENT
*/

int gpu_getNumThreadsPerBlock() {
#if COMPILE_CUDA
return numThreadsPerBlock;
#else
error_gpuQueriedButGpuNotCompiled();
return -1;
#endif
}

void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock) {
#if COMPILE_CUDA
numThreadsPerBlock = newThreadsPerBlock;
#else
error_gpuQueriedButGpuNotCompiled();
#endif
return;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we permit users to call the corresponding API functions when GPU acceleration is not enabled, then these guards can be removed entirely. I think that's fair/natural, because we certainly shouldn't introduce an API difference between compiling but not running with GPU acceleration.

I would also comment this exception. So this could become:

int gpu_getNumThreadsPerBlock() {
    // permitted even when GPU backend not compiled
    return globlal_numThreadsPerBlock;
}

void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock) {
    // permitted even when GPU backend not compiled
    global_numThreadsPerBlock = newThreadsPerBlock;
}



std::array<char,17> getBoundGpuUuid() {
#if COMPILE_CUDA
Expand Down
8 changes: 5 additions & 3 deletions quest/src/gpu/gpu_config.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,6 @@
#include "quest/include/channels.h"



/*
* CUDA ERROR HANDLING
*/
Expand Down Expand Up @@ -65,6 +64,10 @@ qindex gpu_getMaxNumConcurrentThreads();
* ENVIRONMENT MANAGEMENT
*/

int gpu_getNumThreadsPerBlock();

void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock);

void gpu_bindLocalGPUsToNodes();

bool gpu_areAnyNodesBoundToSameGpu();
Expand All @@ -76,7 +79,6 @@ void gpu_initCuQuantum();
void gpu_finalizeCuQuantum();



/*
* MEMORY MANAGEMENT
*/
Expand Down Expand Up @@ -122,4 +124,4 @@ size_t gpu_getCacheMemoryInBytes();



#endif // GPU_CONFIG_HPP
#endif // GPU_CONFIG_HPP
8 changes: 2 additions & 6 deletions quest/src/gpu/gpu_kernels.cuh
Original file line number Diff line number Diff line change
Expand Up @@ -42,23 +42,19 @@
* THREAD MANAGEMENT
*/


const int NUM_THREADS_PER_BLOCK = 128;


__forceinline__ __device__ qindex getThreadInd() {
return blockIdx.x*blockDim.x + threadIdx.x;
}


__host__ qindex getNumBlocks(qindex numThreads) {
__host__ qindex getNumBlocks(qindex numThreads, const int numThreadsPerBlock) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the const qualifier since it's asymmetric and a bit counterproductive, because it makes the reader why why isn't numThreads const


/// @todo
/// improve this with cudaOccupancyMaxPotentialBlockSize(),
/// making it function specific

// CUDA ceil
return ceil(numThreads / static_cast<qreal>(NUM_THREADS_PER_BLOCK));
return ceil(numThreads / static_cast<qreal>(numThreadsPerBlock));
}


Expand Down
Loading
Loading