-
Notifications
You must be signed in to change notification settings - Fork 174
Set gpu tpb #736
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Set gpu tpb #736
Changes from all commits
c150e0b
9b8ddd1
680fdda
af1e5cb
aaf5681
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -83,6 +83,14 @@ int isQuESTEnvInit(); | |
| QuESTEnv getQuESTEnv(); | ||
|
|
||
|
|
||
| /** @notyetdoced | ||
| * GPU thread per block control | ||
| * This is somehow probably the best pre-existing place for this. It only really applies to GPU, because for | ||
| * OpenMP the user can just export OMP_NUM_THREADS or call omp_set_num_threads. | ||
| */ | ||
| int getQuESTGpuThreadsPerBlock(); | ||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Same |
||
|
|
||
|
|
||
| // end de-mangler | ||
| #ifdef __cplusplus | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -509,5 +509,16 @@ void getEnvironmentString(char str[200]) { | |
| } | ||
|
|
||
|
|
||
| int getQuESTGpuThreadsPerBlock() { | ||
| QuESTEnv env = getQuESTEnv(); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note Should therefore first call and subsequently use (I see |
||
| return env.isGpuAccelerated? gpu_getNumThreadsPerBlock() : 0; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hmm I think this is a pitfall. If The situation is slightly different to the GPU cache (fetchable by |
||
| } | ||
|
|
||
| void setQuESTGpuThreadsPerBlock(const int NEW_TPB) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| // just rely on the internal function to throw an error if there's no GPU support compiled | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: validate this is a factor of 32 (and is positive, etc etc)
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Doc to user: HIP warpsize is 64!
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe a better alternative: add bool gpu_isHipCompiled() {
return (bool) (COMPILE_CUDA && defined(__HIP__));
}Then we can validate explicitly that when GPU-accelerated and we're on HIP, arg must be a multiple of 64, else of 32. This means 32 is required even when not GPU-accelerated; so we make that error message:
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should also call |
||
| gpu_setNumThreadsPerBlock(NEW_TPB); | ||
| return; | ||
| } | ||
|
|
||
| // end de-mangler | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -41,6 +41,7 @@ | |
| #include "quest/src/gpu/cuda_to_hip.hpp" | ||
| #endif | ||
|
|
||
| int numThreadsPerBlock = 128; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should give this a Further, given it's not accessed anywhere outside |
||
|
|
||
|
|
||
| /* | ||
|
|
@@ -330,6 +331,24 @@ qindex gpu_getMaxNumConcurrentThreads() { | |
| * ENVIRONMENT MANAGEMENT | ||
| */ | ||
|
|
||
| int gpu_getNumThreadsPerBlock() { | ||
| #if COMPILE_CUDA | ||
| return numThreadsPerBlock; | ||
| #else | ||
| error_gpuQueriedButGpuNotCompiled(); | ||
| return -1; | ||
| #endif | ||
| } | ||
|
|
||
| void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock) { | ||
| #if COMPILE_CUDA | ||
| numThreadsPerBlock = newThreadsPerBlock; | ||
| #else | ||
| error_gpuQueriedButGpuNotCompiled(); | ||
| #endif | ||
| return; | ||
| } | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we permit users to call the corresponding API functions when GPU acceleration is not enabled, then these guards can be removed entirely. I think that's fair/natural, because we certainly shouldn't introduce an API difference between compiling but not running with GPU acceleration. I would also comment this exception. So this could become: int gpu_getNumThreadsPerBlock() {
// permitted even when GPU backend not compiled
return globlal_numThreadsPerBlock;
}
void gpu_setNumThreadsPerBlock(const int newThreadsPerBlock) {
// permitted even when GPU backend not compiled
global_numThreadsPerBlock = newThreadsPerBlock;
} |
||
|
|
||
|
|
||
| std::array<char,17> getBoundGpuUuid() { | ||
| #if COMPILE_CUDA | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -42,23 +42,19 @@ | |
| * THREAD MANAGEMENT | ||
| */ | ||
|
|
||
|
|
||
| const int NUM_THREADS_PER_BLOCK = 128; | ||
|
|
||
|
|
||
| __forceinline__ __device__ qindex getThreadInd() { | ||
| return blockIdx.x*blockDim.x + threadIdx.x; | ||
| } | ||
|
|
||
|
|
||
| __host__ qindex getNumBlocks(qindex numThreads) { | ||
| __host__ qindex getNumBlocks(qindex numThreads, const int numThreadsPerBlock) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would remove the |
||
|
|
||
| /// @todo | ||
| /// improve this with cudaOccupancyMaxPotentialBlockSize(), | ||
| /// making it function specific | ||
|
|
||
| // CUDA ceil | ||
| return ceil(numThreads / static_cast<qreal>(NUM_THREADS_PER_BLOCK)); | ||
| return ceil(numThreads / static_cast<qreal>(numThreadsPerBlock)); | ||
| } | ||
|
|
||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include
Numsomewhere, e.g.getNumQuESTGpuThreadsPerBlockgetQuESTNumGpuThreadsPerBlockI've so far tried to avoid abbreviating where feasible.