NVIDIA CUDA is an extension to the C programming language. The primitives introduced by CUDA are discussed in detail in this document.
Programming in CUDA is heterogeneous in nature: the parallelism offered by the GPU is exploited from a CPU, so we need to switch between the one and the other when programming GPUs. The following terminology is used in CUDA in this regard:
- CPU and its memory as Host
- GPU and its memory as Device
In CUDA, kernel code is the code that runs on the GPU. A kernel function is defined with the __global__ keyword in C. Refer to task \ref{sec:task1} for programming your first CUDA kernel.
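A minimal sketch of such a kernel and its launch (the kernel name here is illustrative):

```cuda
// A kernel is defined with the __global__ qualifier and runs on the device.
__global__ void myFirstKernel(void) {
    // device code goes here
}

int main(void) {
    // Launched from the host; <<<1, 1>>> means one block of one thread.
    myFirstKernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the device to finish
    return 0;
}
```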
Sequential and parallel codes complement each other, the former being synchronous and the latter asynchronous in nature. Refer to figure \ref{fig:progFlow} for a visual understanding.
Switching between host and device is done through memory:
- Host to Device: copy data in host memory to device memory.
- Device to Host: copy data in device memory to host memory.
Modern GPU architectures also offer a unified memory shared between device and host. Refer to figure \ref{fig:progModel} for the block diagram of the programmer's model for GPU programming.
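As a sketch, unified memory can be allocated with cudaMallocManaged, making the same pointer valid on both host and device without explicit copies (the size and names here are illustrative):

```cuda
#include <stdio.h>

__global__ void increment(int *data) {
    data[threadIdx.x] += 1;  // each thread updates one element
}

int main(void) {
    int *data;
    cudaMallocManaged(&data, 16 * sizeof(int));  // visible to host and device
    for (int i = 0; i < 16; i++) data[i] = i;

    increment<<<1, 16>>>(data);   // one block of 16 threads
    cudaDeviceSynchronize();      // ensure the kernel finished before reading

    printf("data[5] = %d\n", data[5]);
    cudaFree(data);
    return 0;
}
```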
The device memory management functions are the same as those of CPU code in C except for their names, which are prefixed with cuda as follows:
- cudaMalloc
- cudaMemcpy
- cudaMemset
- cudaFree
- cudaMemcpyHostToDevice (a cudaMemcpyKind value, not a function)
- cudaMemcpyDeviceToHost (a cudaMemcpyKind value, not a function)
The prototype of each is analogous to the following listing for cudaMemcpy:

cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, cudaMemcpyKind kind);

It returns cudaSuccess on success, and an error code (e.g. cudaErrorMemoryAllocation for a failed allocation) otherwise.
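A sketch of the typical host/device memory round trip using the functions listed above (the array size and names are illustrative):

```cuda
#include <stdio.h>

#define N 16

int main(void) {
    int host[N], result[N];
    int *dev;
    for (int i = 0; i < N; i++) host[i] = i;

    cudaMalloc((void **)&dev, N * sizeof(int));   // allocate on the device
    cudaMemcpy(dev, host, N * sizeof(int),
               cudaMemcpyHostToDevice);           // host -> device
    cudaMemcpy(result, dev, N * sizeof(int),
               cudaMemcpyDeviceToHost);           // device -> host
    cudaFree(dev);                                // release device memory

    printf("result[3] = %d\n", result[3]);
    return 0;
}
```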
GPU parallelism is exploited at two different layers. These layers, along with their constituents, are:
- Grid
- All threads spawned by a single kernel are collectively called a grid.
- A grid consists of three-dimensional blocks. Usually only two dimensions are used.
- Block
- A group of threads that can cooperate with each other.
- Blocks consist of three-dimensional threads.
When a kernel is launched on the device, it is executed by many threads. Organizing threads is the core of CUDA programming. The threads are divided among grids and blocks.
A thread is identified by its block ID and thread ID. Each of these coordinates is three-dimensional in nature. These IDs are defined within the kernel by CUDA itself and can be accessed through the following structures:
- blockIdx
- blockIdx.x
- blockIdx.y
- blockIdx.z
- threadIdx
- threadIdx.x
- threadIdx.y
- threadIdx.z
- blockDim: measured in threads.
- gridDim: measured in blocks.
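Inside a kernel, these structures combine to give each thread a unique global coordinate. A sketch (the kernel name is illustrative):

```cuda
__global__ void whereAmI(void) {
    // blockDim is measured in threads, gridDim in blocks.
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // global x-coordinate
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // global y-coordinate
    int totalX = gridDim.x * blockDim.x;            // total threads along x
}
```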
Kernel calls use triple angle brackets to define the grid and block dimensions. The following listing emphasizes the syntax for the figure above:
dim3 threadsPerBlock(4, 4); //equivalent to (4, 4, 1)
dim3 numBlocks(2,4); //equivalent to (2, 4, 1)
kernelFunction<<<numBlocks, threadsPerBlock>>>(arguments ...);

For simpler use cases, such as one-dimensional grids and blocks, the following syntax shall suffice:
#define numBlocks 10
#define threadsPerBlock 16
kernelFunction<<<numBlocks, threadsPerBlock>>>(arguments ...);

Indexing is the concept that lets the programmer write one generic kernel for all the threads. We learn this art throughout the lab tasks.
CUDA launches kernels on the GPU. These kernels are defined with the __global__ keyword; such functions can only run on the GPU. They are called with triple angle brackets, <<<x, y>>>. The significance of this syntax will be explained later.
The given routine hello.cu is simple enough to speak for itself, but it does not print anything, as this particular functionality is not available in its kernel. It is only meant to illustrate basic CUDA syntax.
Compile and run hello.cu and check that it builds and executes without errors.
GPU routines are often sensitive to the GPU architecture. Such architectural details can be queried through the cudaDeviceProp structure. The following routine demonstrates this.
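A minimal sketch of querying cudaDeviceProp (prop.cu may report more fields; only two are shown here):

```cuda
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);   // number of CUDA-capable GPUs
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, max threads per block: %d\n",
               i, prop.name, prop.maxThreadsPerBlock);
    }
    return 0;
}
```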
Compile and run prop.cu and observe the number of GPUs in your system and their specifications.
Given is a simple vector computation code executing sequentially on a CPU.
The same functionality as in the CPU vector code has been implemented here. The only difference is that the CUDA-enabled code has been parallelized through CUDA-defined extensions. The syntax <<<N,1>>> launches as many blocks as there are elements in the array, and blockIdx.x keeps track of the block ID. Testing the ID inside the kernel is expressed as a good programming practice for debugging.
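A sketch of a vector-addition kernel consistent with this description, i.e. launched as add<<<N, 1>>> with one block per element (the names and size are illustrative; refer to vec_gpu.cu for the actual code):

```cuda
#define N 512  // number of elements (illustrative)

__global__ void add(const int *a, const int *b, int *c) {
    int tid = blockIdx.x;         // this block's element index
    if (tid < N)                  // guard against stray IDs (debugging aid)
        c[tid] = a[tid] + b[tid];
}
```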
Compile and run vec_cpu.cu and vec_gpu.cu; observe the results.
Change blockIdx.x to threadIdx.x in line 9 of the GPU vector snippet, and replace <<<N,1>>> with <<<1,N>>> in line 38 as well. Compile and execute the code.
Observing the maximum thread dimensions allowed for your GPU in its properties, did you get the expected result? If not, what could be the reason?
Recommend the maximum thread and block dimensions for optimum parallel processing in GPUs.
Assume a kernel has been called with 3 blocks containing 8 threads each. Such a kernel call is expressed as <<<3, 8>>>. The simplest indexing technique for this kernel is shown in the following figure. It can be expressed as the equation:
Index = Block ID x Block Size + Thread ID
Call the kernel with the following snippet now (note that ceil has no effect on an integer quotient, so we round up explicitly):

#define threadsPerBlock 64
compute<<<(N + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(dev_a, dev_b, dev_c);

and change the indexing technique to
// indexing with block ID and thread ID combined
int i = blockIdx.x * blockDim.x + threadIdx.x;

Refer to the code snippet multSq.cu.
The kernel call is a combination of threads and blocks. Blocks are responsible for indexing rows while threads index columns. The block index and thread index are utilized such that each thread computes one element of the product matrix. The following figure illustrates this quite elegantly.
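A sketch of this indexing scheme for an N x N product C = A x B, with blockIdx.x selecting the row and threadIdx.x the column (the names and size are illustrative; refer to multSq.cu for the actual code):

```cuda
#define N 16  // square matrix dimension (illustrative)

__global__ void matMul(const float *A, const float *B, float *C) {
    int row = blockIdx.x;   // blocks index rows
    int col = threadIdx.x;  // threads index columns
    float sum = 0.0f;
    for (int k = 0; k < N; k++)            // dot product of row and column
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;                // one element of C per thread
}
```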
Compile the code in multSq.cu and observe the output. Verify it using MATLAB or any other tool available.
Extend the code snippet in multSq.cu to rectangular matrix multiplication.




