CUDA and Graphics Cards
From csi702
Contents |
1 CUDA Overview
CUDA is a set of c libraries and language extensions and a specialized compiler which can be used to program NVIDA graphics processing units (GPU). CUDA is being used in such diverse fields as: Computational Geoscience, Computational Chemistry, Computational Biology, Computational Modeling, Computational Medicine, Image Processing, and Computational Finance[1]. CUDA algorithms can run multiple orders of magnitude faster than similar algorithms run on CPUs:
| Application | Speedup |
|---|---|
| Interactive visualization of volumetric white matter connectivity | 146 |
| Isotropic turbulence simulation in Matlab | 17 |
| Highly optimized object oriented molecular dynamics | 24 |
| Financial simulation of LIBOR Model with swaptions | 149 |
| Astrophysics nbody simulation | 100 |
| Cmatch exact string matching for gene sequencing | 30 |
2 GPU Computing History
GPUs were first used for general purpose computation around 1999-2000 in order to take advantage of their superior floating point performance. However, no specialized languages like CUDA existed at that time, which meant scientific codes had to be written in graphics languages like OpenGL and algorithms had to be mapped to graphics operations (textures, vertex transformations, drawing polygons and triangle, etc...). The CUDA architecture is NVIDIA's attempt to provide a true general purpose programming language for general purpose graphics processing unit programming (GPGPU).
3 GPU Architecture
Modern CPUs are multi-core devices with (typically) 4 to 8 cores. Modern GPUs are many-core devices with as many as 512 cores.
4 CUDA Language
CUDA code written to execute on the GPU is grouped into kernels, which are executed in parallel on the GPU by hundreds of individual light weight threads. Each thread executes the same code and has access to a thread id which it uses to calculate array indexes and make flow control decisions. On the surface, this approach is similar to MPI, where individual nodes execute the same code and user their node id to determine what data to access and what actions to perform.
Hardware to synchronize and share data between individual cores is complicated and expensive, especially in GPUs with hundreds of cores. To make this problem tractable, CUDA groups kernel threads into blocks of up to 512 threads. The threads in an individual blocks can pass data through fast shared memory, and can synchronize execution using __syncthreads(). In addition to simplifying chip hardware, this design makes CUDA algorithms highly scalable. As GPUs are created with more and more cores, additional blocks of threads can be launched to achieve higher degrees of parallelism without affecting the underlying algorithm.
4.1 Memory Hierarchy
CUDA uses three levels of memory. The smallest and fastest are registers. Registers are accessible only by a single thread, are stored directly on the chip, and offer very fast access time. In CUDA kernels, local variables are automatically stored in registers with no additional syntax required. However, an array of more than four elements will be stored in global memory instead of in registers!
__global__ void kernel( ) { int i = 0; // i will be stored in a GPU register }
CUDA shared memory also exists directly on the chip and allows data to be passed within threads in the same block. Access times for shared memory are comparable to registers, but like registers, the amount of shared memory available is limited.
__global__ void kernel( ) { // i is a 10 element array stored in shared memory // each block will have an independent copy of i // which the threads within that block will share __shared__ int i[10]; }
Finally, global memory exists off the chip and is accessible by both the GPU and CPU. Function parameters passed to the kernel are stored in global memory and the GPU reports results back to the GPU by writing to global memory. CUDA then provides function calls which allow the CPU to copy data from global memory into the CPU's address space, where it can be accessed like any other pointer.
Memory allocated with cudaMalloc() is automatically stored in global memory. Variables can be explicitly stored in global memory using the __device__ qualifier before the type.
void main( ) { int N = 25600; int blockSize = 128; int size = sizeof(int) * N ; // allocates an array of N integers in GPU global memory int *device_array; cudaMalloc( (void **) &device_array , size ); // allocates an array of N integers in CPU memory int *host_array; host_array (int *) malloc( size ); // execute the kernel, passing a reference to the global memory array kernel<<< N / blockSize , blockSize >>>( device_array ); // the kernel has modified the contents of the global memory array // to see those modifications we must copy the array into CPU (host) memory cudaMemcpy(host_array, device_array, size, cudaMemcpyDeviceToHost); // free the memory we allocated, both on the CPU and in global memory on the GPU cudaFree( device_array ); free( host_array ); } __global__ void kernel( int *array ) { // index i into the global memory array is stored in a register int i = blockIdx.x * blockDim.x + threadIdx.x; // accessing an element of the global memory array is hundreds of // times slower than accessing a register or shared memory array[ i ] = i; }
4.2 Array Access
Kernel functions often want each thread to load a piece of data from a large array in global memory. To this end, CUDA functions have access to intrinsic variables that describe the block id the thread is part of, the id of the thread within that block, and the block size specified when the kernel was called. Using these values, the thread can assign itself an index in a large array in global memory:
int i = blockDim.x * blockIdx.x + threadIdx.x;
For a two dimensional grid of blocks, the calculation is similar:
int i = blockDim.y * ( blockDim.x * blockIdx.x + threadIdx.x ) + threadIdx.y;
5 Improving Performance in CUDA
Three primary factors are responsible for improvements in efficiency of CUDA code.
- fine-grain parallelism
- minimizing CPU-GPU memory transfer
- coalescing memory access
Fine-Grain Parellelism refers to the breaking up of a task into smaller tasks which can run in parallel, where each thread typically runs less than a few hundred commands.
Minimizing CPU-GPU memory transfer is critical, because latency issues with transfer between internal memories can negate any speedups gained from making the code run in parallel. The transfer speed across the bus can be as low as 10 Gb/s where as internally GPUs can achieve a bandwidth in excess of 100 Gb/s. As a result, even one bottleneck in the code can completely negate any performance benefits. As a result, it is generally a good idea to port all code that can be parallelized to the GPU and not just portions that are causing the current performance bottleneck. Specific commands are written into the code to transfer information from the CPU (host) to the GPU (device), and back again.
Coalescing memory access On CPUs memory in cache can be accessed much faster than memory that is being stored in RAM. Because GPUs do not typically have a high speed shared cache. The concept of coalescing or grouping memory access is very important. Below is an example of coalesced memory access.
Bad memory Access
__global__ void compute_vel(float* dns, float* mtm, float* vel) { int i = (blockDim.x*blockIdx.x + threadIdx.x); float dns_i = dns[i]; for(int j = 0; j < ndim; j++) vel[i*ndim+j] = mtm[i*ndim+j] / dns_i; }
Coalesced Memory Access
__global__ void compute_vel(int N, float* dns, float* mtm, float*vel) { int i = blockDim.x*blockIdx.x + threadIdx.x; float dns_i = dns[i]; for(int j = 0; j < ndim; j++) vel[i+j*N] = mtm[i+j*N] / dns_i; }
Other Performance Considerations
- Use shared/constant/texture memory whenever possible
- Overlap kernel execution with data transfer using streams
5.1 CUDA Performance examples
| |
| |
6 Thrust Library
Writing efficient parallel code for non-trivial or non embarrassingly parallel problems can be quite difficult in CUDA. Understanding of the GPU hardware architecture (the lack of caching in global memory or the way threads are physically executed as warps of 32 threads, for example) are critical for achieving good performance. The thrust library attempts to simplify CUDA development by providing efficient implementations for common tasks which are quite difficult to perform efficiently in CUDA [1].
Among other algorithms, thrust provides random number generation, parallel reductions, and parallel sort.


