CUDA and Graphics Cards

From csi702

Jump to: navigation, search


Contents

  1. CUDA Overview
  2. GPU Computing History
  3. GPU Architecture
  4. CUDA Language
    1. Memory Hierarchy
    2. Array Access
  5. Improving Performance in CUDA
    1. CUDA Performance examples
  6. Thrust Library
  7. Links & References

1 CUDA Overview

CUDA is a set of c libraries and language extensions and a specialized compiler which can be used to program NVIDA graphics processing units (GPU). CUDA is being used in such diverse fields as: Computational Geoscience, Computational Chemistry, Computational Biology, Computational Modeling, Computational Medicine, Image Processing, and Computational Finance[1]. CUDA algorithms can run multiple orders of magnitude faster than similar algorithms run on CPUs:

ApplicationSpeedup
Interactive visualization of volumetric white matter connectivity146
Isotropic turbulence simulation in Matlab17
Highly optimized object oriented molecular dynamics24
Financial simulation of LIBOR Model with swaptions149
Astrophysics nbody simulation100
Cmatch exact string matching for gene sequencing30

2 GPU Computing History

GPUs were first used for general purpose computation around 1999-2000 in order to take advantage of their superior floating point performance. However, no specialized languages like CUDA existed at that time, which meant scientific codes had to be written in graphics languages like OpenGL and algorithms had to be mapped to graphics operations (textures, vertex transformations, drawing polygons and triangle, etc...). The CUDA architecture is NVIDIA's attempt to provide a true general purpose programming language for general purpose graphics processing unit programming (GPGPU).

3 GPU Architecture

Modern CPUs are multi-core devices with (typically) 4 to 8 cores. Modern GPUs are many-core devices with as many as 512 cores.

CPU GPU Hardware Comparison
CPU GPU Hardware Comparison [1]

4 CUDA Language

CUDA code written to execute on the GPU is grouped into kernels, which are executed in parallel on the GPU by hundreds of individual light weight threads. Each thread executes the same code and has access to a thread id which it uses to calculate array indexes and make flow control decisions. On the surface, this approach is similar to MPI, where individual nodes execute the same code and user their node id to determine what data to access and what actions to perform.

Hardware to synchronize and share data between individual cores is complicated and expensive, especially in GPUs with hundreds of cores. To make this problem tractable, CUDA groups kernel threads into blocks of up to 512 threads. The threads in an individual blocks can pass data through fast shared memory, and can synchronize execution using __syncthreads(). In addition to simplifying chip hardware, this design makes CUDA algorithms highly scalable. As GPUs are created with more and more cores, additional blocks of threads can be launched to achieve higher degrees of parallelism without affecting the underlying algorithm.

4.1 Memory Hierarchy

CUDA uses three levels of memory. The smallest and fastest are registers. Registers are accessible only by a single thread, are stored directly on the chip, and offer very fast access time. In CUDA kernels, local variables are automatically stored in registers with no additional syntax required. However, an array of more than four elements will be stored in global memory instead of in registers!

__global__ void kernel( )  
{
  int i = 0; // i will be stored in a GPU register
}

CUDA shared memory also exists directly on the chip and allows data to be passed within threads in the same block. Access times for shared memory are comparable to registers, but like registers, the amount of shared memory available is limited.

__global__ void kernel( )  
{
  // i is a 10 element array stored in shared memory
  // each block will have an independent copy of i
  // which the threads within that block will share
  __shared__ int i[10];
}

Finally, global memory exists off the chip and is accessible by both the GPU and CPU. Function parameters passed to the kernel are stored in global memory and the GPU reports results back to the GPU by writing to global memory. CUDA then provides function calls which allow the CPU to copy data from global memory into the CPU's address space, where it can be accessed like any other pointer.

Memory allocated with cudaMalloc() is automatically stored in global memory. Variables can be explicitly stored in global memory using the __device__ qualifier before the type.

void main( )
{
  int N = 25600;
  int blockSize = 128;
 
  int size = sizeof(int) * N ;
 
  // allocates an array of N integers in GPU global memory
  int *device_array;
  cudaMalloc( (void **) &device_array , size );
 
  // allocates an array of N integers in CPU memory
  int *host_array;
  host_array (int *) malloc( size );
 
  // execute the kernel, passing a reference to the global memory array
  kernel<<< N / blockSize , blockSize >>>( device_array );
 
  // the kernel has modified the contents of the global memory array
  // to see those modifications we must copy the array into CPU (host) memory
  cudaMemcpy(host_array, device_array, size, cudaMemcpyDeviceToHost);
 
  // free the memory we allocated, both on the CPU and in global memory on the GPU
  cudaFree( device_array );
  free( host_array );
}
 
__global__ void kernel( int *array )  
{
  // index i into the global memory array is stored in a register
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // accessing an element of the global memory array is hundreds of
  // times slower than accessing a register or shared memory
  array[ i ] = i;
}

4.2 Array Access

Kernel functions often want each thread to load a piece of data from a large array in global memory. To this end, CUDA functions have access to intrinsic variables that describe the block id the thread is part of, the id of the thread within that block, and the block size specified when the kernel was called. Using these values, the thread can assign itself an index in a large array in global memory:

int i = blockDim.x * blockIdx.x + threadIdx.x;

For a two dimensional grid of blocks, the calculation is similar:

int i = blockDim.y * ( blockDim.x * blockIdx.x + threadIdx.x ) + threadIdx.y;

5 Improving Performance in CUDA

Three primary factors are responsible for improvements in efficiency of CUDA code.

  • fine-grain parallelism
  • minimizing CPU-GPU memory transfer
  • coalescing memory access

Fine-Grain Parellelism refers to the breaking up of a task into smaller tasks which can run in parallel, where each thread typically runs less than a few hundred commands.

Minimizing CPU-GPU memory transfer is critical, because latency issues with transfer between internal memories can negate any speedups gained from making the code run in parallel. The transfer speed across the bus can be as low as 10 Gb/s where as internally GPUs can achieve a bandwidth in excess of 100 Gb/s. As a result, even one bottleneck in the code can completely negate any performance benefits. As a result, it is generally a good idea to port all code that can be parallelized to the GPU and not just portions that are causing the current performance bottleneck. Specific commands are written into the code to transfer information from the CPU (host) to the GPU (device), and back again.

Coalescing memory access On CPUs memory in cache can be accessed much faster than memory that is being stored in RAM. Because GPUs do not typically have a high speed shared cache. The concept of coalescing or grouping memory access is very important. Below is an example of coalesced memory access.

Bad memory Access

__global__ void compute_vel(float* dns, float* mtm, float* vel)
{
    int i = (blockDim.x*blockIdx.x + threadIdx.x);
    float dns_i = dns[i];
    for(int j = 0; j < ndim; j++)
        vel[i*ndim+j] = mtm[i*ndim+j] / dns_i;
}

Coalesced Memory Access

__global__ void compute_vel(int N, float* dns, float* mtm, float*vel)
{
    int i = blockDim.x*blockIdx.x + threadIdx.x;
    float dns_i = dns[i];
    for(int j = 0; j < ndim; j++)
        vel[i+j*N] = mtm[i+j*N] / dns_i;
}

Image:CoalescedSpeed.png

Other Performance Considerations

  • Use shared/constant/texture memory whenever possible
  • Overlap kernel execution with data transfer using streams

5.1 CUDA Performance examples

NACA0012 Air Foil
  • Single Precision Benchmarks
    • GPU Recompute vs. Precompute: 3.9x
    • GPU vs. Quad-Core: 9.4x
    • GPU vs. Single-Core: 32.6x
    • Quad-Core vs. Single-Core: 3.5x
  • Double Precision Benchmarks
    • GPU Recompute vs. Precompute: 1.1x
    • GPU vs. Quad-Core: 1.56x
    • GPU vs. Single-Core: 4.7x
    • Quad-Core vs Single- Core: 3x
Missile
  • Single Precision Benchmarks
    • GPU Recompute vs. Precompute: 3.4x
    • GPU vs. Quad-Core: 9.9x
    • GPU vs. Single-Core: 33.6x
    • Quad-Core vs. Single-Core: 3.4x
  • Double Precision Benchmarks
    • GPU Recompute vs. Precompute: 1.63x
    • GPU vs. Quad-Core: 2.5x
    • GPU vs. Single-Core: 7.4x
    • Quad-Core vs Single-Core: 3x


6 Thrust Library

Writing efficient parallel code for non-trivial or non embarrassingly parallel problems can be quite difficult in CUDA. Understanding of the GPU hardware architecture (the lack of caching in global memory or the way threads are physically executed as warps of 32 threads, for example) are critical for achieving good performance. The thrust library attempts to simplify CUDA development by providing efficient implementations for common tasks which are quite difficult to perform efficiently in CUDA [1].

Among other algorithms, thrust provides random number generation, parallel reductions, and parallel sort.

7 Links & References

Personal tools