Graphics Processing Unit (GPU) Programming in CUDA

The Graphics Processing Unit (GPU) is not just for graphics anymore: it's a fully programmable general-purpose computer with incredibly high parallelism.  Today's cutting-edge language for GPU programming is CUDA, from NVIDIA, which basically feels like C++ with a few GPU keywords added.

The idea is you use the GPU alongside the CPU, so a typical CUDA program has this division of labor:

Normal CPU, the "__host__":
- Runs main, in the usual way
- Reads files
- Talks on the network
- Allocates memory for both sides
- Invokes GPU kernels

Graphics Card, the "__device__":
- Runs special kernel functions, using blocks of threads
- Delivers high performance on compute-intensive tasks

The syntax looks like this:

#include <iostream>
#include <cuda.h>

/* GPU kernel: set an array to a value.  Run with many threads. */
__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
	vals[i]=i+param;
}

int main(int argc,char *argv[]) 
{
// Allocate space shared between CPU and GPU
	int n=16; // total number of floats
	float *vals; // shared array of n values 
	cudaMallocManaged( &vals, n*sizeof(float) ); 

// Run "GPU kernel" on shared space
	set_array<<<1,n>>>(vals,0.1234); /* run kernel on 1 x n GPU threads */ 

	cudaDeviceSynchronize(); /* Wait for kernel to finish filling vals array */

// Show results
	for (int i=0;i<n;i++)
		std::cout<<"vals["<<i<<"] = "<<vals[i]<<"\n";

	cudaFree(vals); /* release the managed allocation */
	return 0;
}

(Try this in NetRun now!)

You use cudaMallocManaged to allocate any data structures shared between CPU and GPU, call a parallel GPU kernel using a special <<<numBlocks, threadsPerBlock>>> syntax, and call cudaDeviceSynchronize to join all the GPU threads afterward.
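
Every CUDA runtime call also returns a cudaError_t that's worth checking, and errors from a kernel launch show up via cudaGetLastError.  Here's a minimal error-checking sketch (the checkCuda helper is just an illustrative name, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda.h>

/* Illustrative helper: abort with a message if a CUDA call failed. */
void checkCuda(cudaError_t err,const char *what) {
	if (err!=cudaSuccess) {
		std::fprintf(stderr,"CUDA error in %s: %s\n",what,cudaGetErrorString(err));
		std::exit(1);
	}
}

// Usage, wrapping the calls from the example above:
//   checkCuda( cudaMallocManaged(&vals,n*sizeof(float)), "cudaMallocManaged" );
//   set_array<<<1,n>>>(vals,0.1234);
//   checkCuda( cudaGetLastError(), "kernel launch" );
//   checkCuda( cudaDeviceSynchronize(), "set_array kernel" );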

The example above uses an array of floats, which the GPU prefers, but the GPU also supports integer operations, user-defined classes, and even templates.  However, standard library facilities such as std::string, std::cout, and std::vector are not yet supported on the GPU in CUDA.  Also, double precision is much slower than float on gamer GPUs (NVIDIA sells more expensive Tesla cards with better double performance).
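
For example, here's a sketch of a templated kernel (set_array_T is an illustrative name, not from the example above) that works for any arithmetic type:

#include <iostream>
#include <cuda.h>

/* Templated GPU kernel: set an array of any arithmetic type T. */
template <typename T>
__global__ void set_array_T(T *vals,T param) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
	vals[i]=T(i)+param;
}

int main() {
	int n=16;
	int *ivals; // an int array this time, instead of floats
	cudaMallocManaged( &ivals, n*sizeof(int) );
	set_array_T<<<1,n>>>(ivals,100); // compiler deduces T = int
	cudaDeviceSynchronize();
	for (int i=0;i<n;i++)
		std::cout<<"ivals["<<i<<"] = "<<ivals[i]<<"\n";
	cudaFree(ivals);
	return 0;
}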

For a real program, running 16 array elements will never be faster than running on the CPU--it's too small to overcome the memory copying and kernel startup latency.  Typically you'd run a kernel over a large array using a set of "thread blocks" of 256 threads each, like this:

	...
	int n=4*1024*1024; // total number of floats

	int blockDim=256; // threads/block, should be 256 for best performance
	int nBlocks=(n+blockDim-1)/blockDim; // GPU thread blocks to run
	set_array<<<nBlocks,blockDim>>>(vals,0.1234); /* run kernel on n GPU threads */
	...

(Try this in NetRun now!)
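
Note that nBlocks is rounded up, so if n were not an exact multiple of blockDim (here it is: 4M is divisible by 256), the last block would contain a few threads past the end of the array.  A guarded variant of the kernel (set_array_n is an illustrative name) handles arbitrary sizes:

/* Variant kernel with a bounds check, for array sizes that
   aren't a multiple of the block size. */
__global__ void set_array_n(float *vals,float param,int n) {
	int i=threadIdx.x + blockIdx.x*blockDim.x;
	if (i<n) // extra threads in the last (partial) block do nothing
		vals[i]=i+param;
}

// launch: set_array_n<<<nBlocks,blockDim>>>(vals,0.1234,n);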

If you time the various parts of this code, you find:

Startup: 112.644 ms
Memory Allocation: 3.90291 ms
Kernel: 2.94685 ms
vals[7] = 7.1234
Readback: 0.00691414 ms

(Try this in NetRun now!)

Note that the kernel fills 16 megs of floats in about 3 milliseconds, a rate of over 5 gigabytes per second.  This is actually still not very good performance, due to the overhead of accessing unified memory across the PCI-Express bus!  (You can get better performance using the GPU-only cudaMalloc and an explicit cudaMemcpy.)
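
Here's a sketch of that explicit-copy style, reusing the set_array kernel from above: the array lives only in GPU memory while the kernel runs, and one cudaMemcpy (which also waits for the kernel to finish) brings the results back.

	int n=4*1024*1024;
	float *d_vals; // device-only array, in fast GPU memory
	cudaMalloc( &d_vals, n*sizeof(float) );

	set_array<<<(n+255)/256,256>>>(d_vals,0.1234);

	float *h_vals=new float[n]; // ordinary CPU-side array
	cudaMemcpy( h_vals, d_vals, n*sizeof(float), cudaMemcpyDeviceToHost );

	// ... use h_vals on the CPU ...
	cudaFree(d_vals);
	delete[] h_vals;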

Despite the relatively slow access to CPU shared memory, the GPU's arithmetic performance is truly incredible.  For example, making the kernel compute sin(exp(x)) ten times per element does not even impact performance--the GPU can do all that arithmetic while still waiting for memory!

__global__ void set_array(float *vals,float param) {
	int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
	float x=i+param;
	for (int reps=0;reps<10;reps++) x=sin(exp(x)); // ten rounds of transcendental math
	vals[i]=x;
}

(Try this in NetRun now!)

Running the same code on a CPU shows the same arithmetic takes hundreds of times longer:

float x=bar; // bar is the input value
for (int reps=0;reps<10;reps++) x=sin(exp(x));
return x;

(Try this in NetRun now!)

However, we're not using SIMD or multicore here.  For most problems I find that with enough effort, combining SSE or AVX with OpenMP, I can get the CPU version to perform competitively with the GPU version--but by that point the CUDA code is actually cleaner and easier to write!
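
For a sense of what the multicore half of that looks like, here's a minimal OpenMP sketch of the same computation (set_array_cpu is an illustrative name; compile with -fopenmp).  The hand-vectorized SSE/AVX version takes considerably more code.

#include <cmath>

/* CPU version of the kernel: one loop iteration per array element,
   split across cores by OpenMP. */
void set_array_cpu(float *vals,float param,int n) {
#pragma omp parallel for
	for (int i=0;i<n;i++) {
		float x=i+param;
		for (int reps=0;reps<10;reps++) x=std::sin(std::exp(x));
		vals[i]=x;
	}
}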

Other GPU Programming Languages

CUDA is a huge language with tons of other features we don't have time to cover here--take CS 441 to get the full details.  One downside of CUDA is that it's only available on NVIDIA GPUs, not AMD's or Intel's increasingly capable GPUs.

There are a number of other GPU programming languages available, most of which are cross platform, including:

- OpenCL, an open cross-vendor standard with a programming model similar to CUDA's
- GLSL, OpenGL's shading language
- HLSL, Direct3D's shading language (Windows only)
- OpenACC, a directive-based approach similar in spirit to OpenMP
CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.