The Graphics Processing Unit (GPU) is not just for graphics anymore: it's a fully programmable general-purpose computer with incredibly high parallelism. Today's cutting-edge language for GPU programming is CUDA, from NVIDIA, which feels like C++ with a few GPU keywords added.
The idea is you use the GPU alongside the CPU, so a typical CUDA program has this division of labor:
Normal CPU, the "__host__" | Graphics Card, the "__device__"
Runs main, in the usual way; reads files; talks on network; allocates memory for both sides; invokes GPU kernels | Runs special kernel functions, using blocks of threads; delivers high performance on compute-intensive tasks
The syntax looks like this:
#include <iostream>
#include <cuda.h>

/* GPU kernel: set an array to a value. Run with many threads. */
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
    vals[i]=i+param;
}

int main(int argc,char *argv[]) {
    // Allocate space shared between CPU and GPU
    int n=16; // total number of floats
    float *vals; // shared array of n values
    cudaMallocManaged( &vals, n*sizeof(float) );

    // Run "GPU kernel" on shared space
    set_array<<<1,n>>>(vals,0.1234); /* run kernel on 1 x n GPU threads */
    cudaDeviceSynchronize(); /* Wait for kernel to finish filling vals array */

    // Show results
    for (int i=0;i<n;i++)
        std::cout<<"vals["<<i<<"] = "<<vals[i]<<"\n";
    return 0;
}
You use cudaMallocManaged to allocate any data structures shared between CPU and GPU, call a parallel GPU kernel using a special <<<numBlocks, threadsPerBlock>>> syntax, and call cudaDeviceSynchronize to join all the GPU threads afterward.
The example above uses an array of floats, the data type GPUs handle fastest, but the GPU also supports integer operations, user-defined classes, and even templates. However, standard library facilities such as std::string, std::cout, and std::vector are not yet supported on the GPU in CUDA. Also, double precision is much slower than float on consumer gaming GPUs (NVIDIA sells more expensive Tesla cards with better double performance).
For a real program, processing only 16 array elements on the GPU will never be faster than doing the work on the CPU--the problem is too small to overcome the memory copying and kernel startup latency. Typically you'd run a kernel over a large array using a set of "thread blocks" of 256 threads each, like this:
...
int n=4*1024*1024; // total number of floats
int blockDim=256; // threads/block; 256 is a common, usually efficient choice
int nBlocks=(n+blockDim-1)/blockDim; // GPU thread blocks to run
set_array<<<nBlocks,blockDim>>>(vals,0.1234); /* run kernel on n GPU threads */
...
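The launch arithmetic above can be checked on the CPU. The following is a minimal C++ sketch, not real CUDA: the two nested loops stand in for blockIdx.x and threadIdx.x, and the helper names blocks_needed and simulate_set_array are hypothetical. The `if (i < n)` guard is an addition that the original kernel does not have; the original is safe only because n=4*1024*1024 is an exact multiple of 256, so no thread in the last block runs past the end of the array.

```cpp
#include <vector>

// Hypothetical helper: number of thread blocks needed to cover n elements,
// rounding up (the same formula used for nBlocks above).
int blocks_needed(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU-only stand-in for launching set_array<<<nBlocks,threadsPerBlock>>>.
// The loops play the roles of blockIdx.x and threadIdx.x; this only
// illustrates the index math and says nothing about GPU performance.
std::vector<float> simulate_set_array(int n, int threadsPerBlock, float param) {
    std::vector<float> vals(n);
    int nBlocks = blocks_needed(n, threadsPerBlock);
    for (int block = 0; block < nBlocks; block++) {                // blockIdx.x
        for (int thread = 0; thread < threadsPerBlock; thread++) { // threadIdx.x
            int i = thread + block * threadsPerBlock;              // global index
            if (i < n)  // guard: the last block may extend past the array
                vals[i] = i + param;
        }
    }
    return vals;
}
```

For example, blocks_needed(1000, 256) rounds up to 4 blocks, which is 1024 threads; the guard keeps the 24 extra threads from writing outside the array.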
If you time the various parts of this code, you find:
Startup: 112.644 ms
Memory Allocation: 3.90291 ms
Kernel: 2.94685 ms
widget[7] = 7.1234
Readback: 0.00691414 ms
Note that the kernel fills 16 megabytes of floats (4M floats at 4 bytes each) in about 3 milliseconds, a rate of more than 5 gigabytes per second. This is actually still not very good performance, due to the overhead of accessing unified memory across the PCI-Express bus! (You can get better performance using the GPU-only cudaMalloc and explicit cudaMemcpy.)
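The bandwidth figure follows directly from the array size and the measured kernel time. As a small sketch of that arithmetic (the helper name bandwidth_GBps is hypothetical, and 1 GB is taken as 10^9 bytes):

```cpp
// Hypothetical helper: effective bandwidth in gigabytes per second
// (1 GB = 1e9 bytes) for writing `count` floats in `milliseconds`.
double bandwidth_GBps(long long count, double milliseconds) {
    double bytes = (double)count * sizeof(float); // 4 bytes per float
    double seconds = milliseconds * 1.0e-3;
    return bytes / seconds * 1.0e-9;
}
```

Calling bandwidth_GBps(4*1024*1024, 2.94685) with the kernel time measured above gives roughly 5.7 GB/s.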
Despite the relatively slow access to CPU shared memory, the GPU arithmetic performance is truly incredible. For example, making the kernel compute a sin/exponent shift ten times does not even impact performance--the GPU can do all that arithmetic while still waiting for memory!
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x + blockIdx.x*blockDim.x; // <- my thread index
    float x=i+param;
    for (int reps=0;reps<10;reps++)
        x=sin(exp(x));
    vals[i]=x;
}
Running the same code on a CPU shows the same arithmetic takes hundreds of times longer:
float x=bar;
for (int reps=0;reps<10;reps++)
    x=sin(exp(x));
return x;
However, we're not using SIMD or multicore here. For most problems I find if I apply enough effort, combining SSE or AVX with OpenMP, I can get the CPU version to perform competitively with the GPU version, but by that point the CUDA code is actually cleaner and easier to write!
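As a rough illustration of the multicore half of that effort, here is a sketch of the same sin/exp loop spread across CPU cores with OpenMP. The function name set_array_cpu is hypothetical, explicit SSE/AVX code is not shown, and this is not the code behind any timings quoted in this note.

```cpp
#include <cmath>
#include <vector>

// CPU version of the sin/exp kernel. The pragma asks OpenMP to divide the
// loop iterations among cores (enable it with a flag such as -fopenmp).
// If OpenMP is not enabled, the pragma is ignored and the loop runs
// serially, computing the same values.
void set_array_cpu(std::vector<float> &vals, float param) {
    const long n = (long)vals.size();
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        float x = i + param;
        for (int reps = 0; reps < 10; reps++)
            x = std::sin(std::exp(x));
        vals[i] = x;
    }
}
```

Each iteration writes only its own vals[i], so the iterations are independent and need no locking, which is the same property that lets the GPU version run one thread per element.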
CUDA is a huge language with tons of other features we don't have time to cover here--take CS 441 to get the full details. One downside of CUDA is that it's only available on NVIDIA GPUs, not on AMD's or Intel's increasingly capable GPUs.
There are a number of other GPU programming languages available, most of which are cross platform, including:
CS 301 Lecture Note, 2014, Dr. Orion Lawlor, UAF Computer Science Department.