Graphics Processing Unit (GPU) Programming in GLSL and CUDA

CS 301 Lecture, Dr. Lawlor

A decade ago, if you wanted to write a "pixel program" to run on the graphics card, you had to write nasty, unportable and very low-level code that only worked with one manufacturer's cards.  Today, you can write code using a variety of C++-lookalike languages, and then run the exact same code on your ATI and nVidia cards; and build binaries that work the same on your Windows machine, Linux box, or Mac OS X machine. 

The languages available include GLSL, the OpenGL Shading Language used for graphics-style pixel programs, and CUDA, nVidia's C++-like language for general-purpose GPU computing; both are covered below.

The bottom line is that the graphics card is more or less "just another computer" inside your computer.  The biggest difference is that your programs run once for each *pixel* (in parallel), not just *once* (in serial).  This is a surprisingly powerful difference.
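
To see that difference concretely, here's a hypothetical C++ sketch of what rendering looks like on the CPU (shade, width, height, and pixels are made-up names, purely for illustration).  On the CPU you write the "for every pixel" loop yourself; on the GPU you write only the body, and the card supplies the loop.

#include <vector>
/* Made-up per-pixel function, playing the role of a GLSL pixel program. */
float shade(int x,int y) { return (x+y)%2; }
int main(void) {
    int width=640, height=480;
    std::vector<float> pixels(width*height);
    /* CPU style: we write the per-pixel loop ourselves, and it runs in serial. */
    for (int y=0;y<height;y++)
        for (int x=0;x<width;x++)
            pixels[y*width+x]=shade(x,y); /* one call per pixel */
    /* GPU style: we write only shade(); the card runs it on every pixel, in parallel. */
    return 0;
}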

Graphics Style: GLSL

The OpenGL Shading Language (GLSL) looks and feels a lot like C++.  The hardware is like multicore + SSE: on the graphics card everything runs in parallel (so no writeable globals!), and variables are essentially all floats, either a single "float" or a four-float register called a "vec4".  You can also declare variables of type "vec2" and "vec3", and even "int", although int is often emulated with float.  You can think of a vec4 as storing "RGBA" colors or "XYZW" 3D positions, or you can just think of it as four adjacent floats.
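
To get a feel for these types, here's a small sketch of GLSL vector construction and "swizzling" (picking out components by name).  The variable names are just illustrative, and the last line writes to the pixel color output described next:

vec4 color=vec4(1.0,0.5,0.0,1.0); /* four floats at once: an "RGBA" orange */
vec2 uv=color.xy;                 /* "swizzle" out the first two components by name */
vec3 pos=vec3(uv,0.0);            /* constructors combine smaller vectors and scalars */
color=0.5*color+vec4(0.25);       /* arithmetic operates on all four floats in parallel */
gl_FragColor=vec4(color.rgb,1.0); /* .rgb is the same swizzle as .xyz */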

The output of your GLSL program is "gl_FragColor", the color of the pixel your program is rendering.  Your program runs once for each pixel onscreen--in principle, rendering hundreds of thousands of pixels at once!

The simplest OpenGL fragment program returns a constant color, like this program, which always returns red pixels:
gl_FragColor=vec4(1,0,0,0);

(Try this in NetRun now!)

Note there's no loop here, but this program by definition runs on every pixel.  In a bigger graphics program, you can control the pixels you want drawn using some polygon geometry, and the program runs on every pixel touched by that geometry.

Here's what our incoming texture coordinates (2D location onscreen) look like:
gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);

(Try this in NetRun now!)

0 means black, and 1 means fully saturated color (all colors saturated means white).  The X coordinate of the onscreen location becomes the Red component of the output color--note how the image gets redder from left to right.  The Y coordinate of the onscreen location becomes the Green component of the output color--note how the image gets greener from bottom to top.  Red and green add up to yellow, so the top-right corner (where X=Y=1) is yellow.

We can make the left half of the screen red, and the right half blue, like this:
if (texcoords.x<0.5) /* left half of screen? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */

(Try this in NetRun now!)

We can make a little circle in the middle of the screen red like this:
float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */

(Try this in NetRun now!)

We can make a whole grid of circles across the screen like this:
vec2 v=fract(texcoords*10.0); /* tile the 0-1 texture coordinates ten times across the screen */
float x=(v.x-0.5)*1.4, y=v.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
    gl_FragColor=vec4(1,0,0,0); /* red */
else
    gl_FragColor=vec4(0,0,1,0); /* blue */

(Try this in NetRun now!)

We can make smooth transitions between the colors with a blending operation; this is the "fuzzy logic" equivalent of our mask-based if-then-else implementation.
float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
float greeny=sin(30.0*radius*3.14159265); // fract(10.0*radius);
greeny=greeny*greeny; // sin-squared

vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=greeny*green+(1.0-greeny)*notgreen;

(Try this in NetRun now!)

Graphics cards also support loops, like this loop over different sine-wave sources:

float finalgreen=0.0;
for (int source=0;source<2;source++)
{
    float x=texcoords.x-0.4-0.1*source, y=texcoords.y-0.5;
    float radius=sqrt(x*x+y*y);
    float greeny=sin(30.0*radius*3.14159265); // fract(10.0*radius);
    greeny=greeny*greeny; // sin-squared
    finalgreen+=greeny*0.33;
}

vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=finalgreen*green+(1.0-finalgreen)*notgreen;

(Try this in NetRun now!)

Note the performance of this on the graphics card (a GeForce GTX 280) is 0.1 nanoseconds per pixel.  (Essentially) the same code run on the CPU takes 200 nanoseconds per pixel, which is, er, two *thousand* times slower.  Ouch!  Here's that CPU version, in plain C++:

static float texcoords_x=0.1, texcoords_y=0.2; texcoords_x+=0.1; /* CPU stand-ins for the GLSL texcoords */
float finalgreen=0.0;
for (int source=0;source<2;source++)
{
    float x=texcoords_x-0.4-0.1*source, y=texcoords_y-0.5;
    float radius=sqrt(x*x+y*y);
    float greeny=sin(30.0*radius*3.14159265); // fract(10.0*radius);
    greeny=greeny*greeny; // sin-squared
    finalgreen+=greeny*0.33;
}

float green=texcoords_y; // vec4(0,texcoords.y,0,0);
float notgreen=100.0; // vec4(1.0,0.0,0.6,0);
return finalgreen*green+(1.0-finalgreen)*notgreen;

(Try this in NetRun now!)

Another example, here's the Mandelbrot set fractal rendered on the graphics card:
vec2 c=vec2(2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen*/
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
    /* stop once z gets big: squared length of z >= 4.0, i.e. |z| >= 2.0 */
    if (z.r*z.r+z.g*z.g>=4.0) break;
    /* z = z^2 + c; (where z and c are complex numbers) */
    z=vec2(
        z.r*z.r-z.g*z.g,
        2.0*z.r*z.g
    )+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));

(Try this in NetRun now!)

Here's a greenscreen oscilloscope-style trace:

float x=texcoords.x*20.0-10.0;
float y=texcoords.y*3.0-1.5;
float hit=1.0-10.0*abs(y-sin(x)/x);
gl_FragColor=vec4(0.0,hit,0.0,0.0); /* R, G, B, A */

(Try this in NetRun now!)

The bottom line is that GLSL is quite expressive, but it's very graphics-centric.  There's no printf; about the only way to debug is to make suspicious pixels purple, which is really a pain!
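
For example, here's a hypothetical sketch of that color-based debugging (radius just stands in for whatever intermediate value you're checking):

float radius=length(texcoords-vec2(0.5)); /* some intermediate value you want to inspect */
if (radius>0.5) /* flag the suspect pixels in purple so they jump out onscreen */
    gl_FragColor=vec4(1.0,0.0,1.0,0.0);
else /* otherwise visualize the value itself as a gray level */
    gl_FragColor=vec4(radius,radius,radius,0.0);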

CUDA: Beyond Graphics

Here's an example of CUDA, which lets you actually use pointers and classes on the graphics card, and which is used for all sorts of non-graphics problems.  There's tons of setup code, but the key operation happening on the GPU is just putting a value into each element of an array.
#include <iostream>
#include <cuda.h>
/* error checking macro */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
    vals[i]=i+param;
}
int main(int argc,char *argv[]) {
    int w=4, h=4; /* number of blocks, threads per block */
    int n=w*h; /* total number of floats */
    float *vals; /* device array of n values */
    check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space

    set_array<<<w,h>>>(vals,0.1234); /* Initialize the space on the GPU */

    /* Copy a few elements back to CPU for printing */
    for (int i=0;i<n;i+=3) {
        float f=-999.0; /* CPU copy of value */
        check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
        std::cout<<"vals["<<i<<"] = "<<f<<"\n";
    }
    return 0;
}

(Try this in NetRun now!)

Basically, we: allocate space for the array on the GPU with cudaMalloc, launch the set_array kernel (w blocks of h threads each) to fill that space, and then cudaMemcpy a few of the values back to the CPU to print them.

Several interesting architectural details crop up in the CUDA documentation:

CUDA Concept: thread
    Purpose: One single execution of a kernel.  This is mostly conceptual, since the hardware operates on warps of threads.
    How big? 1
    CPU equivalent: a function call

CUDA Concept: warp (of threads)
    Purpose: A group of 32 threads that all take the same branches.  The hardware does a good job with predication, so warps aren't "in your face" like with SSE instructions.
    How big? 32 "threads"
    CPU equivalent: a SIMD unit ("floats")

CUDA Concept: block (of threads)
    Purpose: A group of a few hundred threads that have access to the same "__shared__" memory.  The block size is specified in software, though limited by hardware to 512 or 1024 threads maximum; typically 256 threads per block performs best.  Serious slowdowns for non-power-of-two sizes, or fewer than 100 or so threads per block.  The PTX manual calls blocks "CTAs" (Cooperative Thread Arrays).
    How big? About 256 threads
    CPU equivalent: one multicore thread (blocks run SMP/SMT style)

CUDA Concept: kernel
    Purpose: A set of blocks of threads, all running one __global__ function.
    How big? About a million threads
    CPU equivalent: a parallel loop
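
To see how threads, blocks, and kernels fit together in code, here's a minimal sketch (the scale_array kernel and the 256-thread block size are illustrative choices, not part of the example above).  Each thread computes its own global index; the launch rounds the block count up and the kernel bounds-checks, so the array length need not be a multiple of the block size.

/* A sketch of the standard indexing idiom: one thread per array element. */
__global__ void scale_array(float *vals,int n,float factor) {
    int i=threadIdx.x+blockDim.x*blockIdx.x; /* my slot within my block, plus my block's offset */
    if (i<n) /* the last block may have a few threads past the end of the array--skip them */
        vals[i]*=factor;
}
void scale_on_gpu(float *vals,int n) { /* vals must be a device pointer (from cudaMalloc) */
    int threadsPerBlock=256; /* a typical block size (see the table above) */
    int blocks=(n+threadsPerBlock-1)/threadsPerBlock; /* round up so every element gets a thread */
    scale_array<<<blocks,threadsPerBlock>>>(vals,n,2.0f); /* the kernel: all the blocks of all the threads */
}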

Notice that despite what I think is *intentionally* obfuscated terminology, deep down there's a whole lot of similarity between GPU and CPU architectures; the GPU is just focused on wider problems.  The GPU's SIMD units are much wider, 32 floats at a time versus 8 at a time for AVX.  A typical GPU will have perhaps 8 multiprocessors (each called an SM or SMX), versus about 4 cores for a CPU.  The big difference is that each GPU multiprocessor is deeply hyperthreaded, able to keep a hundred or so threads in flight, with the limit typically set by the physical register count, nowadays an absurd 65,536 registers per multiprocessor.  So you get ridiculously good GPU performance if your code can fill all those threads, but if you've got single-threaded work, a GPU actually gives ridiculously bad performance!
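
If you want to see these numbers for your own card, a sketch like this queries them with cudaGetDeviceProperties (this assumes a reasonably recent CUDA runtime; for instance, the regsPerMultiprocessor field isn't in very old versions, and the values will of course differ by GPU):

#include <iostream>
#include <cuda_runtime.h>
int main(void) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p,0); /* ask about GPU number 0 */
    std::cout<<p.name<<"\n";
    std::cout<<"  multiprocessors (SMs): "<<p.multiProcessorCount<<"\n";
    std::cout<<"  warp size: "<<p.warpSize<<"\n";
    std::cout<<"  registers per multiprocessor: "<<p.regsPerMultiprocessor<<"\n";
    std::cout<<"  max threads per multiprocessor: "<<p.maxThreadsPerMultiProcessor<<"\n";
    return 0;
}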