Programmable Shaders with GLSL

CS 481 Lecture, Dr. Lawlor

Non-Programmable Shaders Stink

Back in the day (2000 AD), graphics cards had finally managed to compute all of OpenGL in hardware.  They had hardware projection matrices, hardware clipping, hardware transform-and-lighting, hardware texturing, and so on.  Folks were thrilled, because glQuake looked amazing and ran great.

There's a problem with hardware, though.  It's hard to change.

And no two programmers ever want to do, say, bump mapping exactly the same way.  Some want shadows.  Some want bump-and-reflect.  Some want bump-and-light.  Some want light-and-bump.  nVidia and ATI were going crazy trying to support every developer's crazy desires in hardware.  For example, my ATI card still supports these OpenGL extensions, just for variations on bump/environment mapping:
GL_EXT_texture_env_add, GL_ARB_texture_env_add, GL_ARB_texture_env_combine, 
GL_ARB_texture_env_crossbar, GL_ARB_texture_env_dot3, GL_ARB_texture_mirrored_repeat,
GL_ATI_envmap_bumpmap, GL_ATI_texture_env_combine3, GL_ATIX_texture_env_combine3,
GL_ATI_texture_mirror_once, GL_NV_texgen_reflection, GL_SGI_color_matrix, ...
This was no good.  Programmers had good ideas they couldn't get into hardware.  Programmers were frustrated trying to understand what the heck the hardware guys had created.  Hardware folks were tearing their hair out trying to support "just one more feature" with limited hardware.

The solution to the "too many shading methods to support in hardware" problem is to support every possible shading method in hardware.  The easy way to do that is to make the shading hardware programmable.

So, they did.

Programmable Shaders are Very Simple in Practice

The graphics hardware now lets you do anything you want to incoming vertices and fragments.  Your "vertex shader" code literally gets control and figures out where an incoming glVertex should be shown onscreen, then your "fragment shader" figures out what color each pixel should be.

Here's what this looks like.  The following is C++ code, relying on the "makeProgramObject" shader-handling function listed below.  The vertex and fragment shaders are the strings in the middle.  These are very simple shaders, but they can get arbitrarily complicated.
void my_display(void) {
	glClearColor(0,0,0,0); /* erase screen to black */
	glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);

	/* Set up programmable shaders */
	static GLhandleARB prog=makeProgramObject(
		"//GLSL Vertex shader\n"
		"void main(void) {\n"
		"  gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
		"}\n"
	,
		"//GLSL Fragment (pixel) shader\n"
		"void main(void) {\n"
		"  gl_FragColor=vec4(1,0,0,1); /* that is, all pixels are red. */\n"
		"}\n"
	);
	glUseProgramObjectARB(prog);

	... glBegin, glVertex, etc. Ordinary drawing here runs with the above shaders! ...

	glutSwapBuffers(); /* as usual... */
}
A few meta-observations first:
The stuff in strings is all "OpenGL Shading Language" (GLSL) code.  The official GLSL Language Specification isn't too bad--chapter 7 lists the builtin variables, chapter 8 the builtin functions.  Just think of GLSL as plain old C++ with a nice set of 3D vector classes, and you're pretty darn close.  Data types in GLSL work exactly like in C/C++/Java/C#.  There are some beautiful builtin datatypes, though: vec2, vec3, and vec4 hold two, three, or four floats and support componentwise arithmetic; mat2, mat3, and mat4 are small square matrices; and sampler2D lets you read from a texture.
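For example, here's a made-up little fragment shader (not one of the demo shaders below) showing those vector types in action--constructors, componentwise arithmetic, and "swizzling" components by name:
	"//GLSL Fragment shader\n"
	"void main(void) {\n"
	"  vec3 a=vec3(1.0,0.5,0.25); /* three floats in one variable */\n"
	"  vec3 b=0.5*a+vec3(0.1);    /* componentwise arithmetic */\n"
	"  gl_FragColor=vec4(b.zyx,1.0); /* a 'swizzle' reorders components */\n"
	"}\n"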
Bottom line: programmable shaders really are quite easy to use.

Example GLSL Shaders

Try these out in the 481_glsl demo program! (Zip, Tar-gzip)

Stretch out incoming X coordinates, by dividing by z:
	"//GLSL Vertex shader\n"
"void main(void) {\n"
" vec4 sv = gl_Vertex; \n"
" sv.x=sv.x/(1.0+sv.z); /* stretch! */\n"
" gl_Position=gl_ModelViewProjectionMatrix * sv;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"void main(void) {\n"
" gl_FragColor=vec4(1,0,0,1);\n"
"}\n"
Transmit the incoming vertex colors (gl_Color) to the fragment shader, where they're multiplied by red:
	"//GLSL Vertex shader\n"
"varying vec4 myColor; /*<- goes to fragment shader */ \n"
"void main(void) {\n"
" myColor = gl_Color;\n"
" gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"varying vec4 myColor; /*<- comes from vertex shader */ \n"
"void main(void) {\n"
" gl_FragColor=vec4(1,0,0,1)*myColor;\n"
"}\n"
Color incoming vertices by their position (red=X, green=Y, blue=Z; a common debugging trick!):
	"//GLSL Vertex shader\n"
"varying vec4 myColor; /*<- goes to fragment shader */ \n"
"void main(void) {\n"
" myColor = gl_Vertex; /* color-by-position */\n"
" gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"varying vec4 myColor; /*<- comes from vertex shader */ \n"
"void main(void) {\n"
" gl_FragColor=myColor;\n"
"}\n"

The Joy(?) of the OpenGL Interface

Like many APIs, OpenGL takes a lot of calls to get anything useful done.  Programmable shaders are especially call-heavy--for each of the vertex and fragment shaders, you've got to create a GLhandleARB "ShaderObject", put in your source code, compile that source code, and check for compile errors.  Then you've got to create a "ProgramObject", attach the vertex and fragment shaders, link the program, check for link errors, and finally "glUseProgramObjectARB".  Then you can render stuff.

The code below does everything but the "glUseProgramObjectARB" call and the rendering.  I've used it for years, and haven't looked at it since 2005.  I can't recommend reading it, or the official ARB_shader_objects extension, which describes how these functions work in excruciating, unintelligible legalese.
#include <GL/glew.h> /*<- for gl...ARB extensions.  Must call glewInit after glutCreateWindow! */
#include <stdio.h>
#include <stdlib.h> /* for "exit" */
#include <iostream> /* for std::cout */
#include <string>
#include <fstream>

// Print an error and exit if this object had a compile error.
void checkShaderOp(GLhandleARB obj,int errtype,const char *where)
{
	GLint compiled;
	glGetObjectParameterivARB(obj,errtype,&compiled);
	if (!compiled) {
		printf("Compile error on program: %s\n",where);
		enum {logSize=10000};
		char log[logSize]; GLsizei len=0;
		glGetInfoLogARB(obj, logSize,&len,log);
		printf("Error Log: \n%s\n",log); exit(1);
	}
}
// Create a vertex or fragment shader from this code.
GLhandleARB makeShaderObject(GLenum target,const char *code)
{
	GLhandleARB h=glCreateShaderObjectARB(target);
	glShaderSourceARB(h,1,&code,NULL);
	glCompileShaderARB(h);
	checkShaderOp(h,GL_OBJECT_COMPILE_STATUS_ARB,code);
	return h;
}
// Create a complete shader object from these chunks of GLSL shader code.
// You still need to glUseProgramObjectARB(return value);
// THIS IS THE FUNCTION YOU PROBABLY *DO* WANT TO CALL!!!! RIGHT HERE!!!!
GLhandleARB makeProgramObject(const char *vertex,const char *fragment)
{
	if (glUseProgramObjectARB==0)
	{ /* glew never set up, or OpenGL is too old... */
		std::cout<<"Error! OpenGL hardware or software too old--no GLSL!\n";
		exit(1);
	}
	GLhandleARB p=glCreateProgramObjectARB();
	glAttachObjectARB(p,
		makeShaderObject(GL_VERTEX_SHADER_ARB,vertex));
	glAttachObjectARB(p,
		makeShaderObject(GL_FRAGMENT_SHADER_ARB,fragment));
	glLinkProgramARB(p);
	checkShaderOp(p,GL_OBJECT_LINK_STATUS_ARB,"link");
	return p;
}
// Read an entire file into a C++ string.
std::string readFileIntoString(const char *fName) {
	char c; std::string ret;
	std::ifstream f(fName);
	if (!f) {ret="Cannot open file ";ret+=fName; return ret;}
	while (f.read(&c,1)) ret+=c;
	return ret;
}
// Create a complete shader object from these GLSL files.
GLhandleARB makeProgramObjectFromFiles(const char *vFile="vertex.txt",
                                       const char *fFile="fragment.txt")
{
	return makeProgramObject(
		readFileIntoString(vFile).c_str(),
		readFileIntoString(fFile).c_str()
	);
}
Do what I do, kids: write and debug the above code *once*, wrap it in a nice library, call it from everywhere, and get on with your life!
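In practice your display function then boils down to something like this (just a sketch, using the default file names from makeProgramObjectFromFiles above):
void my_display(void) {
	glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);
	/* Build the program object once, the first time through; reuse it after that. */
	static GLhandleARB prog=makeProgramObjectFromFiles("vertex.txt","fragment.txt");
	glUseProgramObjectARB(prog);
	/* ... ordinary glBegin/glVertex drawing here, shaded by the files above ... */
	glutSwapBuffers();
}
Keeping the shaders in their own files also means you can edit them without recompiling your C++.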

Programmable Shaders are Very Weird Underneath

"May your children live in historic times, and come to the attention of the Emperor."
   - (Supposedly) Ancient Chinese Curse.  (Hint: are historic times usually good times?)

These are very interesting and historic times for computer science in general, and computer graphics in particular.  For fifty years, we've built faster and faster machines that all operate in exactly the same way--they execute the instructions of a stored program (conceptually) one at a time.  Consider that Fortran was invented in 1956, for vacuum-tube based computers, but it's still a viable language for programming a Core2 Solo.

Sadly, running one top-down program is just not enough anymore, and a sequential Fortran program only uses one core of a Core2 Duo.

Today (2004+) is the dawn of a new era of parallelism, from multi-core CPUs to Field-Programmable Gate Arrays (programmed in VHDL) and other, weirder logic.  Graphics cards are actually among the most interesting parallel hardware out there.

Consider the job of running your pixel shader.  Your shader compiles into 20 machine-code instructions.  Each of those instructions is likely to take ten clock cycles or more (because floating-point is slow, there's a divide, etc.).  A normal CPU will start the first instruction of the first pixel, and because the second instruction of the first pixel depends on the output of the first instruction, even a fancy superscalar CPU has to just sit there and wait until the first instruction finishes.  But your screen has a million pixels.  So you have to wait 20 instructions/pixel * 10 cycles/instruction * 1 million pixels = 200 million clock cycles--at a 2 GHz clock, about 1/10 second--before the rendering is finished.

By contrast, a GPU pixel shader unit knows how big the screen is.  So one GPU pixel shader unit will actually immediately fire off the first instruction of the *second* pixel before it's even done with the first instruction of the first pixel.  The GPU pixel shader unit will be starting the first instruction of the *tenth* pixel before the first instruction is finished, and will have *two hundred* pixels "in flight" before the first pixel's shader is complete! Said another way, a GPU uses the natural parallelism of the graphics rendering problem to keep the arithmetic pipelines full, eventually cranking out one result every clock cycle.  Ignoring the 200-cycle startup time, you only have to wait 20 million clock cycles, or 1/100 second, before the screen is finished rendering.

Read that again.  The GPU is ten times faster, because each pixel unit is busy executing ten instructions at once.
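If you want to check the arithmetic, here's the same back-of-the-envelope estimate as a little standalone program (the 2 GHz clock rate is my assumption; the instruction count and latency are the round numbers from above):
#include <stdio.h>
int main(void) {
	double instsPerPixel=20, cyclesPerInst=10, pixels=1e6, clockHz=2e9;
	/* CPU-style: every instruction waits out its full latency, one pixel at a time. */
	double serialCycles=instsPerPixel*cyclesPerInst*pixels;
	/* GPU-style: once the pipeline is full, one instruction finishes every cycle,
	   plus the ~200-cycle startup cost to fill the pipeline. */
	double pipelinedCycles=instsPerPixel*pixels + instsPerPixel*cyclesPerInst;
	printf("CPU-style: %.0f Mcycles = %.3f s\n",serialCycles*1e-6,serialCycles/clockHz);
	printf("GPU-style: %.0f Mcycles = %.3f s\n",pipelinedCycles*1e-6,pipelinedCycles/clockHz);
	return 0;
}
That prints 200 Mcycles (0.100 s) versus 20 Mcycles (0.010 s)--the factor of ten described above, and that's still only a single pixel shader unit.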

It gets better.  Even the crappiest embedded motherboard graphics card has at least two pixel units.  A high-end card like an nVidia GeForce 8800 has 32 pixel shader units.  So a GPU might actually be several hundred times faster than a sequential machine, and keep several thousand instructions "in flight" and in progress at once.

Superscalar CPUs dream about being able to achieve this sort of parallelism, but they have to painfully, carefully squeeze parallelism from dry sequential machine code designed in the 1950s, dodging dependencies all the way.  A GPU is hooked directly to the fire hose of pure natural parallelism inherent in the graphics problem (and many other problems in today's world).

The beautiful part about this is that pixel-level parallelism lets you hide almost any source of latency: slow floating-point arithmetic, texture fetches, even off-chip memory reads.  While one pixel's work is stalled, the shader unit simply starts in on other pixels.
Bottom line: GPUs run fast by exploiting rendering's inherent problem-level parallelism.