Programmable Shaders with GLSL

CS 481 Lecture, Dr. Lawlor

Non-Programmable Shaders Stink

Back in the day (2000 AD), graphics cards had finally managed to compute all of OpenGL in hardware.  They had hardware projection matrices, hardware clipping, hardware transform-and-lighting, hardware texturing, and so on.  Folks were thrilled, because glQuake looked amazing and ran great.

There's a problem with hardware, though.  It's hard to change.

And no two programmers ever want to do, say, bump mapping exactly the same way.  Some want shadows.  Some want bump-and-reflect.  Some want bump-and-light.  Some want light-and-bump.  nVidia and ATI were going crazy trying to support every developer's crazy desires in hardware.  For example, my ATI card still supports these OpenGL extensions, just for variations on bump/environment mapping:
GL_EXT_texture_env_add, GL_ARB_texture_env_add, GL_ARB_texture_env_combine, 
GL_ARB_texture_env_crossbar, GL_ARB_texture_env_dot3, GL_ARB_texture_mirrored_repeat,
GL_ATI_envmap_bumpmap, GL_ATI_texture_env_combine3, GL_ATIX_texture_env_combine3,
GL_ATI_texture_mirror_once, GL_NV_texgen_reflection, GL_SGI_color_matrix, ...
This was no good.  Programmers had good ideas they couldn't get into hardware.  Programmers were frustrated trying to understand what the heck the hardware guys had created.  Hardware folks were tearing their hair out trying to support "just one more feature" with limited hardware.

The solution to the "too many shading methods to support in hardware" problem is to support every possible shading method in hardware.  The easy way to do that is to make the shading hardware programmable.

So, they did.

Programmable Shaders are Very Simple in Practice

The graphics hardware now lets you do anything you want to incoming vertices and fragments.  Your "vertex shader" code literally gets control and figures out where an incoming glVertex should be shown onscreen, then your "fragment shader" figures out what color each pixel should be.

Here's what this looks like.  The following is C++ code, relying on the "makeProgramObject" shader-handling function listed below.  The vertex and fragment shaders are the strings in the middle.  These are very simple shaders, but they can get arbitrarily complicated.
void my_display(void) {
	glClearColor(0,0,0,0); /* erase screen to black */
	glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);

	/* Set up programmable shaders */
	static GLhandleARB prog=makeProgramObject(
		"//GLSL Vertex shader\n"
		"void main(void) {\n"
		"  gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
		"}\n"
	,
		"//GLSL Fragment (pixel) shader\n"
		"void main(void) {\n"
		"  gl_FragColor=vec4(1,0,0,1); /* that is, all pixels are red. */\n"
		"}\n"
	);
	glUseProgramObjectARB(prog);

	... glBegin, glVertex, etc. Ordinary drawing here runs with the above shaders! ...

	glutSwapBuffers(); /* as usual... */
}
A few meta-observations first:
The stuff in strings is all "OpenGL Shading Language" (GLSL) code.  The official GLSL Language Specification isn't too bad--chapter 7 lists the builtin variables, chapter 8 the builtin functions.  Just think of GLSL as plain old C++ with a nice set of 3D vector classes, and you're pretty darn close.  Data types in GLSL work exactly like in C/C++/Java/C#.  There are some beautiful builtin datatypes, though: vec2, vec3, and vec4 hold two, three, or four floats and support componentwise arithmetic; mat2, mat3, and mat4 are small square matrices; and sampler2D lets you read from a texture.
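For example, here's a made-up little fragment shader (not one of the demo shaders below) showing those vector types in action--constructors, componentwise arithmetic, and "swizzling" components by name:
	"//GLSL Fragment shader\n"
	"void main(void) {\n"
	"  vec3 a=vec3(1.0,0.5,0.25); /* three floats in one variable */\n"
	"  vec3 b=0.5*a+vec3(0.1);    /* componentwise arithmetic */\n"
	"  gl_FragColor=vec4(b.zyx,1.0); /* a 'swizzle' reorders components */\n"
	"}\n"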
Bottom line: programmable shaders really are quite easy to use.

Example GLSL Shaders

Try these out in the 481_glsl demo program! (Zip, Tar-gzip)

Stretch out incoming X coordinates, by dividing by z:
	"//GLSL Vertex shader\n"
"void main(void) {\n"
" vec4 sv = gl_Vertex; \n"
" sv.x=sv.x/(1.0+sv.z); /* stretch! */\n"
" gl_Position=gl_ModelViewProjectionMatrix * sv;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"void main(void) {\n"
" gl_FragColor=vec4(1,0,0,1);\n"
"}\n"
Transmit the incoming vertex colors (gl_Color) to the fragment shader, where they're multiplied by red:
	"//GLSL Vertex shader\n"
"varying vec4 myColor; /*<- goes to fragment shader */ \n"
"void main(void) {\n"
" myColor = gl_Color;\n"
" gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"varying vec4 myColor; /*<- comes from vertex shader */ \n"
"void main(void) {\n"
" gl_FragColor=vec4(1,0,0,1)*myColor;\n"
"}\n"
Color incoming vertices by their position (red=X, green=Y, blue=Z; a common debugging trick!):
	"//GLSL Vertex shader\n"
"varying vec4 myColor; /*<- goes to fragment shader */ \n"
"void main(void) {\n"
" myColor = gl_Vertex; /* color-by-position */\n"
" gl_Position=gl_ModelViewProjectionMatrix * gl_Vertex;\n"
"}\n"
,
"//GLSL Fragment shader\n"
"varying vec4 myColor; /*<- comes from vertex shader */ \n"
"void main(void) {\n"
" gl_FragColor=myColor;\n"
"}\n"

The Joy(?) of the OpenGL Interface

Like many APIs, OpenGL takes a lot of calls to get anything useful done.  Programmable shaders are especially call-heavy--for each of the vertex and fragment shaders, you've got to create a GLhandleARB "ShaderObject", put in your source code, compile that source code, and check for compile errors.  Then you've got to create a "ProgramObject", attach the vertex and fragment shaders, link the program, check for link errors, and finally "glUseProgramObjectARB".  Then you can render stuff.

The code below does everything but the "glUseProgramObjectARB" call and the rendering.  I've used it for years, and haven't looked at it since 2005.  I can't recommend reading it, or the official ARB_shader_objects extension, which describes how these functions work in excruciating, unintelligible legalese.
#include <GL/glew.h> /*<- for gl...ARB extensions.  Must call glewInit after glutCreateWindow! */
#include <stdio.h>
#include <stdlib.h> /* for "exit" */
#include <iostream> /* for std::cout */
#include <string>
#include <fstream>

// Print an error and exit if this object had a compile error.
void checkShaderOp(GLhandleARB obj,int errtype,const char *where)
{
	GLint compiled;
	glGetObjectParameterivARB(obj,errtype,&compiled);
	if (!compiled) {
		printf("Compile error on program: %s\n",where);
		enum {logSize=10000};
		char log[logSize]; GLsizei len=0;
		glGetInfoLogARB(obj, logSize,&len,log);
		printf("Error Log: \n%s\n",log); exit(1);
	}
}
// Create a vertex or fragment shader from this code.
GLhandleARB makeShaderObject(GLenum target,const char *code)
{
	GLhandleARB h=glCreateShaderObjectARB(target);
	glShaderSourceARB(h,1,&code,NULL);
	glCompileShaderARB(h);
	checkShaderOp(h,GL_OBJECT_COMPILE_STATUS_ARB,code);
	return h;
}
// Create a complete shader object from these chunks of GLSL shader code.
// You still need to glUseProgramObjectARB(return value);
// THIS IS THE FUNCTION YOU PROBABLY *DO* WANT TO CALL!!!! RIGHT HERE!!!!
GLhandleARB makeProgramObject(const char *vertex,const char *fragment)
{
	if (glUseProgramObjectARB==0)
	{ /* glew never set up, or OpenGL is too old... */
		std::cout<<"Error! OpenGL hardware or software too old--no GLSL!\n";
		exit(1);
	}
	GLhandleARB p=glCreateProgramObjectARB();
	glAttachObjectARB(p,
		makeShaderObject(GL_VERTEX_SHADER_ARB,vertex));
	glAttachObjectARB(p,
		makeShaderObject(GL_FRAGMENT_SHADER_ARB,fragment));
	glLinkProgramARB(p);
	checkShaderOp(p,GL_OBJECT_LINK_STATUS_ARB,"link");
	return p;
}
// Read an entire file into a C++ string.
std::string readFileIntoString(const char *fName) {
	char c; std::string ret;
	std::ifstream f(fName);
	if (!f) {ret="Cannot open file ";ret+=fName; return ret;}
	while (f.read(&c,1)) ret+=c;
	return ret;
}
// Create a complete shader object from these GLSL files.
GLhandleARB makeProgramObjectFromFiles(const char *vFile="vertex.txt",
                                       const char *fFile="fragment.txt")
{
	return makeProgramObject(
		readFileIntoString(vFile).c_str(),
		readFileIntoString(fFile).c_str()
	);
}
Do what I do, kids: write and debug the above code *once*, wrap it in a nice library, call it from everywhere, and get on with your life!
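In practice your display function then boils down to something like this (just a sketch, using the default file names from makeProgramObjectFromFiles above):
void my_display(void) {
	glClear(GL_COLOR_BUFFER_BIT|GL_DEPTH_BUFFER_BIT);
	/* Build the program object once, the first time through; reuse it after that. */
	static GLhandleARB prog=makeProgramObjectFromFiles("vertex.txt","fragment.txt");
	glUseProgramObjectARB(prog);
	/* ... ordinary glBegin/glVertex drawing here, shaded by the files above ... */
	glutSwapBuffers();
}
Keeping the shaders in their own files also means you can edit them without recompiling your C++.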

Programmable Shaders are Very Weird Underneath

"May your children live in historic times, and come to the attention of the Emperor."
   - (Supposedly) Ancient Chinese Curse.  (Hint: are historic times usually good times?)

These are very interesting and historic times for computer science in general, and computer graphics in particular.  For fifty years, we've built faster and faster machines that all operate in exactly the same way--they execute the instructions of a stored program (conceptually) one at a time.  Consider that Fortran was invented in 1956, for vacuum-tube based computers, but it's still a viable language for programming a Core2 Solo.

Sadly, running one top-down program is just not enough anymore, and a sequential Fortran program only uses one core of a Core2 Duo.

Today (2004+) is the dawn of a new era of parallelism, from multi-core CPUs to Field-Programmable Gate Arrays (programmed in VHDL) and other, weirder logic.  Graphics cards are actually among the most interesting parallel hardware out there.

Consider the job of running your pixel shader.  Your shader compiles into 20 machine-code instructions.  Each of those instructions is likely to take ten clock cycles or more (because floating-point is slow, there's a divide, etc.).  A normal CPU will start the first instruction of the first pixel, and because the second instruction of the first pixel depends on the output of the first instruction, even a fancy superscalar CPU has to just sit there and wait until the first instruction finishes.  But your screen has a million pixels.  So you have to wait 20 instructions/pixel * 10 cycles/instruction * 1 million pixels = 200 million clock cycles--at a 2 GHz clock, about 1/10 second--before the rendering is finished.

By contrast, a GPU pixel shader unit knows how big the screen is.  So one GPU pixel shader unit will actually immediately fire off the first instruction of the *second* pixel before it's even done with the first instruction of the first pixel.  The GPU pixel shader unit will be starting the first instruction of the *tenth* pixel before the first instruction is finished, and will have *two hundred* pixels "in flight" before the first pixel's shader is complete! Said another way, a GPU uses the natural parallelism of the graphics rendering problem to keep the arithmetic pipelines full, eventually cranking out one result every clock cycle.  Ignoring the 200-cycle startup time, you only have to wait 20 million clock cycles, or 1/100 second, before the screen is finished rendering.

Read that again.  The GPU is ten times faster, because each pixel unit is busy executing ten instructions at once.
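If you want to check the arithmetic, here's the same back-of-the-envelope estimate as a little standalone program (the 2 GHz clock rate is my assumption; the instruction count and latency are the round numbers from above):
#include <stdio.h>
int main(void) {
	double instsPerPixel=20, cyclesPerInst=10, pixels=1e6, clockHz=2e9;
	/* CPU-style: every instruction waits out its full latency, one pixel at a time. */
	double serialCycles=instsPerPixel*cyclesPerInst*pixels;
	/* GPU-style: once the pipeline is full, one instruction finishes every cycle,
	   plus the ~200-cycle startup cost to fill the pipeline. */
	double pipelinedCycles=instsPerPixel*pixels + instsPerPixel*cyclesPerInst;
	printf("CPU-style: %.0f Mcycles = %.3f s\n",serialCycles*1e-6,serialCycles/clockHz);
	printf("GPU-style: %.0f Mcycles = %.3f s\n",pipelinedCycles*1e-6,pipelinedCycles/clockHz);
	return 0;
}
That prints 200 Mcycles (0.100 s) versus 20 Mcycles (0.010 s)--the factor of ten described above, and that's still only a single pixel shader unit.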

It gets better.  Even the crappiest embedded motherboard graphics card has at least two pixel units.  A high-end card like an nVidia GeForce 8800 has 32 pixel shader units.  So a GPU might actually be several hundred times faster than a sequential machine, and keep several thousand instructions "in flight" and in progress at once.

Superscalar CPUs dream about being able to achieve this sort of parallelism, but they have to painfully, carefully squeeze parallelism from dry sequential machine code designed in the 1950s, dodging dependencies all the way.  A GPU is hooked directly to the fire hose of pure natural parallelism inherent in the graphics problem (and many other problems in today's world).

The beautiful part about this is that pixel-level parallelism lets you hide almost any source of latency: slow floating-point arithmetic, texture fetches, even off-chip memory reads.  While one pixel's work is stalled, the shader unit simply starts in on other pixels.
Bottom line: GPUs run fast by exploiting rendering's inherent problem-level parallelism.