Parallelism: Graphics Card Programming
CS 301 Lecture, Dr. Lawlor
Graphics Card Performance
Graphics cards are now several times faster than the CPU. How do they achieve this speed?
It's not because graphics card designers are better paid or smarter than CPU designers, or that the industry is so much bigger:
- Graphics card maker nVidia takes in around $2 billion per year, and has about 2,000 employees (source). ATI is about the same size.
- CPU maker Intel takes in over $30 billion per year, and has about 85,000 employees (source). AMD has $5 billion/year sales, and 16,000 employees.
The difference is that graphics cards run "pixel programs"--a sequence
of instructions to calculate the color of one pixel. The programs
for two adjacent pixels cannot
interact with one another, which means that all the pixel programs are
independent of each other. This implies all the pixels can be
rendered in parallel, with no waiting or synchronization between pixels.
Read that again. That means graphics cards execute a parallel programming language.
Parallelism theoretically allows you to get lots of computing done at a
very low cost. For example, say you've got a 1000x1000 pixel
image. That's a million pixels. If you can build a circuit
to do one floating-point operation to those pixels in 1ns (one
billionth of a second, a typical flop speed nowadays), and you can fit
a million of those circuits on one chip (this is the part that can't be
done at the moment), you've just built a 1,000 teraflop computer.
That's three times faster than the fastest computer in the world, the
$100 million, 128,000-way parallel Blue Gene.
We're not there yet, because we can't fit that much floating-point
circuitry on one chip, but this is the advantage of parallel execution.
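As a quick check of the arithmetic above, here is a minimal Python sketch of the hypothetical one-circuit-per-pixel chip. The variable names are made up for illustration, and integer arithmetic is used so the result is exact.

```python
# Hypothetical chip: one floating-point circuit per pixel of a 1000x1000 image.
pixels = 1000 * 1000                  # a million pixels, so a million circuits
flops_per_circuit_per_second = 10**9  # one flop every 1 ns
flops_per_second = pixels * flops_per_circuit_per_second
teraflops = flops_per_second // 10**12
print(teraflops, "teraflops")
```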
As of 2006, the fastest graphics card on the market
renders at least 32 pixels simultaneously. Values stored at each
pixel consist of four 32-bit IEEE floating-point numbers. This
means every clock cycle, the cards are operating on 128 floats at
once. The "LRP" instruction does about 3 flops per float, and
executes in a single clock. At a leisurely 1GHz, the $500 32-pipe nVidia
GeForce 8800 thus would do at least:
3 flops/float * 4 floats/pixel * 32 pixels/clock * 1 Gclocks/second = 384 billion flops/second (384 gigaflops)
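The same estimate can be written as a small Python helper, which makes it easy to plug in other pipe counts or clock rates. The function name `peak_gigaflops` is made up for this sketch, and the result is a theoretical peak, not a measured speed.

```python
def peak_gigaflops(flops_per_float, floats_per_pixel, pixels_per_clock, clock_ghz):
    """Theoretical peak: every pipe retires one instruction on every clock."""
    return flops_per_float * floats_per_pixel * pixels_per_clock * clock_ghz

# The figures from the text: LRP at 3 flops/float, 4 floats/pixel, 32 pixels/clock, 1 GHz.
geforce_8800 = peak_gigaflops(3, 4, 32, 1)
print(geforce_8800, "gigaflops")
```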
Recall that a regular FPU only handles one or two (with superscalar
execution) floats at a time, and the SSE/AltiVec extensions only handle
four floats at a time. Even with SSE, the Pentium 4 theoretical
peak performance is about 15 gigaflops, but I can't get more than about
3 gigaflops doing any real work. By contrast, the now-obsolete
Mobility Radeon 9600 graphics card in my laptop handles 4 pixels (16
floats) simultaneously, and pulls about 16 gigaflops, handily beating
the Pentium 4.
Graphics Card Programming
Back in 2002, if you wanted to write a "pixel program" to run on the
graphics card, you had to write nasty, unportable and very low-level
code that only worked with one manufacturer's cards. Today, you
can write code using the OpenGL ARB_fragment_program extension, or the OpenGL Shading Language, and run
the exact same code on your ATI and nVidia cards, on your Windows
machine, Linux box, or Mac OS machine.
The languages available are:
- OpenGL ARB_fragment_program, which is a very high-performance assembly code we'll look at below.
- OpenGL Shading Language,
which is a portable C-like language. Its performance isn't always quite
as good as assembly, but it's much easier to write complicated shaders.
- nVidia's Cg (C for graphics) is another portable C-like language with bindings to OpenGL or DirectX.
- Microsoft's Windows-specific DirectX High-Level Shading Language is another C-like language, best explored with ATI's Rendermonkey application.
- Microsoft's DirectX Shader Model 3.0 is an assembly-like backend for DirectX 9 cards.
OpenGL ARB_fragment_program code is hence just another assembly code.
The biggest difference is that your program runs once for each *pixel* (in parallel), not
just *once* (in serial). All variables and accesses are done on four-float
"vectors". You can think of these vectors as storing "RGBA"
colors, "XYZW" 3D positions, or you can just think of them as four
floats.
The calling convention for a pixel program in NetRun is slightly
simplified from the general case used for real graphics programs:
- Your pixel's location onscreen is stored in the variable
"in". The x coordinate of this vector gives your on-screen x
coordinate, which varies from 0 on the left side of the screen to 1 on
the right side. The y coordinate gives onscreen y, from 0 at
bottom to 1 at top. In general, input data can be obtained from texture
coordinates passed in from the calling program and/or set up inside a
"vertex program" that runs on each polygon vertex.
- Your pixel's output color is stored in the variable "out".
In general, output data can go to "result.color" (the output color),
"result.depth" (which is used for the Z buffer), or "result.color[i]"
(ARB_draw_buffers "Multiple Render Targets" arrays).
So the simplest OpenGL fragment program is this:
(Executable NetRun Link)
MOV out,in;
Note there's no loop here, but this program by definition
runs on every pixel. In general, you control the pixels you want
drawn using some polygon geometry, and the program runs on every pixel
touched by that geometry.
Note that the output goes on the *left* in this assembly, so this means
"for each pixel, set the output color equal to the input onscreen pixel
location". 0 means black, and 1 means fully saturated color (all
colors saturated means white). The X coordinate of the onscreen
location becomes the Red component of the output color--note how the
image gets redder from left to right. The Y coordinate of the
onscreen location becomes the Green component of the output color--note
how the image gets greener from bottom to top. Red and green add
up to yellow, so the top-right corner (where X=Y=1) is yellow.
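To make the "runs on every pixel" idea concrete, here is a rough CPU-side emulation in Python. It is only a sketch: `render` and `mov_program` are made-up names, the z and w inputs are assumed to be 0 and 1, and the pixel coordinates are spaced so the corners land exactly on 0 and 1, which is not necessarily how NetRun or a real card samples them. The explicit loops stand in for what the hardware does in parallel.

```python
def mov_program(in_vec):
    # Emulates "MOV out,in;": the output color is the input location.
    return in_vec

def render(program, width, height):
    """Run program once per pixel; each call sees only its own input."""
    image = {}
    for py in range(height):          # py = 0 is the bottom row
        for px in range(width):       # px = 0 is the left column
            in_vec = (px / (width - 1), py / (height - 1), 0.0, 1.0)
            image[(px, py)] = program(in_vec)
    return image

image = render(mov_program, 3, 3)
print(image[(0, 0)])  # bottom-left: no red, no green (black)
print(image[(2, 0)])  # bottom-right: full red
print(image[(2, 2)])  # top-right: full red plus full green (yellow)
```

Because no call to `program` reads another pixel's result, the two loops could run in any order, or all at once.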
Arithmetic
You can do arithmetic using the usual RISC-like instructions "ADD",
"SUB", and "MUL", all of which take 1 clock cycle. For example, consider the program
ADD out,in,0.5; # out=in+0.5
This results in the output getting brighter, since we've added 0.5 to
all the colors. Note that anything less than 0 counts as black,
and anything more than 1 counts as white. This means if you screw
up the range of your output values, you'll get a pure black or white
screen!
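The clamping behavior can be sketched in Python as follows. This is an emulation for illustration only: `add` and `clamp_to_display` are made-up helper names, and the input vector is an arbitrary example value.

```python
def add(a, b):
    """Component-wise four-float add, like the ADD instruction."""
    return tuple(x + y for x, y in zip(a, b))

def clamp_to_display(color):
    """Values below 0 display as 0 (black); values above 1 display as 1."""
    return tuple(min(1.0, max(0.0, c)) for c in color)

in_vec = (0.75, 0.25, 0.0, 1.0)
half = (0.5, 0.5, 0.5, 0.5)            # the scalar 0.5 applies to all four components
out = add(in_vec, half)                # ADD out,in,0.5;
shown = clamp_to_display(out)
print(out)    # raw result, some components above 1
print(shown)  # what actually reaches the screen
```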
There's also a 3-input "MAD" multiply-add instruction. There's no
"DIV", but there is a scalar "RCP" reciprocal estimate and "RSQ" reciprocal-square-root (returns 1.0/sqrt(x)). See the ARB_fragment_program cheat sheet for the complete (and long) list of instructions.
Writemasking
You can also do math where you only modify a few of the output
values. This is called "writemasking". For example, to set
the x component of "in" to 0.0, you can do:
MOV in.x,0.0; # Set input component to 0
MOV out,in;
All the arithmetic instructions work with writemasks, so you can add 0.5 to just the x coordinate using:
ADD in.x,in.x,0.5; # in.x=in.x+0.5
MOV out,in; # out=in
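One way to picture a writemask is that the instruction computes a full four-float result, and the mask then decides which components of the destination actually get overwritten. The Python sketch below emulates that; `write_masked` is a made-up helper name, and `in_vec` stands in for "in" (which is a reserved word in Python).

```python
def write_masked(dest, value, mask):
    """Return dest with only the components named in mask taken from value."""
    return tuple(v if name in mask else d
                 for name, d, v in zip("xyzw", dest, value))

in_vec = (0.25, 0.5, 0.0, 1.0)

# ADD in.x,in.x,0.5;  the source "in.x" supplies x to every lane,
# and the destination mask ".x" keeps only the first lane of the result.
source = (in_vec[0],) * 4
result = tuple(c + 0.5 for c in source)
in_vec = write_masked(in_vec, result, "x")

out = in_vec                           # MOV out,in;
print(out)
```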
Swizzling
You can rearrange the set of input components to any operation.
This is called "swizzling", where you tack on a 4-character string to
describe how to rearrange the input floats. Each position in the string
corresponds to an output float, and the letter (one of the letters
"xyzw" or "rgba") in the position tells which component of the input to
read from. For example, to copy the x coordinate
of the input to all components of the output, you'd do:
MOV out,in.xxxx;
To interchange the x and y coordinates, you'd do:
MOV out,in.yxzw;
You can also use the color names, like "rgba" in swizzles--they're equivalent to the "xyzw" coordinate names.
The swizzle ".xyzw" or ".rgba" doesn't do anything--it rearranges the components into the same order they started in!
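The swizzle rules above are easy to emulate in a few lines of Python. This is just an illustrative sketch with a made-up `swizzle` helper, not how the hardware implements it.

```python
# Each letter names a source component; "rgba" are aliases for "xyzw".
COMPONENT_INDEX = {"x": 0, "y": 1, "z": 2, "w": 3,
                   "r": 0, "g": 1, "b": 2, "a": 3}

def swizzle(vec, pattern):
    """Build a new four-float vector by picking one source component per letter."""
    return tuple(vec[COMPONENT_INDEX[letter]] for letter in pattern)

v = (0.1, 0.2, 0.3, 0.4)
print(swizzle(v, "xxxx"))  # x copied to every component
print(swizzle(v, "yxzw"))  # x and y interchanged
print(swizzle(v, "rgba"))  # identity: same order as the input
```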
Texturing
The pixel program equivalent of a memory read is a "texture
lookup". Textures are just 2D arrays containing pixels. You
give the texture a coordinate, and it returns you the color at that
position in the texture. The whole array runs from 0 to 1 in both
x and y, so the "texture coordinate" (0.5,0.5) is always the center of
the texture image. This matches up nicely with the "in"
coordinates, so you can see what's in the 1st texture with one
instruction:
TEX out,in,texture[1],2D; # out = texture[1] at coordinate "in".
I can also do postprocessing on the result of the texture lookup, for example by shifting colors around:
(Executable NetRun Link)
TEX out,in,texture[1],2D; # out = texture[1] at coordinate "in".
ADD out.r,out,0.8; # out.r=out.r+0.8
This makes the output image redder, by adding 0.8 to the red component.
I can also change the input coordinates around any way I want, for
example by raising the input x coordinate to the 4th power, which means
I read from texture coordinate (0.0625,0.5) (near the texture's left edge)
when I give the input coordinate (0.5,0.5) (the screen's center) like
this:
(Executable NetRun Link)
MUL in.x,in,in; # Square x once
MUL in.x,in,in; # Square x again
TEX out,in,texture[1],2D; # load texture at modified coordinate
All 3D effects (perspective, lighting, parallax, shading) are created
as simple mathematical transformations of the input coordinates and
output colors!
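The coordinate-warping example can be emulated on the CPU to see which texel each pixel ends up reading. The sketch below makes several assumptions: `tex2d` is a made-up nearest-neighbor lookup that clamps at the edges (a real card typically filters between texels and has configurable wrap modes), and the 8x8 texture simply stores each texel's column index so the result is easy to read.

```python
def tex2d(texture, coord):
    """Nearest-neighbor lookup; coord runs from 0 to 1 in x and y."""
    height, width = len(texture), len(texture[0])
    col = min(width - 1, max(0, int(coord[0] * width)))
    row = min(height - 1, max(0, int(coord[1] * height)))
    return texture[row][col]

def warp(in_vec):
    """The two MUL instructions: square x, then square it again (x to the 4th power)."""
    x, y = in_vec[0], in_vec[1]
    x = x * x                          # MUL in.x,in,in;
    x = x * x                          # MUL in.x,in,in;
    return (x, y)

# Each texel holds its own column index.
texture = [[col for col in range(8)] for row in range(8)]

center = (0.5, 0.5)
warped = warp(center)
print(warped)                          # coordinate actually used for the lookup
print(tex2d(texture, center))          # column read without the warp
print(tex2d(texture, warped))          # column read with the warp
```

With the warp, the screen's center reads from the leftmost texture column, so the left part of the texture is stretched across most of the screen.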
In general, textures can be used almost anywhere you'd use an array in a sequential program:
- Input data
- Table lookups (e.g., fast nonlinear transformations)
- Temporary storage (results of previous computations)
The one thing you can't
do in a pixel program is write to anything outside the pixel!
This means there's no easy way to, say, sum up all the elements of an
array, or build a histogram of the elements, or even do "ripple carry"
multiprecision addition. Why not? Because variable writes
would create dependencies between pixels, and destroy parallelism!
Links