GPU & CPU Performance

CS 301 Lecture, Dr. Lawlor

GPU: Texture Cache Effects

Here's a little program that reads a texture, tex1, at five slightly shifted locations; visually this produces a blurred image.
vec4 sum=vec4(0.0);
for (int i=0;i<5;i++) /* five samples, each shifted by 0.01 in texture coordinates */
    sum+=tex2D(tex1,texcoords*1.0+i*0.01);
gl_FragColor=sum*(1.0/5); /* average of the five samples */

(Try this in NetRun now!)

At the given default scale factor of 1.0, this program takes 0.1 ns per pixel (on NetRun's fast GeForce GTX 280 card).

If we zoom in, to a scale factor of 0.5 or 0.1, the program takes exactly the same time: adjacent onscreen pixels are still reading nearby texture pixels.

But if we zoom out, to a scale factor of 2.0 like this, then adjacent onscreen pixels read fairly distant texture pixels, and suddenly the program slows down to over 0.23 ns per pixel:

...
sum+=tex2D(tex1,texcoords*2.0+i*0.01);
...

(Try this in NetRun now!)

Zooming out farther slows the access down even more, up to 3 ns per pixel at a scale factor of 16.0.  That's a 30-fold slowdown!

The reason for this is the "texture cache": the GPU keeps recently fetched texture data in a small on-chip cache, so a read that lands near a previous read is nearly free, while a read far from anything recently fetched has to go all the way out to video memory.

When we're zoomed way out, adjacent pixels onscreen read from distant locations in the texture (texcoords*16.0 means there are 16 texture pixels between each read!), so almost every read is a cache miss.

The bottom line: for high performance, read textures in a contiguous fashion (nearby pixels), not random-access (distant pixels).
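
The next section measures this directly on the CPU; as a quick preview of the same locality rule, here's a small self-contained C++ sketch (the image size and the std::chrono timing are my own choices, not from the lecture) that reads one big row-major "image" twice: once along rows, so successive reads touch adjacent bytes, and once along columns, so successive reads land thousands of bytes apart.

#include <cstdio>
#include <vector>
#include <chrono>

/* Sum a W x H row-major "image" two ways:
   along rows (contiguous reads, like sampling nearby texels) and
   along columns (each read lands W bytes past the last, like distant texels). */
enum { W=4096, H=4096 };

int main(void) {
    std::vector<unsigned char> img(W*H,1);
    long sum=0;
    auto t0=std::chrono::steady_clock::now();
    for (int y=0;y<H;y++)          /* row-major walk: contiguous */
        for (int x=0;x<W;x++)
            sum+=img[y*W+x];
    auto t1=std::chrono::steady_clock::now();
    for (int x=0;x<W;x++)          /* column-major walk: 4096-byte stride */
        for (int y=0;y<H;y++)
            sum+=img[y*W+x];
    auto t2=std::chrono::steady_clock::now();
    double n=double(W)*H;
    std::printf("row-major:    %.3f ns/access\n",
        std::chrono::duration<double,std::nano>(t1-t0).count()/n);
    std::printf("column-major: %.3f ns/access\n",
        std::chrono::duration<double,std::nano>(t2-t1).count()/n);
    return (int)(sum&0xff); /* keep the loops from being optimized away */
}

On a typical machine the column-major walk is several times slower, even though it reads exactly the same bytes; only the order changed.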

CPU Cache Effects

The exact same sort of cache hardware exists on the CPU, and similarly, for the best performance you need good "access locality": once you've accessed memory somewhere, it's substantially faster to access that same thing, or things very nearby, than to jump way far away.

For example, you can measure the access time versus "stride" (the distance in bytes between successive accesses) with a program like this:

int rep=1000000; /* number of times to repeat memory accesses */
enum {max_mem=32*1024*1024}; /* power of two, so max_mem-1 works as a mask */
char buf[max_mem]; /* big memory buffer */
int stride=0; /* distance in bytes between successive accesses (set by the timing harness) */

int time_stride(void) {
    unsigned int i,sum=0;
    unsigned int loc=0,mask=max_mem-1,jumpBy=stride;
    for (i=rep;i>0;i--) { /* jump around in buffer, incrementing as we go */
        sum+=buf[loc&mask]++;
        loc+=jumpBy;
    }
    return sum;
}

(Try this in NetRun now!)

On my quad-core Intel machine, this returns:

Stride 1 takes 1.4441 ns/access
Stride 2 takes 1.46815 ns/access
Stride 4 takes 2.1319 ns/access
Stride 8 takes 4.33539 ns/access
Stride 15 takes 8.14121 ns/access
Stride 28 takes 15.114 ns/access
Stride 53 takes 28.2713 ns/access
Stride 99 takes 37.4543 ns/access
Stride 185 takes 38.7671 ns/access

It's nearly thirty times more expensive to access memory locations that are hundreds of bytes apart than to access immediately adjacent locations!
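
NetRun supplies the timing loop and sweeps "stride" for you.  If you want to reproduce numbers like these outside NetRun, a minimal stand-alone harness might look like the sketch below (the std::chrono timing is my own choice; the inner loop is the same as time_stride above, and the stride list just mirrors the table).

#include <cstdio>
#include <chrono>

enum { max_mem=32*1024*1024 }; /* power of two, so max_mem-1 works as a mask */
static char buf[max_mem];      /* big memory buffer */
int rep=1000000;               /* number of accesses per measurement */

unsigned int time_stride(int stride) { /* same inner loop as above */
    unsigned int i,sum=0;
    unsigned int loc=0,mask=max_mem-1,jumpBy=stride;
    for (i=rep;i>0;i--) { /* jump around in buffer, incrementing as we go */
        sum+=buf[loc&mask]++;
        loc+=jumpBy;
    }
    return sum;
}

int main(void) {
    int strides[]={1,2,4,8,15,28,53,99,185};
    for (int s : strides) {
        auto t0=std::chrono::steady_clock::now();
        volatile unsigned int keep=time_stride(s); /* volatile: keep the call live */
        auto t1=std::chrono::steady_clock::now();
        double ns=std::chrono::duration<double,std::nano>(t1-t0).count()/rep;
        std::printf("Stride %d takes %g ns/access\n",s,ns);
        (void)keep;
    }
    return 0;
}

Each stride is timed once here; averaging several runs per stride gives steadier numbers.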

GPU: Branch Divergence Penalty

On the CPU, the branch performance model is really simple.  If you've got code like:
    if (something) A(); else B();
then the code takes either time A or time B, depending on which way the branch goes, but never both.

On the GPU, depending on how other adjacent pixels take the branch, this code could take time A+B.

GPU branches work like CPU branches (one side or the other) as long as nearby regions of the screen branch the same way ("coherent branches", sorta the branch equivalent of access locality).  For example, each call to "cool_colors" below takes about 0.1 ns per pixel, but because we branch in big onscreen blocks here, the whole shader takes a reasonable 0.11 ns/pixel overall:

vec4 cool_colors(vec2 where) { /* <- takes about 0.1 ns per pixel to execute */
    return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}

void main(void) {
    float doit=fract(texcoords.x*1.0); /* varies slowly, so big onscreen blocks branch the same way */
    if (doit<0.3)
        gl_FragColor=cool_colors(texcoords).wxyz*0.5;
    else if (doit<0.7)
        gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
    else
        gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}

(Try this in NetRun now!)

If I change this so "doit" varies much faster onscreen, then adjacent pixels will be taking different branches.  The GPU implements this like SSE: it figures out the answer for every branch, then uses bitwise operations to mask off the untaken branches.  So now the hardware actually has to run "cool_colors" three times for every pixel (once per branch), and our time goes up to 0.285 ns/pixel!

vec4 cool_colors(vec2 where) { /* <- takes about 0.1 ns per pixel to execute */
    return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}

void main(void) {
    float doit=fract(texcoords.x*100.0); /* now varies rapidly, so adjacent pixels branch differently */
    if (doit<0.3)
        gl_FragColor=cool_colors(texcoords).wxyz*0.5;
    else if (doit<0.7)
        gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
    else
        gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}

(Try this in NetRun now!)

Internally, even a graphics card with a thousand "shader cores" really has only a few dozen execution cores, each running an SSE-like multiple-floats-per-instruction program.  Each execution core is responsible for a small contiguous block of pixels, so if all those pixels branch together, the core can skip the entire "else" case.  If some of a core's pixels branch one way, and some branch the other way, the core has to take both branches, and the program slows down appreciably.
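
To make the "compute both sides, then mask" idea concrete, here's a scalar C++ sketch (my own illustration of the technique, not NVIDIA's actual hardware path): both results get computed for every input, and a bitwise mask selects the one the source code asked for, exactly as the if/else would have.

#include <cstdio>
#include <cstring>

/* What the shader source says: pick one result or the other. */
float branch_version(float doit,float a,float b) {
    if (doit<0.3f) return a; else return b;
}

/* What a diverged SIMD/SIMT core effectively does: compute BOTH results,
   build an all-ones or all-zeros mask from the condition, and bitwise-select.
   (Real hardware does this across a whole warp of lanes at once.) */
float masked_version(float doit,float a,float b) {
    unsigned int mask = (doit<0.3f) ? 0xFFFFFFFFu : 0x00000000u;
    unsigned int ua,ub,out;
    std::memcpy(&ua,&a,4); std::memcpy(&ub,&b,4);
    out = (ua&mask) | (ub&~mask);  /* like SSE's andps/andnps/orps sequence */
    float result; std::memcpy(&result,&out,4);
    return result;
}

int main(void) {
    for (float doit=0.0f; doit<1.0f; doit+=0.25f)
        std::printf("doit=%.2f  if/else=%g  masked=%g\n",
            doit, branch_version(doit,1.0f,2.0f), masked_version(doit,1.0f,2.0f));
    return 0;
}

The point is the cost: masked_version always pays for both a and b.  That's harmless here, but when each side is a call like cool_colors, a diverged warp pays for every branch it contains.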

NVIDIA calls a group of threads that execute (and hence must branch) together a "warp", and the overall GPU architecture "SIMT": Single Instruction, Multiple Thread.  Current NVIDIA hardware runs 32 threads (floats) per warp.

ATI calls the same grouping a "wavefront", typically 64 threads (floats) per wavefront.