vec4 sum=vec4(0.0);
for (int i=0;i<5;i++)
  sum+=texture2D(tex1,texcoords*1.0+float(i)*0.01);
gl_FragColor=sum*(1.0/5.0);
At the default scale factor of 1.0, this program takes 0.1 ns per pixel (on NetRun's fast GeForce GTX 280 card).
If we zoom in, to a scale factor of 0.5 or 0.1, the program takes
exactly the same time. We're still accessing nearby pixels.
But if we zoom out, to a scale factor of 2.0 like this, adjacent
onscreen pixels fetch fairly distant texture pixels, and
suddenly the program slows down to over 0.23 ns per pixel.
...
sum+=texture2D(tex1,texcoords*2.0+float(i)*0.01);
...
Zooming out farther slows the access down even more, up to 3ns per pixel with a scale of 16. That's a 30-fold slowdown!
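To put that in memory terms: at a scale of 16, two horizontally adjacent onscreen pixels sample texels roughly 16 apart, so with (say) 4-byte RGBA texels each fetch lands on the order of 64 bytes away from the previous one along a row, instead of hitting the texel right next door.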
The reason for this is the "texture cache": nearby texture fetches almost always hit data that's already in the cache, while widely separated fetches keep missing it. The same effect is easy to demonstrate on the CPU by striding through a big buffer of memory:
int rep=1000000; /* Number of times to repeat memory accesses */
enum {max_mem=32*1024*1024};
char buf[max_mem]; /* big memory buffer */
int stride=0; /* distance between successive accesses, in bytes; set before each timing run */
int time_stride(void) {
unsigned int i,sum=0;
unsigned int loc=0,mask=max_mem-1,jumpBy=stride;
for (i=rep;i>0;i--) { /* jump around in buffer, incrementing as we go */
sum+=buf[loc&mask]++;
loc+=jumpBy;
}
return sum;
}
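NetRun's harness calls time_stride and reports the time per access automatically; to reproduce it outside NetRun, a minimal C++ driver along these lines, appended to the code above, should do. (The stride list and the use of std::chrono here are my own choices, not part of the original benchmark.)
#include <chrono>
#include <cstdio>

int main(void) {
	for (int s : {1, 2, 4, 8, 15, 28, 53, 99, 185}) { /* strides to test, in bytes */
		stride = s;
		auto start = std::chrono::high_resolution_clock::now();
		volatile int keep = time_stride(); /* volatile: keep the call from being optimized away */
		auto stop  = std::chrono::high_resolution_clock::now();
		double ns = std::chrono::duration<double, std::nano>(stop - start).count();
		printf("Stride %d takes %g ns/access\n", s, ns / rep);
		(void)keep;
	}
	return 0;
}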
On my quad-core Intel machine, this returns:
Stride 1 takes 1.4441 ns/access
Stride 2 takes 1.46815 ns/access
Stride 4 takes 2.1319 ns/access
Stride 8 takes 4.33539 ns/access
Stride 15 takes 8.14121 ns/access
Stride 28 takes 15.114 ns/access
Stride 53 takes 28.2713 ns/access
Stride 99 takes 37.4543 ns/access
Stride 185 takes 38.7671 ns/access
It's nearly thirty times more expensive to access memory at addresses hundreds of bytes apart than at immediately adjacent addresses!
On the CPU, the branch performance model is really simple. If you've got code like:
if (something) A(); else B();
then if the branch is taken, it takes time A; otherwise it takes time B.
On the GPU, depending on which way adjacent pixels take the branch, the same code could take time A+B.
GPU branches work like CPU branches (one or the other) as long as
nearby regions of the screen branch the same way ("coherent branches",
sorta the branch equivalent of access locality). For example,
each call to "cool_colors" takes about 0.1ns per pixel, but because we
branch in big onscreen blocks here, this takes a reasonable
0.11ns/pixel overall:
vec4 cool_colors(vec2 where) { /* <- takes about 0.1ns per pixel to execute */
return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}
void main(void) {
float doit=fract(texcoords.x*1.0);
if (doit<0.3)
gl_FragColor=cool_colors(texcoords).wxyz*0.5;
else if (doit<0.7)
gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
else
gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}
If I change this so "doit" varies much faster onscreen, then
adjacent pixels will be taking different branches. The GPU
implements this like SSE: you figure out the answer for both branches,
then use bitwise operations to mask off the untaken branch. So
now the hardware actually has to run "cool_colors" three times for
every pixel (one per branch), and our time goes up to 0.285ns/pixel!
vec4 cool_colors(vec2 where) { /* <- takes about 0.1ns per pixel to execute */
return vec4(sin(where.x),log(cos(where.y)),exp(where.x),pow(where.x,where.y));
}
void main(void) {
float doit=fract(texcoords.x*100.0);
if (doit<0.3)
gl_FragColor=cool_colors(texcoords).wxyz*0.5;
else if (doit<0.7)
gl_FragColor=cool_colors(1.0-texcoords)*4.5; /* purple */
else
gl_FragColor=cool_colors(1.0-texcoords)*0.15; /* dark */
}
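To see what "mask off the untaken branch" means, here is a rough CPU-side sketch using SSE intrinsics: both branch bodies are evaluated for all four floats, and a comparison mask picks which result each lane keeps. This is an illustration of the technique, not the GPU's actual instruction stream; the 0.3/0.5/4.5 constants just echo the shader above.
#include <xmmintrin.h> /* SSE intrinsics */
#include <cstdio>

/* Evaluate "if (x<0.3) r=x*0.5; else r=x*4.5;" for four floats at once,
   the way a divergent warp (or SSE code) has to: compute both sides,
   then bitwise-blend using a comparison mask. */
__m128 branch_both_sides(__m128 x) {
	__m128 mask     = _mm_cmplt_ps(x, _mm_set1_ps(0.3f)); /* all-ones bits where x<0.3 */
	__m128 ifSide   = _mm_mul_ps(x, _mm_set1_ps(0.5f));   /* "if" branch, computed for every lane */
	__m128 elseSide = _mm_mul_ps(x, _mm_set1_ps(4.5f));   /* "else" branch, also computed for every lane */
	return _mm_or_ps(_mm_and_ps(mask, ifSide),            /* keep ifSide where the mask is set... */
	                 _mm_andnot_ps(mask, elseSide));      /* ...and elseSide everywhere else */
}

int main(void) {
	float in[4] = {0.1f, 0.2f, 0.5f, 0.9f}, out[4];
	_mm_storeu_ps(out, branch_both_sides(_mm_loadu_ps(in)));
	for (int i = 0; i < 4; i++) printf("%g -> %g\n", in[i], out[i]);
	return 0;
}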
Internally, even a graphics card with a thousand "shader cores"
really has only a few dozen execution cores, running an SSE-like
multiple-floats-per-instruction program. Each execution core is
responsible for a small contiguous block of pixels, so if all those
pixels branch together, the core can skip the entire "else" case.
If some of a core's pixels branch one way, and some branch the other way, the core has to take both branches, and the program slows down appreciably.
NVIDIA
calls a group of threads that execute in lockstep (and so must branch
the same way to run at full speed) a "warp", and the
overall GPU architecture "SIMT": Single Instruction, Multiple
Thread. Current NVIDIA machines have 32 threads (floats) per warp.
ATI calls the equivalent group a "wavefront", typically 64 floats per wavefront.
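As a toy cost model (my own sketch, not vendor documentation): a warp or wavefront whose threads all branch the same way pays only for the side it takes, while a divergent one pays for both sides.
/* Toy per-warp cost for a two-way branch: coherent warps pay for one side,
   divergent warps pay for both. (Illustration only.) */
double warp_branch_cost(int lanesTakingIf, int warpSize,
                        double costIf, double costElse) {
	if (lanesTakingIf == warpSize) return costIf;   /* everyone took the "if": skip the "else" */
	if (lanesTakingIf == 0)        return costElse; /* everyone took the "else": skip the "if" */
	return costIf + costElse;                       /* mixed: hardware runs both sides */
}
That matches the measurements above: with cool_colors at roughly 0.1 ns per call, the coherent version runs at about 0.11 ns/pixel, while the divergent version runs all three branches and lands near 0.285 ns/pixel.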