gl_FragColor=vec4(1,0,0,0);

Note there's no loop here, but this program by definition runs on every pixel. In a bigger graphics program, you can control which pixels get drawn using some polygon geometry, and the program runs on every pixel touched by that geometry.
gl_FragColor=vec4(texcoords.x,texcoords.y,0,0);

0 means black, and 1 means a fully saturated color (all channels saturated means white). The X coordinate of the onscreen location becomes the red component of the output color--note how the image gets redder from left to right. The Y coordinate becomes the green component--note how the image gets greener from bottom to top. Red and green add up to yellow, so the top-right corner (where X=Y=1) is yellow.

We can use an if statement to color the two halves of the screen differently:
if (texcoords.x<0.5) /* left half of screen? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */

We can make a little circle in the middle of the screen red like this:
float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */

We can make a whole grid of circles across the screen like this:
vec2 v=fract(texcoords*10.0); /* per-cell coordinates in a 10x10 grid */
float x=(v.x-0.5)*1.4, y=v.y-0.5;
float radius=sqrt(x*x+y*y);
if (radius<0.3) /* inside the circle? */
gl_FragColor=vec4(1,0,0,0); /* red */
else
gl_FragColor=vec4(0,0,1,0); /* blue */

We can make smooth transitions between the colors with a blending operation; this is the "fuzzy logic" equivalent of our mask-based if-then-else implementation:
float x=texcoords.x-0.5, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
greeny=greeny*greeny; // sin-squared
vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=greeny*green+(1.0-greeny)*notgreen;
Graphics cards also support loops, like this loop over different sine-wave sources:
float finalgreen=0.0;
for (int source=0;source<2;source++)
{
float x=texcoords.x-0.4-0.1*source, y=texcoords.y-0.5;
float radius=sqrt(x*x+y*y);
float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
greeny=greeny*greeny; // sin-squared
finalgreen+=greeny*0.33;
}
vec4 green=vec4(0,texcoords.y,0,0);
vec4 notgreen=vec4(1.0,0.0,0.6,0);
gl_FragColor=finalgreen*green+(1.0-finalgreen)*notgreen;
Note the performance of this on the graphics card (a GeForce GTX 280) is 0.1 nanoseconds per pixel. (Essentially) the same code run on the CPU takes 200 nanoseconds per pixel, which is, er, two *thousand* times slower. Ouch! Here's that CPU version:
static float texcoords_x=0.1, texcoords_y=0.2; texcoords_x+=0.1; /* CPU stand-ins for GLSL's texcoords */
float finalgreen=0.0;
for (int source=0;source<2;source++)
{
float x=texcoords_x-0.4-0.1*source, y=texcoords_y-0.5;
float radius=sqrt(x*x+y*y);
float greeny=sin(30.0*radius*3.1415926); // fract(10.0*radius);
greeny=greeny*greeny; // sin-squared
finalgreen+=greeny*0.33;
}
float green=texcoords_y; // vec4(0,texcoords.y,0,0);
float notgreen=100.0; // vec4(1.0,0.0,0.6,0);
return finalgreen*green+(1.0-finalgreen)*notgreen;

Another example: here's the Mandelbrot set fractal rendered on the graphics card:
vec2 c=vec2(2.0)*(texcoords-0.5)+vec2(0.0,0.0); /* constant c, varies onscreen*/
vec2 z=c;
/* Mandelbrot iteration: keep iterating until z gets big */
for (int i=0;i<15;i++) {
/* break once |z|^2 >= 4.0, i.e., the length of z is >= 2.0 */
if (z.r*z.r+z.g*z.g>=4.0) break;
/* z = z^2 + c; (where z and c are complex numbers) */
z=vec2(
z.r*z.r-z.g*z.g,
2.0*z.r*z.g
)+c;
}
gl_FragColor=fract(vec4(z.r,z.g,0.25*length(z),0));
Here's a greenscreen oscilloscope-style trace:
float x=texcoords.x*20.0-10.0;
float y=texcoords.y*3.0-1.5;
float hit=1.0-10.0*abs(y-sin(x)/x);
gl_FragColor=vec4(0.0,hit,0.0,0.0); /* R, G, B, A */

The bottom line is that GLSL is quite expressive, but it's very graphics-centric. Making things purple for debugging is really a pain!

Moving on to CUDA, here's a simple program that fills an array of floats on the GPU:
#include <iostream>
#include <cuda.h>
/* error checking macro */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}
/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
vals[i]=i+param;
}
int main(int argc,char *argv[]) {
int w=4, h=4; /* number of blocks, threads per block */
int n=w*h; /* total number of floats */
float *vals; /* device array of n values */
check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space
set_array<<<w,h>>>(vals,0.1234); /* Initialize the space on the GPU */
/* Copy a few elements back to CPU for printing */
for (int i=0;i<n;i+=3) {
float f=-999.0; /* CPU copy of value */
check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
std::cout<<"vals["<<i<<"] = "<<f<<"\n";
}
return 0;
}

Basically, we: allocate space on the GPU with cudaMalloc, run a __global__ kernel to initialize that space, and cudaMemcpy a few elements back to the CPU for printing.
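
One detail worth noting: the print loop above pays a full device-to-host round trip for every single float, and each cudaMemcpy has fixed overhead. Here's a minimal sketch, reusing the vals, n, and check names from the program above, that pulls the whole array back in one call instead:

/* Copy all n floats back at once, instead of one cudaMemcpy per element */
float cpu_vals[16]; /* big enough for n==16 above */
check(cudaMemcpy(cpu_vals,vals,n*sizeof(float),cudaMemcpyDeviceToHost));
for (int i=0;i<n;i++)
std::cout<<"vals["<<i<<"] = "<<cpu_vals[i]<<"\n";
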
Several interesting architectural details crop up in the CUDA documentation:
CUDA Concept | Purpose | How big? | CPU Equivalent
thread | One single execution of a kernel. This is mostly conceptual, since the hardware operates on warps of threads. | 1 | function call
warp (of threads) | A group of 32 threads that all take the same branches. The hardware does a good job with predication, so warps aren't "in your face" like with SSE instructions. | 32 "threads" | SIMD unit ("floats")
block (of threads) | A group of a few hundred threads that have access to the same "__shared__" memory. The block size is specified in software, and though limited by hardware to 512 or 1024 threads maximum, typically 256 threads per block performs best; expect serious slowdowns for non-power-of-two sizes, or fewer than 100 or so threads per block. The PTX manual calls blocks "CTAs" (Cooperative Thread Arrays). | About 256 threads | One multicore thread (blocks run SMP/SMT style)
kernel | A set of blocks of threads, all running one __global__ function. | About a million threads | Parallel loop
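
To make the "block" row concrete, here's a minimal sketch (not from the program above; reverse_chunks is a made-up name) where the threads of one block cooperate through __shared__ memory, with __syncthreads() as the barrier that keeps them in step:

/* GPU code: each 256-thread block reverses its own chunk of the array */
__global__ void reverse_chunks(float *vals) {
__shared__ float buf[256]; /* one copy per block, visible to all its threads */
int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
buf[threadIdx.x]=vals[i]; /* each thread loads one element */
__syncthreads(); /* barrier: wait until the whole block has loaded */
vals[i]=buf[blockDim.x-1-threadIdx.x]; /* then read a neighbor's element */
}
/* launch with exactly 256 threads per block: reverse_chunks<<<nBlocks,256>>>(vals); */
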
Notice that despite what I think is *intentionally* obfuscated terminology, deep down there's a whole lot of similarity between GPU and CPU architectures; the GPU is just focused on wider problems. The GPU's SIMD units are much wider: 32 floats at a time, versus 8 at a time for AVX. A typical GPU has perhaps 8 multiprocessors (each called an SM or SMX), versus about 4 cores for a CPU. The big difference is that each GPU multiprocessor is deeply hyperthreaded, able to keep up to a hundred or so threads in flight, typically limited by the physical register count--nowadays an absurd 65,536 registers per multiprocessor. So you get really ridiculously good GPU performance if your code can fill all those threads, but if you've got single-threaded work, a GPU actually gives ridiculously bad performance!
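
In practice, "filling all the threads" mostly means launching enough blocks. Here's a sketch of the standard recipe (scale_array and scale_on_gpu are hypothetical names) for covering n elements with 256-thread blocks, rounding the block count up and guarding the tail:

/* GPU code: scale n floats, guarding the final partial block */
__global__ void scale_array(float *vals,int n,float s) {
int i=threadIdx.x+blockDim.x*blockIdx.x; /* find my index */
if (i<n) vals[i]*=s; /* threads past the end do nothing */
}
/* CPU code: vals is a device pointer from cudaMalloc */
void scale_on_gpu(float *vals,int n) {
int threadsPerBlock=256; /* the sweet spot from the table above */
int nBlocks=(n+threadsPerBlock-1)/threadsPerBlock; /* round up */
scale_array<<<nBlocks,threadsPerBlock>>>(vals,n,2.0f);
}
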