CUDA PTX: GPU assembly language
CS 641 Lecture, Dr. Lawlor
CUDA's underlying quasi-assembly language is called PTX (Parallel Thread Execution). The NVIDIA PTX documentation
is the official source, but reading the output from NetRun's
"Disassemble" command is pretty illuminating too. If you actually
want to write raw PTX by hand for some reason, Kenneth details the commands to use.
There are also some good pictures, details, and performance numbers in this 2009 IEEE paper (the full-text PDF is only accessible from on-campus, so download it there).
For example, this CUDA code:
#include <iostream>
#include <cuda.h>

/* error checking */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x+blockDim.x*blockIdx.x;
    vals[i]=param;
}

int main(int argc,char *argv[]) {
    int w=128, h=2; /* number of threads per block, number of blocks */
    int n=w*h; /* total number of floats */
    float *vals; /* device array of n values */
    check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space
    set_array<<<h,w>>>(vals,1.234); /* Initialize the space on the GPU */

    /* Copy selected elements back to CPU for printing */
    for (int i=0;i<n;i+=5) {
        float f=-999.0; /* CPU copy of value */
        check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
        std::cout<<"vals["<<i<<"] = "<<f<<"\n";
    }
    return 0;
}
(Try this in NetRun now!)
Results in this PTX assembly:
.entry _Z9set_arrayPff (
        .param .u32 __cudaparm__Z9set_arrayPff_vals,
        .param .f32 __cudaparm__Z9set_arrayPff_param)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<8>;
    .reg .f32 %f<3>;
    .loc 28 9 0
    //  8  /* GPU code: set an array to a value */
    //  9  __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
    .loc 28 11 0
    // 10    int i=threadIdx.x+blockDim.x*blockIdx.x;
    // 11    vals[i]=param;
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
    cvt.u32.u16 %r2, %tid.x;
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r3, %rh1, %rh2;
    add.u32 %r4, %r2, %r3;
    mul.lo.u32 %r5, %r4, 4;
    add.u32 %r6, %r1, %r5;
    st.global.f32 [%r6+0], %f1;
    .loc 28 12 0
    // 12  }
    exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Briefly:
- CUDA's "threadIdx.x" is PTX "%tid.x".
- CUDA's "blockIdx.x" is PTX "%ctaid.x".
- CUDA's "blockDim.x" is PTX "%ntid.x".
- The GPU has signed, unsigned, and float data types in a variety of
sizes. You specify the data type of each instruction with a suffix
like ".f32" (for floating point, 32-bit). Bitwise operations take an untyped bits data type like ".b32". There
are also ".v2" and ".v4" suffixes for vectors of length 2 or 4 (but not
3!), but these are only used for moves, loads, and stores.
- .reg creates a register with the given type and name. <n> creates registers numbered 0 through n-1. Thus ".reg
.f32 %f<3>;" creates %f0, %f1, and %f2, all of them 32-bit
floats. The compiler seems to generate more register names than
are precisely needed, which implies PTX will eliminate unused register
names during hardware register allocation.
- Floating-point arithmetic operations have an optional ".sat" suffix that clamps the result to lie between 0.0 and 1.0 (inclusive). The short hand-written fragment after this list illustrates these suffixes.
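For instance, this hand-written fragment (not NetRun compiler output; the register names and the address offset are made up for illustration) shows the type, vector, bitwise, and saturation suffixes together:
    .reg .f32 %f<5>;    // five 32-bit float registers, %f0 through %f4
    .reg .u32 %r<3>;    // three 32-bit unsigned registers, used here as values and addresses
    ld.global.v4.f32 {%f0,%f1,%f2,%f3}, [%r0];  // vector load of four floats
    add.sat.f32 %f4, %f0, %f1;                  // float add, result clamped to [0.0, 1.0]
    and.b32 %r2, %r0, %r1;                      // bitwise AND on untyped 32-bit values
    st.global.f32 [%r0+16], %f4;                // ordinary 32-bit float store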
Several interesting architectural details crop up in the CUDA documentation:
- The execution model is very unusual, especially compared to typical "what one thread should do" machine code.
- A "thread" means one single execution of a kernel. This is
mostly conceptual, since the hardware operates on warps of threads.
- A "warp" is a group of 32 threads that all take the same
branches. A warp is really a SIMD group: a bunch of floats sharing one
instruction decoder. The hardware does a good job with predication, so
warps aren't "in your face" like with SSE.
- A "block" is a group of a few hundred threads that have access to
the same "__shared__" memory. The block size is specified in software,
but limited by hardware to 512 or 1024 threads maximum. More threads
per block is generally better, with serious slowdowns for less than 100
or so threads per block. The PTX manual calls blocks "CTAs" (Cooperate
Thread Arrays).
- The entire kernel consists of a set of blocks of threads.
- The memory model is also highly segmented and specialized, unlike the flat memory of modern CPUs.
- "registers" are unique to that thread. Early 8000-series
cards had 8192 registers available; GTX 200 series had 16K registers;
and the new Fermi GTX 400s have 32K registers. Registers are
divided up among threads, so the fewer registers each thread uses, the
more threads the machine can keep in flight, hiding latency.
- "shared" memory is declared with __shared__, and can be read or
written by all the threads in a block. This is handy for
neighborhood
communication where global memory would be too slow. There is at
least 16KB of shared memory available per thread block; Fermi cards can
expose up to 48KB with special config options. Use "__syncthreads()__" to synchronize shared writes across a whole thread block.
- "global" memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls go to global memory, which is considered "slow" at only 100GB/second!
Older hardware had very strict rules on "global memory coalescing", but
luckily newer (Fermi-era) hardware just prefers locality, if you can
manage it.
- "param" is the PTX abstraction around the parameter-passing
protocol. They reserve the right to change this, as hardware and
software changes.
- constants and compiled program code are stored in their own read-only memories.
- "local" memory is unique to each thread, but paradoxically slower than shared memory. Don't use it!
Branching with Predication
PTX conditional branches are implemented using predication. You
declare a .pred register, use a "setp" (set predicate) instruction to set the
predicate register from a comparison, and then the "@" prefix (as in "@%p1" or
"@!%p1") predicates any instruction, such as a branch, on that predicate. For example:
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x+blockDim.x*blockIdx.x;
    if (param<3.0f) vals[i]=param;
}
(Try this in NetRun now!)
.entry _Z9set_arrayPff (
        .param .u32 __cudaparm__Z9set_arrayPff_vals,
        .param .f32 __cudaparm__Z9set_arrayPff_param)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<8>;
    .reg .f32 %f<4>;
    .reg .pred %p<3>;
    //  9  __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    mov.f32 %f2, 0f40400000; // 3
    setp.lt.f32 %p1, %f1, %f2;
    @!%p1 bra $Lt_0_1026;
    // 10    int i=threadIdx.x+blockDim.x*blockIdx.x;
    // 11    if (param<3.0f) vals[i]=param;
    ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
    cvt.u32.u16 %r2, %tid.x;
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r3, %rh1, %rh2;
    add.u32 %r4, %r2, %r3;
    mul.lo.u32 %r5, %r4, 4;
    add.u32 %r6, %r1, %r5;
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    st.global.f32 [%r6+0], %f1;
$Lt_0_1026:
    exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Here, we use bra to skip over all the index calculation and global memory store if the comparison comes out false.
Predicate registers can also be inputs to and, or, xor, not, and mov instructions, for example to combine the conditions of nested branches, as in the hand-written sketch below.
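For instance, a nested test like "if (a<b && a>c)" might compile to something along these lines (a hand-written sketch, not compiler output; the register names are made up):
    .reg .pred %p<4>;
    setp.lt.f32 %p1, %f1, %f2;        // %p1 = (a < b)
    setp.gt.f32 %p2, %f1, %f3;        // %p2 = (a > c)
    and.pred %p3, %p1, %p2;           // %p3 = both conditions true
    @%p3 st.global.f32 [%r6+0], %f1;  // store only where the combined test holds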