CUDA PTX: GPU assembly language
CS 641 Lecture, Dr. Lawlor
CUDA's underlying quasi-assembly language is called PTX (Parallel Thread Execution). The NVIDIA PTX documentation
is the official source, but reading the output from NetRun's
"Disassemble" command is pretty illuminating too. If you actually
want to write raw PTX by hand for some reason, Kenneth details the commands to use.
There are also some good pictures, details, and performance numbers in this 2009 IEEE paper (the full-text PDF is only accessible from on-campus, so download it there).
For example, this CUDA code:
#include <iostream>
#include <cuda.h>

/* error checking */
#define check(cudacall) { int err=cudacall; if (err!=cudaSuccess) std::cout<<"CUDA ERROR "<<err<<" at line "<<__LINE__<<"'s "<<#cudacall<<"\n";}

/* GPU code: set an array to a value */
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x+blockDim.x*blockIdx.x;
    vals[i]=param;
}

int main(int argc,char *argv[]) {
    int w=128, h=2; /* number of threads per block, number of blocks */
    int n=w*h; /* total number of floats */
    float *vals; /* device array of n values */
    check(cudaMalloc( (void**) &vals, n*sizeof(float) )); //Allocate some space
    set_array<<<h,w>>>(vals,1.234); /* Initialize the space on the GPU */

    /* Copy selected elements back to CPU for printing */
    for (int i=0;i<n;i+=5) {
        float f=-999.0; /* CPU copy of value */
        check(cudaMemcpy(&f,&vals[i],sizeof(float),cudaMemcpyDeviceToHost));
        std::cout<<"vals["<<i<<"] = "<<f<<"\n";
    }
    return 0;
}
(Try this in NetRun now!)
Results in this PTX assembly:
.entry _Z9set_arrayPff (
        .param .u32 __cudaparm__Z9set_arrayPff_vals,
        .param .f32 __cudaparm__Z9set_arrayPff_param)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<8>;
    .reg .f32 %f<3>;
    .loc 28 9 0
    //  8  /* GPU code: set an array to a value */
    //  9  __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
    .loc 28 11 0
    // 10    int i=threadIdx.x+blockDim.x*blockIdx.x;
    // 11    vals[i]=param;
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
    cvt.u32.u16 %r2, %tid.x;
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r3, %rh1, %rh2;
    add.u32 %r4, %r2, %r3;
    mul.lo.u32 %r5, %r4, 4;
    add.u32 %r6, %r1, %r5;
    st.global.f32 [%r6+0], %f1;
    .loc 28 12 0
    // 12  }
    exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Briefly:
- CUDA's "threadIdx.x" is PTX "%tid.x".
- CUDA's "blockIdx.x" is PTX "%ctaid.x".
- CUDA's "blockDim.x" is PTX "%ntid.x".
- The GPU has signed, unsigned, and float data types in a variety of
sizes. You specify the data type of each instruction with a suffix
like ".f32" (for floating point, 32-bit). Bitwise operations take an untyped bits data type like ".b32". There
are also ".v2" and ".v4" suffixes for vectors of length 2 or 4 (but not
3!), but these are only used for moves, loads, and stores.
- .reg creates a register with the given type and name. <n> creates registers numbered 0 through n-1. Thus ".reg
.f32 %f<3>;" creates %f0, %f1, and %f2, all of them 32-bit
floats. The compiler seems to generate more register names than
are precisely needed, which implies PTX will eliminate unused register
names during hardware register allocation.
- Floating-point arithmetic operations have an optional ".sat" suffix that clamps the result to lie between 0.0 and 1.0 (inclusive). The short hand-written fragment after this list illustrates these suffixes.
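For instance, this hand-written fragment (not NetRun compiler output; the register names and the address offset are made up for illustration) shows the type, vector, bitwise, and saturation suffixes together:
    .reg .f32 %f<5>;    // five 32-bit float registers, %f0 through %f4
    .reg .u32 %r<3>;    // three 32-bit unsigned registers, used here as values and addresses
    ld.global.v4.f32 {%f0,%f1,%f2,%f3}, [%r0];  // vector load of four floats
    add.sat.f32 %f4, %f0, %f1;                  // float add, result clamped to [0.0, 1.0]
    and.b32 %r2, %r0, %r1;                      // bitwise AND on untyped 32-bit values
    st.global.f32 [%r0+16], %f4;                // ordinary 32-bit float store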
Several interesting architectural details crop up in the CUDA documentation:
- The execution model is very unusual, especially compared to typical "what one thread should do" machine code.
- A "thread" means one single execution of a kernel. This is
mostly conceptual, since the hardware operates on warps of threads.
- A "warp" is a group of 32 threads that all take the same
branches. A warp is really a SIMD group: a bunch of floats sharing one
instruction decoder. The hardware does a good job with predication, so
warps aren't "in your face" like with SSE.
- A "block" is a group of a few hundred threads that have access to
the same "__shared__" memory. The block size is specified in software,
but limited by hardware to 512 or 1024 threads maximum. More threads
per block is generally better, with serious slowdowns for less than 100
or so threads per block. The PTX manual calls blocks "CTAs" (Cooperate
Thread Arrays).
- The entire kernel consists of a set of blocks of threads.
- The memory model is also highly segmented and specialized, unlike the flat memory of modern CPUs.
- "registers" are unique to that thread. Early 8000-series
cards had 8192 registers available; GTX 200 series had 16K registers;
and the new Fermi GTX 400s have 32K registers. Registers are
divided up among threads, so the fewer registers each thread uses, the
more threads the machine can keep in flight, hiding latency.
- "shared" memory is declared with __shared__, and can be read or
written by all the threads in a block. This is handy for
neighborhood
communication where global memory would be too slow. There is at
least 16KB of shared memory available per thread block; Fermi cards can
expose up to 48KB with special config options. Use "__syncthreads()__" to synchronize shared writes across a whole thread block.
- "global" memory is the central gigabyte or so of GPU RAM. All cudaMemcpy calls go to global memory, which is considered "slow" at only 100GB/second!
Older hardware had very strict rules on "global memory coalescing", but
luckily newer (Fermi-era) hardware just prefers locality, if you can
manage it.
- "param" is the PTX abstraction around the parameter-passing
protocol. They reserve the right to change this, as hardware and
software changes.
- constants and compiled program code are stored in their own read-only memories.
- "local" memory is unique to each thread, but paradoxically slower than shared memory. Don't use it!
Branching with Predication
PTX conditional branches are implemented using predication. You
declare a .pred register, use a "setp" (set predicate) instruction to set the
predicate register from a comparison, and then the "@" prefix (as in "@%p1" or
"@!%p1") predicates any instruction, such as a branch, on that predicate. For example:
__global__ void set_array(float *vals,float param) {
    int i=threadIdx.x+blockDim.x*blockIdx.x;
    if (param<3.0f) vals[i]=param;
}
(Try this in NetRun now!)
.entry _Z9set_arrayPff (
        .param .u32 __cudaparm__Z9set_arrayPff_vals,
        .param .f32 __cudaparm__Z9set_arrayPff_param)
{
    .reg .u16 %rh<4>;
    .reg .u32 %r<8>;
    .reg .f32 %f<4>;
    .reg .pred %p<3>;
    //  9  __global__ void set_array(float *vals,float param) {
$LBB1__Z9set_arrayPff:
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    mov.f32 %f2, 0f40400000; // 3
    setp.lt.f32 %p1, %f1, %f2;
    @!%p1 bra $Lt_0_1026;
    // 10    int i=threadIdx.x+blockDim.x*blockIdx.x;
    // 11    if (param<3.0f) vals[i]=param;
    ld.param.u32 %r1, [__cudaparm__Z9set_arrayPff_vals];
    cvt.u32.u16 %r2, %tid.x;
    mov.u16 %rh1, %ctaid.x;
    mov.u16 %rh2, %ntid.x;
    mul.wide.u16 %r3, %rh1, %rh2;
    add.u32 %r4, %r2, %r3;
    mul.lo.u32 %r5, %r4, 4;
    add.u32 %r6, %r1, %r5;
    ld.param.f32 %f1, [__cudaparm__Z9set_arrayPff_param];
    st.global.f32 [%r6+0], %f1;
$Lt_0_1026:
    exit;
$LDWend__Z9set_arrayPff:
} // _Z9set_arrayPff
Here, we use bra to skip over all the index calculation and global memory store if the comparison comes out false.
Predicate registers can also be inputs to and, or, xor, not, and mov instructions, for example to combine the conditions of nested branches, as in the hand-written sketch below.
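For instance, a nested test like "if (a<b && a>c)" might compile to something along these lines (a hand-written sketch, not compiler output; the register names are made up):
    .reg .pred %p<4>;
    setp.lt.f32 %p1, %f1, %f2;        // %p1 = (a < b)
    setp.gt.f32 %p2, %f1, %f3;        // %p2 = (a > c)
    and.pred %p3, %p1, %p2;           // %p3 = both conditions true
    @%p3 st.global.f32 [%r6+0], %f1;  // store only where the combined test holds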