CUDA float speedup
CUDA is NVIDIA's GPU programming platform; programs are written in an extended version of C/C++ and compiled with "nvcc".
PyCUDA: a way to interface with CUDA from the Python programming language
Why use CUDA?
We want to speed up a program that can take advantage of a lot of cores,
for example: a program that calculates the sine of an array of floats
Branching with floats
No need to build a bit mask as you would with SIMD
Extremely well suited for number crunching (it can do floats and ints fast if you have a lot of them)
If you have a large dataset to look through or sort
Why not to use CUDA?
If a program does a lot of sequential, non-repeating work.
For example: the program needs to do a > b > c, where each operation cannot be parallelized or is I/O dominated
If a program will use a lot of RAM (most cards max out at about 1 GB)
If the program cannot be parallelized
If you have no NVIDIA-based GPU (you can't run CUDA on an ATI GPU, or on a multicore CPU)
Why PyCUDA?
PyCUDA provides a lot of nice helper functions built in, and includes many wrappers around the plain C-style CUDA functions, making things simpler to implement.
PyCUDA also provides many common CUDA operations as pre-built functions, see:
http://kered.org/blog/2009-04-13/easy-python-numpy-cuda-cublas/
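For instance, here is a minimal sketch (my own, not the code from that link) of taking the cosine of an array using only PyCUDA's pre-built gpuarray and cumath modules, with no hand-written CUDA C at all:
##code##
##sketch: cos on the GPU using only PyCUDA's pre-built gpuarray/cumath functions
import pycuda.autoinit              # sets up a CUDA context on the default device
import pycuda.gpuarray as gpuarray
import pycuda.cumath as cumath
import numpy

a = numpy.random.randn(4096).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)          # copy the array onto the GPU
a_cos = cumath.cos(a_gpu).get()     # element-wise cos as a GPU kernel, then copy back
print a_cos[:5]
##endcode##
The examples below write the kernel by hand instead, which shows what is actually happening under the hood.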
##All examples were compiled with CUDA 4.2, gcc 4.6.3, PyCUDA 2012.1, Ubuntu 12.04 using LXDE, and an EVGA 460 1GB##
So let's make a simple program in CUDA
##code##
##Robert Avila Hartshorn
##PyCuda ex1.py
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy
import pycuda.driver as drv
import math
import sys

##CUDA CODE##
# This takes a float array, and calls 'cos' on every element in it.
# Very limited max size (only one block of threads).
mod = SourceModule("""
__global__ void cos_gpu(float *a)
{
    int idx = threadIdx.x + threadIdx.y*blockDim.x;  /* index within the block */
    a[idx] = cos(a[idx]);
}
""")

#size = input("Enter a value where x^2 > 512: ")
#print "The matrix size is: ", size*size
for size in range(1, 32):
    # this loops over all the (square) sizes; comment the loop out and
    # uncomment the input above if you want to enter a size yourself
    SIZE_OF_X = size
    SIZE_OF_Y = size
    if (SIZE_OF_X * SIZE_OF_Y) > 1024:
        # max number of threads per block is 1024 on compute capability 2.x
        # cards; on 1.x cards reduce this to 512 across the x and y directions
        # (this check mostly matters if you take the size from user input)
        print "ERR: too many threads for one block"
        sys.exit()

    ##These are for timing##
    secgpu = []
    seccpu = []

    #Time to try it on the GPU#
    for i in range(0, 100):  # a few runs to average out the times
        a = numpy.random.randn(SIZE_OF_X, SIZE_OF_Y).astype(numpy.float32)
        # random floats in a SIZE_OF_X by SIZE_OF_Y array
        a_cos = numpy.empty_like(a)  # a new array for the result

        ##Lets time this!##
        start = drv.Event()
        end = drv.Event()
        start.record()
        ##gpu running time!##
        a_gpu = cuda.mem_alloc(a.nbytes)    # allocate size_of(a) bytes on the GPU
        cuda.memcpy_htod(a_gpu, a)          # copy a onto the device
        func = mod.get_function("cos_gpu")  # retrieve the kernel, store it as func
        func(a_gpu, block=(SIZE_OF_X, SIZE_OF_Y, 1))  # launch one SIZE_OF_X by SIZE_OF_Y block
        cuda.memcpy_dtoh(a_cos, a_gpu)      # copy the result off of the GPU
        ##end time!##
        end.record()
        end.synchronize()
        secs = start.time_till(end)
        secgpu.append(secs)

    #Time to try it on the CPU#
    for i in range(0, 100):  # a few runs to average out the times
        b = numpy.random.randn(SIZE_OF_X, SIZE_OF_Y).astype(numpy.float32)
        b_cos = numpy.empty_like(b)
        ##Lets time this!##
        start = drv.Event()
        end = drv.Event()
        start.record()
        for i in range(0, SIZE_OF_X):
            for ii in range(0, SIZE_OF_Y):
                b_cos[i][ii] = math.cos(b[i][ii])
        # NOTE: the numpy implementation of cos is faster (I think it uses the
        # SIMD float trick we learned, but 256 bits wide with AVX, which this
        # CPU can handle, so a giant speedup there)
        ##numpy version is faster##
        #b_cos = numpy.cos(b)
        ##end time!##
        end.record()
        end.synchronize()
        secs = start.time_till(end)
        seccpu.append(secs)

    avggpu = sum(secgpu) / len(secgpu)
    avgcpu = sum(seccpu) / len(seccpu)
    # print "Starting array: GPU"
    # print a      # prints the starting array (to check it does what I say it does)
    # print "A cos:"
    # print a_cos  # prints the results
    # print "Starting array: CPU"
    # print b
    # print "A cos:"
    # print b_cos
    print "Total number of calculated values: ", SIZE_OF_X*SIZE_OF_Y
    print "Avg time for on GPU: ", avggpu
    print "Avg time for on CPU: ", avgcpu
##endcode##
##OUTPUT## ##(for the first and last few sizes)##
$ python ex1.py
Total number of calculated values: 1
Avg time for on GPU: 0.0527260803431
Avg time for on CPU: 0.00774591999128
Total number of calculated values: 4
Avg time for on GPU: 0.0865897596627
Avg time for on CPU: 0.00869727995247
Total number of calculated values: 9
Avg time for on GPU: 0.0542380798608
Avg time for on CPU: 0.0151286399923
Total number of calculated values: 16
Avg time for on GPU: 0.14174496226
Avg time for on CPU: 0.0201264000498
Total number of calculated values: 25
Avg time for on GPU: 0.0525241601095
Avg time for on CPU: 0.0358937600069
Total number of calculated values: 36
Avg time for on GPU: 0.0719817603752
Avg time for on CPU: 0.0387318400666
##Middle cut out for space##
Total number of calculated values: 784
Avg time for on GPU: 0.0605971200019
Avg time for on CPU: 0.63690208259
Total number of calculated values: 841
Avg time for on GPU: 0.0523747199401
Avg time for on CPU: 0.720301761578
Total number of calculated values: 900
Avg time for on GPU: 0.0531523199007
Avg time for on CPU: 0.72342432142
Total number of calculated values: 961
Avg time for on GPU: 0.0670486401767
Avg time for on CPU: 0.815556480202
##end output##
This was a simple example of PyCUDA, but as you can see there is not that much speedup (and an actual slowdown if you use the numpy version of cos), ranging from about 0.14x (i.e. slower) to 12x.
There is one major problem with this code: it can only scale out to 1024 elements, as that is the max number of threads in a CUDA block (across the x * y * z axes, on compute capability 2.x and higher cards; on 1.x cards the max is 512 — see "Version features and specifications" on http://en.wikipedia.org/wiki/CUDA for more limitations if needed).
Only being able to calculate 1024 (or 512) items at once is very limiting, so we need to increase this. The way we do that is by launching a grid of blocks for the GPU to calculate over, each block being no larger than the max number of threads.
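Side note (not from the original write-up): instead of hard-coding 1024 or 512, PyCUDA lets you ask the card for its limit. A minimal sketch:
##code##
##sketch: ask the card for its per-block thread limit instead of hard-coding it
import pycuda.autoinit              # sets up a context on the default device
import pycuda.driver as drv

dev = pycuda.autoinit.device
max_threads = dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)
print "Max threads per block on this card: ", max_threads
##endcode##
On a compute capability 2.x card like the one used here, this prints 1024.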
ex2.py:
##code##
##Robert Avila Hartshorn
##PyCuda ex2.py
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
import pycuda.tools
import pycuda.cumath
from pycuda.elementwise import ElementwiseKernel
import numpy
import pycuda.driver as drv
import math
import sys

##Based off of http://wiki.tiker.net/PyCuda/Examples/SimpleSpeedTest
##CUDA CODE##
# This takes a float array, and calls 'cos' on every element in it
# NOTE THE C CODE
mod = SourceModule("""
__global__ void cos_gpu_v2(float *out, float *a)
{
    /***************
     * This is the big change in the CUDA code: the index now accounts for
     * which block we are in.
     * Also we have to use C/C++ style comments here, not Python ones.
     ***************/
    const int idx = blockDim.x*blockIdx.x + threadIdx.x;
    a[idx] = cos(a[idx]);
    out[idx] = a[idx];
}
""")

#size = input("Enter a value for block size: ")  # for manual input
block_range = [64, 128, 256, 512, 1024, 2048, 4096]
# grid sizes to test; these happen to be powers of 2 but there is no need for that
for blocks in block_range:
    block_size = 128  # threads per block; can be 1 up to the max (1024 for 2.x cards, 512 for 1.x)
    nbr_tot = blocks * block_size  # total number of items we will calculate
    print "Total number of elements: ", nbr_tot

    ## for timing ##
    secgpu = []
    seccpu = []

    #Time to try it on the GPU#
    for i in range(0, 100):  # a few runs to average out the times
        a = numpy.random.randn(nbr_tot).astype(numpy.float32)  # random floats
        a_cos = numpy.empty_like(a)
        ##Lets time this!##
        start = drv.Event()
        end = drv.Event()
        start.record()
        func = mod.get_function("cos_gpu_v2")  # retrieve the kernel, store it as func
        func(cuda.Out(a_cos), cuda.In(a),
             grid=(blocks, 1), block=(block_size, 1, 1))
        # calls the kernel; this time we are not doing distinct separate calls
        # to put data onto the GPU or pull it off, but rather using the nicer
        # PyCUDA wrappers (cuda.In/cuda.Out) around it
        ##end time!##
        end.record()
        end.synchronize()
        secs = start.time_till(end)
        secgpu.append(secs)

    #Time to try it on the CPU#
    for i in range(0, 100):  # a few runs to average out the times
        b = numpy.random.randn(nbr_tot).astype(numpy.float32)
        b_cos = numpy.empty_like(b)
        ##Lets time this!##
        start = drv.Event()
        end = drv.Event()
        start.record()
        #for i in range(0, nbr_tot):      # the slow element-by-element version,
        #    b_cos[i] = math.cos(b[i])    # but at these sizes even numpy is faster
        b_cos = numpy.cos(b)
        ##end time!##
        end.record()
        end.synchronize()
        secs = start.time_till(end)
        seccpu.append(secs)

    print "The first four elements in the"
    print "GPU"
    for i in range(0, 5):
        print "Value is: ", a[i], " And the cos is: ", a_cos[i]
        # print the first few to show that it is truly doing what I say it is
    print "CPU"
    for i in range(0, 5):
        print "Value is: ", b[i], " And the cos is: ", b_cos[i]
    avggpu = sum(secgpu) / len(secgpu)
    print "Avg time for on GPU: ", avggpu
    avgcpu = sum(seccpu) / len(seccpu)
    print "Avg time for on CPU: ", avgcpu
##endcode##
##OUTPUT## ##(first and last 2)##
$ python ex2.py
Total number of elements: 8192
The first four elements in the
GPU
Value is: 2.11672 And the cos is: -0.51921
Value is: 0.820839 And the cos is: 0.681608
Value is: -0.86604 And the cos is: 0.647848
Value is: -1.19585 And the cos is: 0.366222
Value is: 1.57808 And the cos is: -0.0072851
CPU
Value is: -0.587186 And the cos is: 0.832503
Value is: 0.785571 And the cos is: 0.706985
Value is: -1.76461 And the cos is: -0.192603
Value is: -0.405069 And the cos is: 0.919075
Value is: 0.632074 And the cos is: 0.806804
Avg time for on GPU: 0.341869118065
Avg time for on CPU: 0.127962560654
Total number of elements: 16384
The first four elements in the
GPU
Value is: 0.165707 And the cos is: 0.986302
Value is: 1.1886 And the cos is: 0.372963
Value is: 1.08671 And the cos is: 0.465397
Value is: -0.568446 And the cos is: 0.842739
Value is: 0.909357 And the cos is: 0.614253
CPU
Value is: 1.29302 And the cos is: 0.274219
Value is: -0.0889364 And the cos is: 0.996048
Value is: 0.724472 And the cos is: 0.74885
Value is: 0.0199612 And the cos is: 0.999801
Value is: -0.158194 And the cos is: 0.987513
Avg time for on GPU: 0.346524159014
Avg time for on CPU: 0.255263360888
##snip out middle##
Total number of elements: 262144
The first four elements in the
GPU
Value is: 0.563329 And the cos is: 0.845482
Value is: 0.107945 And the cos is: 0.99418
Value is: 0.00656393 And the cos is: 0.999978
Value is: -1.7701 And the cos is: -0.197987
Value is: -0.491755 And the cos is: 0.881505
CPU
Value is: 2.65379 And the cos is: -0.883363
Value is: 0.571239 And the cos is: 0.841232
Value is: 2.13526 And the cos is: -0.534959
Value is: 1.69509 And the cos is: -0.123977
Value is: -0.0937435 And the cos is: 0.995609
Avg time for on GPU: 1.18923168063
Avg time for on CPU: 3.9944508934
Total number of elements: 524288
The first four elements in the
GPU
Value is: 0.473183 And the cos is: 0.890122
Value is: 0.791383 And the cos is: 0.702862
Value is: 0.478855 And the cos is: 0.887523
Value is: -0.00359091 And the cos is: 0.999994
Value is: 0.226169 And the cos is: 0.974533
CPU
Value is: 0.138571 And the cos is: 0.990414
Value is: 0.121317 And the cos is: 0.99265
Value is: 0.36968 And the cos is: 0.932443
Value is: -2.42157 And the cos is: -0.751791
Value is: 0.667079 And the cos is: 0.785632
Avg time for on GPU: 1.68935103178
Avg time for on CPU: 8.17050975323
##end output##
Speedup ranged from 0.37x (i.e. slower, with very few elements) to 4.8x faster.
As you can see there is some overhead (copying memory onto the GPU can be slow), but the more elements you have the bigger the speedup. Also, the complexity of the code did not increase much (and some of it got simpler, because I was using nicer functions to get memory on and off the device).
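If you want to see how much of that overhead is the memory copy itself, one option is to put separate events around the copy and the kernel. This is only a sketch (it reuses mod, blocks, block_size, nbr_tot and the cuda/drv imports from ex2.py above), not something from the original runs:
##code##
##sketch: separate events around the copy and the kernel (reuses names from ex2.py)
a = numpy.random.randn(nbr_tot).astype(numpy.float32)
a_cos = numpy.empty_like(a)
copy_start, copy_end = drv.Event(), drv.Event()
kern_start, kern_end = drv.Event(), drv.Event()

copy_start.record()
a_gpu = cuda.mem_alloc(a.nbytes)            # explicit allocation, as in ex1.py
out_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)                  # host-to-device copy, timed on its own
copy_end.record()
copy_end.synchronize()

kern_start.record()
func = mod.get_function("cos_gpu_v2")
func(out_gpu, a_gpu, grid=(blocks, 1), block=(block_size, 1, 1))
kern_end.record()
kern_end.synchronize()
cuda.memcpy_dtoh(a_cos, out_gpu)            # copy the result back off the GPU

print "Copy time (ms):   ", copy_start.time_till(copy_end)
print "Kernel time (ms): ", kern_start.time_till(kern_end)
##endcode##
On small arrays the copy time tends to dominate, which is where the overhead in the numbers above comes from.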
Branching
Now let's try to get something more complex in: BRANCHING!
##This is the same code as ex2.py except there is branching, so the CPU side can't use the single numpy.cos call; most of the repeated code is omitted.
##code##
mod = SourceModule("""
__global__ void cos_gpu_v2(float *out, float *a)
{
    const int idx = blockDim.x*blockIdx.x + threadIdx.x;
    if(a[idx] > 0.5 || a[idx] < -0.5)   /* branch: outside -0.5..0.5 take cos, otherwise take sin */
        a[idx] = cos(a[idx]);
    else
        a[idx] = sin(a[idx]);
    out[idx] = a[idx];
}
""")

##note: I only ran the 64 and 128 block sizes with the CPU implementation, due to speed issues
for i in range(0, nbr_tot):
    # the equivalent of for(int i = 0; i < nbr_tot; i++)
    if b[i] > 0.5 or b[i] < -0.5:
        b_cos[i] = numpy.cos(b[i])
    else:
        b_cos[i] = numpy.sin(b[i])
##endcode##
##output##
$ python ex3.py
Total number of elements: 8192
The first four elements in the
GPU
Value is: 1.03226 And the sin/cos is: 0.512879
Value is: 1.38142 And the sin/cos is: 0.188243
Value is: -0.800676 And the sin/cos is: 0.696221
Value is: 0.265616 And the sin/cos is: 0.262504
Value is: 0.888949 And the sin/cos is: 0.630228
CPU
Value is: 1.4849 And the sin/cos is: 0.0857926
Value is: 0.512424 And the sin/cos is: 0.871558
Value is: 1.93362 And the sin/cos is: -0.354911
Value is: -2.25762 And the sin/cos is: -0.634087
Value is: 0.542989 And the sin/cos is: 0.856168
Avg time for on GPU
0.377027838081
Avg time for on CPU
75.8683939362
Total number of elements: 16384
The first four elements in the
GPU
Value is: 0.0619954 And the sin/cos is: 0.0619557
Value is: 0.175449 And the sin/cos is: 0.17455
Value is: 0.0701176 And the sin/cos is: 0.0700602
Value is: 2.40369 And the sin/cos is: -0.739879
Value is: -2.93201 And the sin/cos is: -0.978118
CPU
Value is: -0.295365 And the sin/cos is: -0.291089
Value is: -0.312832 And the sin/cos is: -0.307755
Value is: -0.793606 And the sin/cos is: 0.701279
Value is: -0.0676001 And the sin/cos is: -0.0675487
Value is: 1.58798 And the sin/cos is: -0.0171872
Avg time for on GPU
0.359288320541
Avg time for on CPU
154.235956268
Total number of elements: 262144
The first four elements in the
GPU
Value is: -0.49059 And the sin/cos is: -0.471146
Value is: 0.601886 And the sin/cos is: 0.824269
Value is: -0.406498 And the sin/cos is: -0.395395
Value is: 1.2994 And the sin/cos is: 0.268075
Value is: 0.167874 And the sin/cos is: 0.167086
Avg time for on GPU
1.39188000083
Total number of elements: 524288
The first four elements in the
GPU
Value is: -2.3225 And the sin/cos is: -0.682887
Value is: 1.30079 And the sin/cos is: 0.266742
Value is: 0.532119 And the sin/cos is: 0.861734
Value is: 1.03381 And the sin/cos is: 0.511548
Value is: 0.650671 And the sin/cos is: 0.795677
Avg time for on GPU
1.74932351589
Total number of elements: 1048576
The first four elements in the
GPU
Value is: -0.552435 And the sin/cos is: 0.851249
Value is: 0.383028 And the sin/cos is: 0.373731
Value is: 0.362949 And the sin/cos is: 0.355033
Value is: -1.17558 And the sin/cos is: 0.385008
Value is: -0.0931009 And the sin/cos is: -0.0929664
Avg time for on GPU
2.73544128656
##endoutput##
The speedup here starts to get big, with the small array being about 200x faster! (and... I did not want to wait around for the CPU on the larger ones).
So as we can see, when we start to introduce branching we can get a BIG speedup, because the GPU is working on one float per core at a time. Even with the amount of data doubling each step, the GPU time does not double but grows much more slowly (logarithmically?), so math-intensive work like this gets a large speedup without a lot of effort.
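One last aside (my addition; none of the timings above used it): the CPU loop in the branching example can be vectorized with numpy.where, which computes both branches and then selects element-wise. It does twice the math, but it would have made the larger CPU runs feasible to wait for. A rough sketch:
##code##
##sketch (not part of the original runs): vectorized CPU branch with numpy.where
import numpy

b = numpy.random.randn(8192).astype(numpy.float32)
# where |b| > 0.5 take cos, otherwise take sin; both are computed, then selected
b_cos = numpy.where(numpy.abs(b) > 0.5, numpy.cos(b), numpy.sin(b))
print b[:5]
print b_cos[:5]
##endcode##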