Simulating Heat Flow in Parallel
CS 441/641 Lecture, Dr. Lawlor
Theoretically, we're solving Laplace's Equation (a special case of the Poisson Problem). In practice, we're just doing repeated neighborhood averaging:
new_temperature(x,y) = 0.25 * ( temperature(x-1,y) + temperature(x+1,y) +
                                temperature(x,y-1) + temperature(x,y+1) );
Boundary elements, which are missing a neighbor on one side, are typically handled separately, for example by (a short sketch follows this list):
- Imposing a fixed value, like an inside or outside temperature
- Copying over a value from the interior, making an insulating boundary
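For concreteness, here's a minimal sketch of both treatments, written against the array2D<float> class defined below; the function name apply_boundaries and the outside_temp parameter are illustrative, not part of the original code.
/* Illustrative only: fixed temperature on the top and left edges,
   insulating (copied) boundary on the bottom and right edges. */
void apply_boundaries(array2D<float> &t,float outside_temp) {
  int w=t.nx(), h=t.ny();
  for (int x=0;x<w;x++) {
    t(x,0)=outside_temp;  // fixed value along the top edge
    t(x,h-1)=t(x,h-2);    // insulating: copy the adjacent interior row
  }
  for (int y=0;y<h;y++) {
    t(0,y)=outside_temp;  // fixed value along the left edge
    t(w-1,y)=t(w-2,y);    // insulating: copy the adjacent interior column
  }
}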
This is a naturally parallel problem, but it requires communication
along the boundaries between processors every step. It's a simple
toy example that has the same general communication pattern as more
complex physics codes, such as fluid dynamics or ice sheet modeling.
Here's a simple sequential implementation.
#include <vector>   // std::vector, std::swap
#include <fstream>  // std::ofstream (for the image dump below)
#include <iostream> // std::cout
#include <cmath>    // fmod, sqrt

/* Store a 2D array as a row-major 1D array */
template <class T>
class array2D {
  int wid,ht;
  std::vector<T> data; /* wid * ht elements */
public:
  array2D(int w,int h) :wid(w), ht(h), data(w*h) {}
  // Return array size
  inline int nx() const {return wid;}
  inline int ny() const {return ht;}
  // Manipulate array elements
  T &operator() (int x,int y) {return data[y*wid+x];}
  T operator() (int x,int y) const {return data[y*wid+x];}
  // Swap our data with this array
  void swap(array2D<T> &other) {
    std::swap(wid,other.wid);
    std::swap(ht ,other.ht );
    std::swap(data,other.data);
  }
};
/* Dump a 2D array to a PGM ("P5") grayscale image file */
template <class T>
void write(const array2D<T> &arr,const char *name) {
  std::ofstream f(name,std::ios_base::binary);
  f<<"P5\n"; // binary grayscale
  f<<arr.nx()<<" "<<arr.ny()<<"\n"; // dimensions
  f<<"255\n"; // byte data
  for (int y=0;y<arr.ny();y++)
    for (int x=0;x<arr.nx();x++)
    {
      float v=arr(x,y)*255.99;
      if (v<0) v=0;     // clamp *before* converting to a byte
      if (v>255) v=255;
      unsigned char c=(unsigned char)v;
      f.write((char *)&c,1);
    }
}
int foo(void) {
  int w=1000, h=1000;
  array2D<float> cur(w,h);
  array2D<float> next(w,h);
  // Make initial conditions
  for (int y=0;y<cur.ny();y++)
    for (int x=0;x<cur.nx();x++)
    {
      cur(x,y)=fmod(0.01*sqrt(x*x+y*y),1.0);
      // subtle: need boundary conditions for next array
      next(x,y)=cur(x,y);
    }
  // Run a few iterations of blurring
  enum {nblur=100};
  double start=time_in_seconds(); // timer provided by the NetRun environment
  for (int blur=0;blur<nblur;blur++)
  {
    for (int y=1;y<cur.ny()-1;y++)
      for (int x=1;x<cur.nx()-1;x++)
      {
        next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
      }
    cur.swap(next);
  }
  double elapsed=time_in_seconds()-start;
  std::cout<<"Performance: "<<elapsed/((w-2)*(h-2)*nblur)*1.0e9<<" ns/pixel\n";
  // Dump final image (good for debugging)
  write(cur,"out.ppm");
  return 0;
}
(Try this in NetRun now!)
This takes about 10.9 nanoseconds per pixel on my Q6600 in pure sequential mode.
Adding OpenMP around the "blur" loop actually performs well and doesn't
crash, but it doesn't give the right answer, since you can't start
working on iteration 50 until you've finished iteration 49. The
right place to add OpenMP is around the "y" loop, which gives decent
performance.
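For example, here's a minimal sketch of that placement, reusing the cur/next arrays and loop bounds from the sequential code above:
for (int blur=0;blur<nblur;blur++)
{
  // Parallelize over rows: within one blur iteration,
  // every row of "next" can be computed independently.
  #pragma omp parallel for
  for (int y=1;y<cur.ny()-1;y++)
    for (int x=1;x<cur.nx()-1;x++)
      next(x,y)=0.25*(cur(x-1,y)+cur(x+1,y)+cur(x,y-1)+cur(x,y+1));
  cur.swap(next); // the blur iterations themselves stay sequential
}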
It's a little tricky to solve this kind of problem using SIMD
instructions. The big problem is the x-1 and x+1 neighbor accesses,
which are shifted by one float relative to the current vector; they
can be implemented with either unaligned loads or shuffles (swizzles).
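Here's a rough, untested sketch of the unaligned-load option using SSE intrinsics (needs #include <xmmintrin.h>), processing four floats per trip through the x loop; any leftover pixels at the end of a row would still need the scalar loop.
for (int y=1;y<cur.ny()-1;y++)
  for (int x=1;x+4<=cur.nx()-1;x+=4) // scalar cleanup loop needed for the last few pixels
  {
    __m128 left =_mm_loadu_ps(&cur(x-1,y)); // unaligned load, shifted one float left
    __m128 right=_mm_loadu_ps(&cur(x+1,y)); // unaligned load, shifted one float right
    __m128 up   =_mm_loadu_ps(&cur(x,y-1)); // row above
    __m128 down =_mm_loadu_ps(&cur(x,y+1)); // row below
    __m128 sum=_mm_add_ps(_mm_add_ps(left,right),_mm_add_ps(up,down));
    _mm_storeu_ps(&next(x,y),_mm_mul_ps(_mm_set1_ps(0.25f),sum));
  }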
In MPI, the hard part is that if we divide up "cur" across processors,
neighboring processors need to exchange their boundary rows every step
of the outer loop. This involves sends and receives, and is simplest if
you only divide up one axis, like the Y loop.
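Here's a hedged sketch of that per-step exchange, assuming each rank owns a horizontal strip of rows plus one extra "ghost" row above and below; the function name and layout are illustrative, not from a finished implementation.
#include <mpi.h>
/* Illustrative ghost-row exchange: cur holds this rank's strip, with
   row 0 and row cur.ny()-1 acting as copies of the neighbors' edge rows. */
void exchange_ghost_rows(array2D<float> &cur,int rank,int size) {
  int w=cur.nx();
  int top=1, bot=cur.ny()-2; // first and last rows this rank actually owns
  MPI_Status status;
  if (rank>0) // trade edge rows with the rank above us
    MPI_Sendrecv(&cur(0,top),  w,MPI_FLOAT,rank-1,0,
                 &cur(0,top-1),w,MPI_FLOAT,rank-1,0,
                 MPI_COMM_WORLD,&status);
  if (rank<size-1) // trade edge rows with the rank below us
    MPI_Sendrecv(&cur(0,bot),  w,MPI_FLOAT,rank+1,0,
                 &cur(0,bot+1),w,MPI_FLOAT,rank+1,0,
                 MPI_COMM_WORLD,&status);
}
MPI_Sendrecv pairs each send with its matching receive, so neighboring ranks can't deadlock waiting on each other; you would call something like this once per blur iteration, before the averaging loop.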