Lock-Free Multicore Data Structures
CS 641 Lecture, Dr. Lawlor
Basic multicore design is pretty simple: just add locks where multiple
threads need to modify the same data structure. This works well
as long as there are only a few threads, because contention for each
lock stays low.
As multicore machines grow to hundreds or thousands of
processors, locks generally don't scale well: if you only have a few
locks, lock contention becomes enormous; if you have many fine-grained
locks, the lock overhead starts to add up.
So for large multicore machines, "lock-free" designs are common.
This is a pretty elastic term, covering a range of strategies:
- We don't use locks every time, but only when we really need them (e.g., only for writes, never for reads).
- We don't use locks per se, but we build basically the same thing ourselves (loop until free).
- We don't use locks, instead we use atomic instructions (see the sketch after this list).
- We don't use locks, because we assume multiple threads never access the same data.
- We don't use locks, because we can prove that multiple threads can never access the same data at the same time.
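For example, the atomic-instruction approach replaces a whole
lock/increment/unlock sequence with one indivisible hardware operation.
Here's a minimal sketch (my own illustration, not from the lecture), using
the OpenMP atomic pragma that the hashtable code below also uses, plus the
equivalent GCC builtin:
#include <omp.h>
int counter=0; /* shared by all threads */
void bump(void) {
	/* With a lock: omp_set_lock(&l); counter++; omp_unset_lock(&l);
	   With an atomic instruction, the hardware does the whole
	   read-modify-write as one indivisible step: */
#pragma omp atomic
	counter++;
	/* Equivalent GCC builtin (compiles to a lock-prefixed add on x86): */
	/* __sync_fetch_and_add(&counter,1); */
}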
One serious complicating factor is the machine's "memory consistency
model": how writes from one thread are allowed to appear in other
threads. If you write to two memory locations, first A and then B, and some
other thread reads your new value of B, you might hope they would then
also read your new value of A. Unfortunately, your write to A
might be stalled out in your store buffer (on a TLB miss, for example),
and so the B write shows up before your A write. Many machines
require a "memory fence" operation to make sure all previous writes are
externally visible.
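Here's a minimal sketch of the classic producer/consumer flag pattern
(my own illustration), using GCC's __sync_synchronize() full memory fence:
volatile int payload=0; /* location A: the actual data */
volatile int ready=0;   /* location B: the "data is ready" flag */
void producer(void) {
	payload=42;           /* write A */
	__sync_synchronize(); /* memory fence: make A visible before B */
	ready=1;              /* write B */
}
int consumer(void) {
	while (ready==0) {}   /* spin until B shows up */
	__sync_synchronize(); /* read-side fence, needed on weakly ordered machines */
	return payload;       /* now we also see the new A */
}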
Example: lock-free hashtable design
There's a good presentation on Dr. Cliff Click's lock-free hashtable
(in Java, for Azul's native-Java massive multicore hardware, with 54
cores per chip). The read side is relatively straightforward,
especially using his idea of just exposing the memory consistency model
to the user. Lock-free writes are more complex, especially when
resizing the hashtable.
Here's my crude take on his ideas, leaving out the tricky resize
step. This has survived only minimal testing, so don't build your
life support system on this!
#include <omp.h>
#include <cstdio>   /* printf */
#include <iostream> /* std::cout */

/* For performance debugging: count hashtable collisions */
int collisions=0;
/* A lockless hashtable.  Both reads and writes are lock-free.
   Does not yet implement hashtable resize. */
template <class KEY, class VALUE>
class LockFreeHashtable {
	KEY missingk; /* invalid key */
	VALUE missingv; /* invalid value */
	long size; /* table length: must be a power of two */
	struct KEY_VALUE {
		KEY k;
		VALUE v;
	};
	volatile KEY_VALUE *data;
	void operator=(const LockFreeHashtable &) {} /* do not copy us */
public:
	LockFreeHashtable(long size_,KEY missingk_=KEY(), VALUE missingv_=VALUE())
		:missingk(missingk_), missingv(missingv_),
		 size(0), data(0)
	{
		reinit(size_);
	}
	~LockFreeHashtable() { delete[] data; }
	/* Reinitialize to be this new size */
	void reinit(long size_) {
		delete[] data;
		size=size_;
		data=new KEY_VALUE[size];
		for (int i=0;i<size;i++) {
			data[i].k=missingk;
			data[i].v=missingv;
		}
	}
	/* Read the VALUE for this KEY */
	volatile const VALUE &get(const KEY &k) {
		int idx=k; /* KEY must convert to an integer; that conversion is the "hash" */
		while (true) { /* while we haven't found that key yet */
			idx &= size-1; /* === idx %= size, since size is a power of two */
			if (data[idx].k==k) { /* old key */
				return data[idx].v;
			}
			if (data[idx].k==missingk) { /* missing key */
				return missingv;
			}
			idx++; /* move down: keep looking! */
#pragma omp atomic
			collisions++;
		}
	}
	/* Write a copy of VALUE for this KEY */
	void put(const KEY &k,const VALUE &v) {
		int idx=k;
		while (true) { /* while we haven't found that key yet */
			idx &= size-1; /* === idx %= size, since size is a power of two */
		check_key:
			if (data[idx].k==k) { /* fast path: reuse old key */
				data[idx].v=v;
				return;
			}
			if (data[idx].k==missingk) { /* try to claim new key */
				data[idx].k=k; /* provisional ownership (a real version would use compare-and-swap here) */
				goto check_key; /* subtle: re-check ownership *before* writing the value */
			}
			idx++; /* move down: keep looking! */
#pragma omp atomic
			collisions++;
		}
	}
	/* Count number of valid values in table */
	int count(void) {
		int sum=0;
		for (int i=0;i<size;i++) {
			if (data[i].k!=missingk) sum++;
		}
		return sum;
	}
};
LockFreeHashtable<int,float> h(0, -1,-99.0); /* missing key -1, missing value -99.0 */

enum {n=100000}; /* total number of hashtable operations */

inline int make_key(int i) {
	i*=8193; /* randomizing function */
	return i^(i>>16);
}
int do_hashtable_writes(void) {
	collisions=0;
#pragma omp parallel for
	for (int i=0;i<n;i++)
		h.put(make_key(i),i*1.23456);
	return 0;
}
int do_hashtable_reads(void) {
	collisions=0;
	double sum=0.0;
#pragma omp parallel for reduction(+:sum)
	for (int i=0;i<n;i++)
		sum+=h.get(make_key(i));
	return sum;
}
int foo(void) { /* foo() and time_function() come from NetRun's test scaffolding */
	for (int nthreads=1;nthreads<=4;nthreads*=2) {
		h.reinit(1024*512); /* a power of two, as the table requires */
		printf("%d threads:\n",nthreads);
		omp_set_num_threads(nthreads);
		double t=time_function(do_hashtable_writes);
		printf(" writes: %.3f ns per (%.3f ms total)\n",t*1.0e9/n,t*1.0e3);
		std::cout<<" collisions: "<<collisions<<"\n";
		std::cout<<" total values: "<<h.count()<<"\n";
		t=time_function(do_hashtable_reads);
		printf(" reads: %.3f ns per (%.3f ms total)\n",t*1.0e9/n,t*1.0e3);
		std::cout<<" collisions: "<<collisions<<"\n";
	}
	return 0;
}
(Try this in NetRun now!)
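NetRun normally supplies the foo() entry point and a time_function() timing
helper for you. To run the code elsewhere, a crude stand-in harness like the
sketch below works (my own sketch; paste the helper above foo() if you put
everything in one file), compiled with OpenMP enabled, e.g. g++ -O2 -fopenmp:
#include <omp.h>
/* Crude stand-in for NetRun's time_function(): seconds for one call of fn */
double time_function(int (*fn)(void)) {
	double start=omp_get_wtime();
	fn();
	return omp_get_wtime()-start;
}
int foo(void); /* the test driver above */
int main(void) {
	return foo();
}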
Scalability seems relatively good, as long as collisions are rare.
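One caveat in the crude put() above: two threads can both see the same empty
slot and both store their keys with plain writes, so one insertion can be
silently overwritten. Click's real implementation claims a slot with an
atomic compare-and-swap instead. Here's a minimal standalone sketch of that
claim step (the names slot_key and claim_slot are my own, and it assumes an
int-sized key and GCC's __sync builtins):
volatile int slot_key=-1; /* -1 plays the role of missingk: slot is empty */
bool claim_slot(int my_key) {
	/* Atomically: if (slot_key==-1) { slot_key=my_key; return true; } else return false; */
	return __sync_bool_compare_and_swap(&slot_key,-1,my_key);
}
Only one of several simultaneous callers gets true back and owns the slot; the
losers re-probe, just like the goto check_key re-check, but without the window
where one thread's claim can be stomped by another's.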
The above example is for ints and floats, which are relatively safe. For
dynamically allocated objects, the "ABA problem" makes deallocation quite
tricky: one thread can free an object while another still holds a pointer to
it, and the memory can then be reused for a new object at the same address,
fooling a compare-and-swap into succeeding. Java's solution to this is garbage
collection. There's a neat solution known as Hazard Pointers.
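To make the ABA hazard concrete, here's a sketch (my own illustration, not
from the lecture) of the classic lock-free stack pop, where a compare-and-swap
can succeed even though the node it saw has already been freed and its memory
reused:
struct Node { int value; Node *next; };
Node * volatile top=0; /* shared stack top */
int pop(void) {
	while (true) {
		Node *old_top=top;
		if (old_top==0) return -1; /* stack empty */
		Node *next=old_top->next; /* (1) read the next pointer */
		/* Danger: between (1) and the CAS below, another thread may pop
		   old_top, delete it, and push a brand-new node that lands at the
		   same address.  The CAS still sees the "same" pointer and succeeds,
		   but next is stale garbage--the ABA problem. */
		if (__sync_bool_compare_and_swap(&top,old_top,next)) {
			int v=old_top->value;
			delete old_top; /* unsafe without garbage collection or hazard pointers */
			return v;
		}
	}
}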