Lock-Free Multicore Data Structures

CS 641 Lecture, Dr. Lawlor

Basic multicore design is pretty simple: just add locks where multiple threads need to modify the same data structure.  This works well when there are only a few threads, because contention for each lock stays low.
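As a point of comparison, here's a minimal lock-based sketch (my addition, not from the lecture): a shared std::map where every update is guarded by an OpenMP critical section, so only one thread touches the map at a time.

#include <omp.h>
#include <map>
#include <cstdio>

std::map<int,float> table; /* shared data structure */

int main(void) {
    /* Coarse-grained locking: one big lock around every update. */
    #pragma omp parallel for
    for (int i=0;i<100000;i++) {
        #pragma omp critical /* the "lock": only one thread in here at a time */
        {
            table[i*8193]=i*1.0f;
        }
    }
    printf("table holds %d entries\n",(int)table.size());
    return 0;
}

This is simple and correct, but every thread serializes on the same critical section, which is exactly the contention problem described below.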

As multicore machines grow to hundreds or thousands of processors, locks generally don't scale well: with only a few locks, contention becomes enormous; with many fine-grained locks, the locking overhead itself starts to add up.

So for large multicore machines, "lock-free" designs are common.  This is a pretty elastic term, covering a range of strategies.

One serious complicating factor is the machine's "memory consistency model": the rules for how writes from one thread are allowed to appear to other threads.  If you write to two memory locations A, then B, and some other thread reads your new value of B, you might hope it would then also read your new value of A.  Unfortunately, your write to A might be stalled in your store buffer (on a TLB miss, for example), so the B write becomes visible before the A write.  Many machines require a "memory fence" operation to make sure all previous writes are externally visible.
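To make that ordering explicit in code, here's a minimal sketch (my addition, using C++11 std::atomic fences rather than anything from the lecture): the writer stores A, issues a release fence, then sets a flag B; the reader waits on B, issues an acquire fence, and only then reads A.

#include <atomic>
#include <thread>
#include <cstdio>

int A=0;               /* ordinary data */
std::atomic<int> B(0); /* flag that publishes A */

void writer(void) {
    A=42;                                                /* write A first... */
    std::atomic_thread_fence(std::memory_order_release); /* fence: A must be visible before B */
    B.store(1,std::memory_order_relaxed);                /* ...then write B */
}

void reader(void) {
    while (B.load(std::memory_order_relaxed)==0) {}      /* spin until B's new value shows up */
    std::atomic_thread_fence(std::memory_order_acquire); /* fence: don't read A until after B */
    printf("A=%d\n",A);                                  /* prints 42, thanks to the fences */
}

int main(void) {
    std::thread t(writer);
    reader();
    t.join();
    return 0;
}

Without the fences (and with plain non-atomic variables), the reader could legally see B's new value but A's old one.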

Example: lock-free hashtable design

There's a good presentation on Dr. Cliff Click's lock-free hashtable (written in Java, for Azul's massively multicore native-Java hardware with 54 cores per chip).  The read side is relatively straightforward, especially using his idea of just exposing the memory consistency model to the user.  Lock-free writes are more complex, especially when resizing the hashtable.

Here's my crude take on his ideas, leaving out the tricky resize step.  This has survived only minimal testing, so don't build your life support system on this!
#include <omp.h>
#include <cstdio>   /* printf */
#include <iostream> /* std::cout */

/* For performance debugging: count hashtable collisions */
int collisions=0;

/* A lockless hashtable. Both reads and writes are lock-free.
   Does not yet implement hashtable resize. */
template <class KEY, class VALUE>
class LockFreeHashtable {
    KEY missingk;   /* invalid key */
    VALUE missingv; /* invalid value */

    long size;
    struct KEY_VALUE {
        KEY k;
        VALUE v;
    };
    volatile KEY_VALUE *data;
    void operator=(const LockFreeHashtable &) {} /* do not copy us */
public:
    LockFreeHashtable(long size_,KEY missingk_=KEY(), VALUE missingv_=VALUE())
        :missingk(missingk_), missingv(missingv_),
         size(0), data(0)
    {
        reinit(size_);
    }
    ~LockFreeHashtable() { delete[] data; }

    /* Reinitialize to be this new size (must be a power of two,
       since lookups mask with size-1) */
    void reinit(long size_) {
        delete[] data;
        size=size_;
        data=new KEY_VALUE[size];
        for (int i=0;i<size;i++) {
            data[i].k=missingk;
            data[i].v=missingv;
        }
    }

    /* Read the VALUE for this KEY */
    volatile const VALUE &get(const KEY &k) {
        int idx=k;
        while (true) { /* while we haven't found that key yet */
            idx &= size-1; /* === idx %= size; */
            if (data[idx].k==k) { /* old key */
                return data[idx].v;
            }
            if (data[idx].k==missingk) { /* missing key */
                return missingv;
            }
            idx++; /* move down: keep looking! */
            #pragma omp atomic
            collisions++;
        }
    }
    /* Write a copy of VALUE for this KEY */
    void put(const KEY &k,const VALUE &v) {
        int idx=k;
        while (true) { /* while we haven't found that key yet */
            idx &= size-1; /* === idx %= size; */
        check_key:
            if (data[idx].k==k) { /* fast path: reuse old key */
                data[idx].v=v;
                return;
            }
            if (data[idx].k==missingk) { /* try to claim new key */
                data[idx].k=k; /* provisional ownership (a real implementation would claim the slot with an atomic compare-and-swap) */
                goto check_key; /* subtle: jump back and re-check ownership *before* use */
            }
            idx++; /* move down: keep looking! */
            #pragma omp atomic
            collisions++;
        }
    }

    /* Count number of valid values in table */
    int count(void) {
        int sum=0;
        for (int i=0;i<size;i++) {
            if (data[i].k!=missingk) sum++;
        }
        return sum;
    }
};

LockFreeHashtable<int,float> h(0, -1,-99.0);

enum {n=100000}; /* total number of hashtable operations */
inline int make_key(int i) {
    i*=8193; /* randomizing function */
    return i^(i>>16);
}
int do_hashtable_writes(void) {
    collisions=0;
    #pragma omp parallel for
    for (int i=0;i<n;i++)
        h.put(make_key(i),i*1.23456);
    return 0;
}
int do_hashtable_reads(void) {
    collisions=0;
    double sum=0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i=0;i<n;i++)
        sum+=h.get(make_key(i));
    return (int)sum; /* return the sum so the reads can't be optimized away */
}

int foo(void) {
    for (int nthreads=1;nthreads<=4;nthreads*=2) {
        h.reinit(1024*512);
        printf("%d threads:\n",nthreads);
        omp_set_num_threads(nthreads);

        /* time_function (a NetRun timing helper) returns seconds per call */
        double t=time_function(do_hashtable_writes);
        printf(" writes: %.3f ns per (%.3f ms total)\n",t*1.0e9/n,t*1.0e3);
        std::cout<<" collisions: "<<collisions<<"\n";
        std::cout<<" total values: "<<h.count()<<"\n";

        t=time_function(do_hashtable_reads);
        printf(" reads: %.3f ns per (%.3f ms total)\n",t*1.0e9/n,t*1.0e3);
        std::cout<<" collisions: "<<collisions<<"\n";
    }

    return 0;
}

(Try this in NetRun now!)

Scalability seems relatively good, as long as collisions are rare.

The above example stores ints and floats, which are relatively safe to overwrite in place.  For dynamically allocated objects, the "ABA problem" makes deallocation quite tricky; Java's solution is garbage collection, and there's also a neat manual-memory solution known as Hazard Pointers.
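To make the ABA problem concrete, here is a small sketch (my addition, not from the lecture) of the classic lock-free stack pop built on compare-and-swap; the comments mark the window where a recycled node address fools the CAS.  Hazard pointers fix this by having each reader publish the pointer it is about to dereference, so no other thread frees or recycles that node in the meantime.

#include <atomic>

/* Minimal Treiber-style lock-free stack (illustration only: no memory reclamation). */
struct Node {
    int value;
    Node *next;
};

std::atomic<Node*> top(nullptr);

void push(Node *n) {
    n->next=top.load();
    /* retry until we swing top from our snapshot to the new node */
    while (!top.compare_exchange_weak(n->next,n)) {}
}

Node *pop(void) {
    Node *old_top=top.load();
    while (old_top!=nullptr) {
        Node *next=old_top->next;  /* (1) read the successor of our snapshot */
        /* (2) DANGER: between (1) and (3), another thread may pop old_top,
           free it, and push a *new* node that happens to reuse the same address.
           The CAS below still sees "top==old_top" (same bits: A...B...A),
           so it succeeds, but 'next' is now stale garbage.  This is the
           ABA problem.  Garbage collection, hazard pointers, or tagged
           pointers (a version counter packed next to the address) prevent it. */
        if (top.compare_exchange_weak(old_top,next)) return old_top;  /* (3) */
        /* CAS failed: old_top was reloaded with the current top; retry */
    }
    return nullptr; /* stack was empty */
}

int main(void) {
    static Node a={1,nullptr}, b={2,nullptr};
    push(&a); push(&b);
    Node *p=pop();
    return (p==&b)?0:1; /* single-threaded sanity check */
}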