7/20/2011

07-20-11 - A cond_var that's actually atomic

So, all the cond vars that we've seen so far (and all the other implementations I've seen) , don't actually do { signal_unlock } atomically.

They make cond var work even though it's not atomic, in ways discussed previously, ensuring that you will only ever get false wakeups, never missed wakeups. But false wakeups are still not ideal - they are a performance bug. What we would really like is to minimize thread switches, and also ensure that when you set a condition, the guy who wakes up definitely sees it.

For example, normal cond var implementations require looping in this kind of code : (looping implies waking up and then going back to sleep)


thread 1 :
  cv.lock();
  x = 3;
  cv.signal_unlock();

thread 2 :
  cv.lock();
  if ( x != 3 ) // normal cond var needs a while here
    cv.unlock_wait_lock();
  ASSERT( x == 3 );
  cv.unlock();

(obviously in real code you will often have multiple signals and other things that can cause races, so you generally will always want the while to loop and catch those cases, even if the condvar doesn't inherently require it).

Furthermore, if you jump back in and mess with the condition, like :


thread 1 :
  cv.lock();
  x = 3;
  cv.signal_unlock();

  cv.lock();
  x = 7;
  cv.unlock();

most cond_vars don't gaurantee that thread 2 necessarilly sees x == 3 at all. The implementation is free to send the signal, thread 2 wakes up but doesn't get the lock yet, thread 1 unlocks then relocks, sets x = 7 and unlocks, now thread 2 gets the lock finally and sees x is not 3. If signal_unlock is atomic (and transfers ownership of the mutex directly to the waiter) then nobody can sneak in between the signal and when the receiver gets to see the data that triggered the signal.

One way to do it is to use a mutex implementation in which ownership of the mutex is transferred directly by Event signal. For example, the per-thread-event-mutex from an earlier post. To unlock this mutex, you first do some maintenance, but the actual ownership transfer happens in the SetEvent(). When a thread owns the mutex, anyone else trying to acquire the mutex is put into a wait state. When one of them wakes up it is now the owner (they can only be woken from unlock).

With this style of mutex, we can change our ownership transfer. Instead of unlocking the mutex and handing it off to the next waiter to acquire the mutex, you hand it off to the next waiter directly on the signal.

In pseudo-code it looks like this :


lock :
    if owner == null, owner = me
    else
        add me to lock-waiter list
        wait me

unlock :
    set owner = pop next off lock-waiter list
    wake owner

unlock_wait_lock :
    add me to signal-waiter list
    set owner = pop next off lock-waiter list
    wake owner
    wait me
    // mutex is locked by me when I wake up

signal_unlock :
    set owner = pop next off signal-waiter list
    wake owner

reasonably easy. Conceptually simple ; it's just like a normal mutex, except that instead of one list of threads waiting for the lock, there are two lists - one of threads waiting for the lock to come from "unlock" and one waiting for the lock to come from "signal_unlock". This is (almost (*)) the absolute minimum of thread transfers; signal_unlock atomically unlocks the mutex and also unlocks a waiter in the same operation. unlock_wait_lock then has to do an unlock and then when it wakes from wake it owns the mutex.

Note that there is still one non-atomic gap, in between "unlock" and "wait_lock" in the unlock_wait_lock step. You can use SignalObjectAndWait there on Windows but as noted previously that is not actually atomic. But maybe less likely to thread switch in the gap. (* = this is the only spot where we can do a thread transfer we don't want; we could wake a thread and in so doing lose our time slice, then if we get execution back we immediatley go into a wait)

Anyway, here is a working version of an atomically transferring cond var built on an MCS-style mutex. Some notes after the code.

at pastebin

Notes :

1. broadcast is SCHED_OTHER of course. TODO : fix that. It also has to relock the mutex each time around the loop in order to transfer it to the next waiter. That means broadcast can actually thread-thrash a lot. I don't claim that this implementation of broadcast is good, I just wanted to prove that broadcast is possible with this kind of condvar. (it's really most natural only for signal)

2. I use a mutex to protect the waitlist. That was just for ease and simplicity because it's already difficult and complicated without trying to manage the waitlist lockfree. TODO : fix that.

(I think you could do it by passing in a dummy node to unlock_internal in unlock_wait_lock instead of passing in null; that keeps the mutex from becoming free while you build the wait list; then after the wait list is built up, get the dummy node out; but this is not quite trivial ; for the most part the external mutex protects the waitlist so this is the only spot where there's an issue).

(as usual the valid complaint about the mutex is not that it takes 100-200 clocks or whatever, it's that it could cause unnecessary thread switches)

3. As noted in the code, you don't actually have to alloc & free the event handles, they can come from the TLS since there is always just one per thread and it's auto-reset.

Anyway, I think it's an interesting proof of concept. I would never use atomics that are this complex in production code because it's far too likely there's some subtle mistake which would be absolute hell to track down.

No comments:

old rants