cbloom rants: 07-20-11 - Some notes on condition vars

Let's talk about condition vars a bit. To be clear I won't be talking about the specific POSIX semantics for "condition_var" , but rather the more general concept of what a CV is.

As mentioned before, a CV provides a mechanism to wait on a mutex-controlled variable. I've always been a bit confused about why you would want CV because you can certainly do the same thing with the much simpler concept of "Event" - but with event there are some tricky races (see waiting on thread events for example) ; you can certainly use Event and build up a system with a kind of "register waiter then wait" mechanism ; but basically what you're doing is build a type of eventcount or CV there.

Anyhoo, the most basic test of CV is something like this :


before : x = 0;

thread 1:
    {
        cv.lock();
        x += 1;
        cv.signal_unlock();
    }

thread 2:
    {
        cv.lock();
        while ( x <= 0 )
        {
            cv.unlock_wait_lock();
        }
        x = 7;
        cv.unlock();
    }

after : x == 7;

in words, thread 2 waits for thread 1 to do the inc, then sets x = 7. Now, what if our CV was not legit. For example what if our unlock_wait_lock was just :

cv.unlock_wait_lock() :
    cv.unlock();
    cv.wait();
    cv.lock();

this code would not work. Why? What could happen is thread 2 runs first and sees x = 0, so it does the unlock. Then thread 1 runs and does x+= 1 but there's no one to signal so it just exits. Then thread 2 does the wait and deadlocks. ruh roh.

In this case, it could be fixed simply by using a Semaphore for the signal and making sure you always Post (inc) even if there is no waiter (that's a way of counting the signals). But that doesn't actually fix it - if you had another thread running that could consume wakeups for this CV, it could consume all those wakeups and you would still go into the wait and sleep.

So, it is not required that "unlock_wait_lock" actually be atomic (and in fact, it is not in most CV implementations), but you must not go into the wait if another thread could get in between the unlock and the wait. There are a few ways to accomplish this. One is to use some kind of prepare_wait , something like :

cv.unlock_wait_lock() :
    cv.prepare_wait();
    cv.unlock();
    cv.retire_wait();
    cv.lock();

prepare_wait is inside the mutex so it can run race-free (assuming signal takes the same mutex) ; it could record the signal epoch (the GUID for the signal), and then retire_wait will only actually wait if the epoch is the same. Calls to signal() must always inc the epoch even if there is no waiter. So, with this method if someone sneaks in after your unlock but before your wait, you will not go into the wait and just loop around again. This is one form of "spurious wakeup" and is why unlock_wait_lock should generally be used in a loop. (note that this is not a bad "thread thrashing" type of spurious wakeup - you just don't go to sleep).

This seems rather trivial, but it is really the crucial aspect of CV - it's why CV works and why we're interested in it. unlock_wait_lock does not need to be atomic, but it sort of acts like it is, in the sense that a lost signal is not allowed to pass between the unlock and the wait.

The next problem that you will see discussed is the issue of "wait generations". Basically if you have N waiters on a CV, and you go into broadcast() (aka notify_all) , it needs to signal exactly those N waiters. What can happen is a new waiter could come in and steal the wakeup that should go to one of the earlier waiters.

As usual I was a bit confused by this, because it's specific to only certain types of CV implementation. For example, if broadcast blocks new waiters from waiting on the CV, there is no need for generations. If your CV signal and wait track their "wait epoch", there is no problem with generations (the epoch defines the generation, if you like). If your CV is FIFO, there is no problem with generations - you can let new waiters come in during the broadcast, and the correct ones will get the wakeup.

So the generation issue mainly arises if you are trying to use a counting semaphore to implement your CV. In that case your broadcast might do something like count the number of waiters (its N), then Post (inc) the semaphore N times. Then another waiter comes in, and he gets the dec on the semaphore instead of the guy you wanted.

The reason that people get themselves into this trouble is that they want to make a CV that respects the OS scheduler; obviously a generic CV implementation should. When you broadcast() to N waiters, the highest priority waiter should get the first wakeup. A FIFO CV with an event per thread is very easy to implement (and doesn't have to worry about generations), but is impossible to make respect OS scheduling. The only way to get truly correct OS scheduling is to make all the threads wait on the same single OS waitable handle (either a Semaphore or Event typically). So, now that all your waiters are waiting on the same handle, you have the generation problem.

Now also note that I said before that using epoch or blocking new waiters from entering during broadcast would work. Those are conceptually simple but in practice quite messy. The issue is that the broadcast might actually take a long time - it's under the control of the OS and it's not over until all waiters are awake. You would have to inc a counter by N in broadcast and then dec it each time a thread wakes up, and only when it gets to zero can you allow new waiters in. Not what you want to do.

I tried to think of a specific example where not having wait generations would break the code ; this is the simplest I could come up with :


before : x = 0

thread 0 :

    cv.lock();
    x ++;
    cv.unlock();
    cv.broadcast();

thread 1 :
    cv.lock();
    while ( x == 0 )
        cv.unlock_wait_lock();
    x = 7;
    cv.signal();
    cv.unlock();

thread 2 :
    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();
    x = 37;
    cv.unlock();

after : x == 37

So the threads are supposed to run in sequence, 0,1,2 , and the CV should control that. If thread 2 gets into its wait first, there can't be any problems. The problem arises if thread 1 gets into its wait first but the wakeup goes to thread 2. The bad execution sequence is like this :


thread 1 : cv.lock()
thread 1 : x == 0
thread 1 : cv.unlock_wait ...

thread 0 : cv.lock()
thread 0 : x++;
thread 0 : cv.unlock()
thread 0 : cv.broadcast .. starts .. sees 1 waiter ..

thread 2 : cv.lock
thread 2 : x != 7
thread 2 : cv.unlock_wait - I go into wait

thread 0 : .. cv.broadcast .. wakes thread 2

deadlock

You can look in boost::thread for a very explicit implementation of wait generations. I think it's rather over-complex but it does show you how generations get made (granted some of that is due to trying to match some of the more arcane requirements of the standard interface). I'll present a few simple implementations that I believe are much easier to understand.

A few more general notes on CV's before we get into implementations :

In some places I will use "signal_unlock" as one call instead of signal() and unlock(). One reason is that I want to convey the fact that signal_unlock is trying to be as close to atomic as possible. (we'll show one implementation where it actually is atomic). The other reason is that the normal CV API that allows signal() outside the mutex can create additional overhead. With the separate API you have either :

lock
..blah..
signal
unlock

which can cause "thread thrashing" if the signal wakes the waiter and then immediately blocks it on the lock. Or you can have :

lock
..blah..
unlock
signal

the problem with this for the implementor is that now signal is outside of the mutex and thus the CV is not protected; so you either have to take the mutex inside signal() or protect your guts in some other way; this adds overhead that could be avoided if you knew signal was always called with the mutex held. So both variants of the separate API are bad and the merged one is preferred.

The other issue that you may have noticed that I combined my mutex and CV. This is slightly more elegant for the user, and is easier for the implementor, because some CV implementations could use knowledge of the guts of the mutex they are attached to. But it's not actually desirable.

The reason it's not great is because you do want to be able to use multiple CV's per mutex. (I've heard it said that you might want to use different mutexes with the same CV, but I can't think of an example where you would actually want to do that).

eg. in our broadcast generation example above, it would actually be cleaner to use a single mutex to protect "x" but then use two different CV's - one that you signal when x = 1, and another that you signal when x = 7. The advantage of this is that the CV signalled state corresponds directly to the condition that the waiter is waiting on.

In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition" , because it reduces unnecessary wakeups. The same could be said for Events or Semaphores or any wakeup mechanism - using wakeups to kick threads to check themselves is not desirable.

We're going to go off on a tangent a bit now.

What would be ideal in general is to actually be able to sleep a thread on a predicate. Instead of doing this :


    cv.lock();
    while ( x != 7 )
        cv.unlock_wait_lock();

you would so something like :


    cv.wait( { x == 7 } );

where it's implied that the mutex is locked when the predicate is checked. This is actually not hard to implement. In your signal() implementation, you can walk the list of waiters in the waitset (you are holding the mutex during signal) and just call predicate() on each waiter and see if he should wake up.

The big advantage is that only the threads that actually want to wake up get woken. Dmitry has done an implementation of something like this and calls it fine grained condvar/eventcount (also submitted to TBB as fine grained concurrent monitor ).

As noted before, using multiple CV's can be a primitive way of getting more selective signalling, if you can associate each CV with a certain condition and wait on the right one.

A nice version of this that some operating systems provide is an event with a bit set. They use something like a 32-bit mask so you can set 32 conditions. Then when you Wait() you wait on a bit mask which lets you choose various conditions to wake for. Then signal passes a bit mask of which to match. This is really a very simplified version of the arbitrary predicate case - in this case the predicate is always (signal_bits & waiting_bits) != 0 (or sometimes you also get the option of (signal_bits & waiting_bits) == waiting_bits so you can do a WAIT_ANY or WAIT_ALL for multiple conditions).

The bit-mask events/signals would be awesome to have in general, but they are not available on most OS'es so oh well. (of course you can mimic it on windows by making 32 Events and using WFMO on them all).

Next up, some condvar implementations...

cbloom rants

7/20/2011

07-20-11 - Some notes on condition vars

No comments:

old rants