
07-18-10 - Mystery - Do Mutexes need More than Acquire-Release -

What memory order constraints do Mutexes really need to enforce ?

This is a surprisingly unclear topic and I'm having trouble finding any good information on it. In particular there are a few specific questions :

1. Does either Mutex Lock or Unlock need to be Sequentially Consistent? (eg. a global sync/ordering point) (and a followup : if they don't *need* to be Seq_Cst, is there a good argument for them to be Seq_Cst anyway?)

2. Does either Lock or Unlock need to keep memory accesses from moving IN , or only keep them from moving OUT ? (eg. can Lock just be Acquire and Unlock just be Release ?)

Okay, let's get into it a bit. BTW by "mutex" I mean "CriticalSection" or "Monitor". That is, something which serializes access to a shared variable.

In particular, it should be clear that instructions moving *OUT* is bad. The main point of the mutex is to do :


y = 1;

Lock(x)

  load x
  x ++;
  store x;

Unlock(x)

y = 2;

and obviously the load should not move out the top nor should the store move out the bottom. This just means the Lock must be Acquire and the Unlock must be Release. However, the y=1 could move inside from the top, and the y=2 could move inside from the bottom; once both are inside, y=1 is immediately overwritten by y=2, so the y=1 assignment could be completely eliminated as a dead store.
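
Concretely, under just Acquire/Release the compiler (or CPU) is free to rewrite the above as :


Lock(x)

  y = 1;   // moved in from the top ; now a dead store, can be removed
  load x
  x ++;
  store x;
  y = 2;   // moved in from the bottom

Unlock(x)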

Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides so it's hard to understand (god damn slides). Basically he argues that moving code into the Mutex (Java style) is fine, *except* if you allow a "try_lock" type function, which lets you invert the mutex; with try_lock, lock() must be a full barrier, but unlock() still doesn't need to be.
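
The try_lock problem looks roughly like this (my sketch of the kind of example in the paper, in the post's pseudocode; assume only these two threads ever touch x and m):


// Thread 1 :
x = 1;         // plain store, before the lock
Lock(m);
// ...

// Thread 2 :
if ( ! TryLock(m) )
{
    // TryLock failed, so Thread 1 must already hold m, and so Thread 1's
    // "x = 1" came before its Lock(m) in program order.
    ASSERT( x == 1 );   // fires if "x = 1" was allowed to sink into
                        // Thread 1's critical section
}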

Joe Duffy mentions this subject but doesn't come to any conclusions. He does argue that it can be confusing if they are not full barriers. I think he's wrong about that and his example is terrible. You can always cook up very nasty examples if you touch shared variables both inside mutexes and outside them. I would like to see an example where *well written code* behaves badly.

One argument for making them full barriers is that CriticalSection provides full barriers on Windows, so that's what people are used to, and it's good to give people what they are used to. Some coders may see "Lock" and think code moves neither in nor out. But on some platforms the full barriers make the mutex much more expensive.

To be concrete, is this a good SpinLock ?


#include <atomic>

std::atomic<int> lock(0);   // 0 = free , 1 = held

void Lock()
{
    int expected = 0;
    // seq_cst CAS ; spin until we flip lock from 0 to 1
    while ( ! lock.compare_exchange_weak( expected, 1, std::memory_order_seq_cst ) )
        expected = 0;   // a failed CAS writes the current value into expected

}

void Unlock()
{
    lock.store( 0, std::memory_order_release );

    // or, to make Unlock a full barrier :
    // lock.exchange( 0, std::memory_order_seq_cst );
}

One issue that Joe mentions is fairness and notifying other processors. If you use the non-fencing Unlock, then you aren't immediately giving other spinning cores a chance to grab your lock; you sort of bias towards yourself getting the lock again under high contention. IMO this is a very nasty complex issue and is a good reason not to roll your own mutexes; the OS has complex mechanisms to prevent live locks and starvation and all that shit.

For more concreteness - Viva64 has a nice analysis of Dmitriy V'jukov's implementation of the Peterson Lock. This is a specific implementation of a lock which does not have *any* seq_cst ordering point; the Lock() is Acquire_Release ordered (so loads inside can't move up, and stores before it can't move in) and Unlock is only Release ordered.
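
For reference, the lock under discussion is roughly this shape (a from-memory sketch in C++0x atomics; details may differ from Dmitriy's actual code):


#include <atomic>

// Peterson lock for two threads, tid = 0 or 1 ; note nothing here is seq_cst
std::atomic<int> flag[2];   // flag[t] != 0 means thread t wants the lock
std::atomic<int> turn;      // which thread backs off when both want it

void Lock( int tid )
{
    int other = tid ^ 1;
    flag[tid].store( 1, std::memory_order_relaxed );
    // acq_rel RMW : orders the flag store above against the loads below
    turn.exchange( other, std::memory_order_acq_rel );
    while ( flag[other].load( std::memory_order_acquire )
         && turn.load( std::memory_order_relaxed ) == other )
        ;   // spin
}

void Unlock( int tid )
{
    flag[tid].store( 0, std::memory_order_release );   // Release only
}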

The question is - would using a minimally-ordering Lock implementation such as Dmitriy's cause problems of any kind? Obviously Dmitriy's lock is correct in the sense of providing mutual exclusion and data race freedom, so the issue is not that; it's more a question of whether it causes practical programming problems or severely unexpected behavior. What about interaction with file IO or other non-simple-memory-access resources? Is there a good reason not to use such a minimally-ordering lock?

6 comments:

Brian said...

Certainly moving stuff into a lock could cause problems. Suppose you allow moving a wait loop or a yield statement into the lock? Not that a compiler would actually do such a thing, but if you are not careful you could sacrifice liveness.

cbloom said...

" Certainly moving stuff in a lock could cause problems. Supposed you allow moving a wait loop into the lock or a yield statement? Not that a compiler would actually do such a thing, but if you are not careful you could sacrifice liveness. "

Hmm. Interesting point. Yield should never cause a problem (I don't think). Any point in your program could be a yield or not; it should never affect correctness.

Wait() in the Windows sense (Wait on Event) is of course a problem. Presumably the compiler won't move around anything like that.

cbloom said...

BTW a more direct issue is moving mutexes against each other and thereby causing deadlock.

If this :

Lock(A)
..
Unlock(A)
Lock(B)
..
Unlock(B)


Can be rearranged to this :

Lock(A)
..
Lock(B)
Unlock(A)
..
Unlock(B)

you can have deadlock. So locks must never be able to move up past unlocks.

That is satisfied if Lock is Acquire_Release and Unlock is Release. In that case the Lock has a #StoreStore barrier before it, so it can't move up above the Unlock.

Dmitry Vyukov said...

Hi Charles,

> Hans Boehm : Reordering Constraints for Pthread-Style Locks goes into this question in a bit of detail, but it's fucking slides...

Check out the technical report:
http://www.hpl.hp.com/techreports/2005/HPL-2005-217R1.pdf

> 1. Does either Mutex Lock or Unlock need to be Sequential Consistent?

I think they do not need to be. A mutex provides mutual exclusion and synchronization for *particular* data, and you do not care what happens with the outside world. And if you do care, you either have data races, or you have acquired mutexes for that other data too, and we are back where we began.

> 2. Does either Lock or Unlock need to keep instructions from moving IN?

I think that they do not need to. It's either indistinguishable or you have races.

As for fairness, I believe that's just the wrong level to handle fairness in the general case. The low-level primitive must be fast in the first place, and fairness must be handled at higher architectural levels (for example, with FIFO queues, or with special fair mutexes, which are a different thing from the low-level synchronization primitive).

However, there is a quirk. A mutex unlock must indeed not be able to sink below another mutex lock at the *program* level; that can lead to deadlock.
But you don't actually need Acquire_Release on lock to achieve that; Acquire_Release is too strong, it also prevents the sinking at the *hardware* level.
What you need is just a compiler fence (std::atomic_signal_fence(), or maybe even just C++ volatile) to prevent code movement at the compiler level, plus the compiler+hardware fence that prevents code movement OUT.
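
In C++0x terms, something like this (just a sketch, assuming the spinlock from the post):

void Unlock()
{
    // release store : compiler+hardware ordering, nothing moves OUT past it
    lock.store( 0, std::memory_order_release );
    // compiler-only fence : keeps a later Lock()'s spin loop from moving
    // up above the store at compile time, which is what could deadlock
    std::atomic_signal_fence( std::memory_order_seq_cst );
}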

Btw, the one who has a nice article with an analysis of my implementation is Anthony Williams. Viva64 are Russian folks who deal with static verification of OpenMP and 64-bit C/C++ code; they have a nice interview with me on Relacy Race Detector (http://www.viva64.com/articles/code-analyzers/) :)

--
Dmitriy V'jukov

Anonymous said...

I understand that there is no need for mutex operations to be Seq_Cst, because even non-SC acquire semantics on lock and release on unlock provide SC (a total order) for all the data that is protected by the mutexes, even different ones.

Just try to observe non-SC behaviour of data while using mutexes to access it. You won't be able to! :-)

Mutexes provide SC, even when implemented with non-SC operations.
