07-17-11 - Atman's Multi-way Ticket Lock

But first - Atman sent me a note about the ticket lock which is related and worth sharing.

Imagine we're in a "high performance computing" type environment: we have a mess of threads running, each locked to its own processor, which is the appropriate situation for something like a ticket lock. Now they all try to get the lock. What actually happens?

N threads run into the mutex and 1 gets the lock. So the remaining (N-1) go into their spin loops :


while ( m_serving($).load(std::mo_acquire) != my_ticket ) ;

but what is this actually doing? When you try to load that shared variable, the core does something like :

is this cache line current?
yes - okay, read the variable from my copy of the cache line

Now the first guy holding the mutex unlocks it. That store takes the line exclusive and invalidates every other core's copy. Now the remaining (N-1) spinners try to read m_serving again. But this time each one does :

is this cache line current?
no it's not - fetch the new copy of the cache line
okay, read the variable from my copy of the cache line

so the cache line had to be copied around (N-1) times. You can see the pattern : to do N unlocks with N waiters you wind up doing (N-1) + (N-2) + ... + 1 cache line copies, which is about N^2/2 - so O(N^2). Obviously this is not okay for N large.

(note that this is why putting a backoff pause in your spin loop can actually be a performance advantage even on non-hyperthreaded cores - it reduces cache line traffic ; also, in the special case of the ticket lock the waiters actually know how far they are from the front of the queue, so they can do "proportional backoff" by subtracting "now serving" from "my_ticket" and pausing for a time proportional to that distance, as in the sketch below).
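Here's a minimal sketch of proportional backoff on a plain ticket lock, in standard C++11 atomics rather than the `($)` Relacy notation used in this post; std::this_thread::yield() stands in for whatever pause/delay primitive you'd actually use :

#include <atomic>
#include <thread>

struct ticket_mutex_pb
{
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;

    ticket_mutex_pb() : m_ticket(0), m_serving(0) { }

    void lock()
    {
        unsigned int me = m_ticket.fetch_add(1,std::memory_order_acq_rel);
        for(;;)
        {
            unsigned int cur = m_serving.load(std::memory_order_acquire);
            if ( cur == me )
                break;
            // we know exactly how many lockers are ahead of us :
            unsigned int dist = me - cur;
            // pause proportional to our distance from the front :
            for(unsigned int i=0;i<dist;i++)
                std::this_thread::yield();
        }
    }

    void unlock()
    {
        m_serving.fetch_add(1,std::memory_order_release);
    }
};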

Okay, but we can do better. Leave m_ticket as a single gate variable, but split m_serving into many "ways" (ways in the cache sense). So depending on your ticket # you look at a different serving number. This is just like a very large bakery - rather than have a single "now serving" sign, we have one for odd tickets and one for even tickets ; you stand on the side of the room with the appropriate sign for you and just read that sign.

The code is :


struct ticket_mutex_ways
{
    enum { NUM_WAYS = 16 };
    std::atomic<unsigned int> m_ticket;
    // note : m_serving objects should be cache-line-size padded :
    std::atomic<unsigned int> m_serving[NUM_WAYS];
    VAR_T(unsigned int)  m_lock_holder; // ticket of the current lock holder
    
    ticket_mutex_ways()
    {
        m_ticket($).store( 0 );
        for(int i=0;i<NUM_WAYS;i++)
            m_serving[i]($).store( 0 );
    }
    ~ticket_mutex_ways()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel); // mo_acquire ; *1
    
        int way = me % NUM_WAYS; // which "now serving" sign I watch
    
        rl::linear_backoff bo;
        for(;;)
        {
            // spin reading only the serving counter in my way :
            unsigned int cur = m_serving[way]($).load(std::mo_acquire);
            if ( me == cur )
                break;
        
            bo.yield($);
        }
        
        m_lock_holder($) = me;
    }

    void unlock()
    {
        unsigned int next = m_lock_holder($) + 1;
        int way = next % NUM_WAYS; // hand off to the next ticket's way
        
        m_serving[way]($).store(next,std::mo_release);
    }
};

the key thing is that in the spin loop you are only touching the serving variable in your way, and there is no cache contention with up to NUM_WAYS lockers. (as noted - you need cache line size padding between the variables)
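For concreteness, here's one way to do that padding in standard C++11; the 64-byte line size is an assumption - use your target's actual cache line size (C++17 has std::hardware_destructive_interference_size for this) :

#include <atomic>

// each serving counter gets its own cache line :
struct padded_atomic_uint
{
    alignas(64) std::atomic<unsigned int> value;
};

static_assert( sizeof(padded_atomic_uint) == 64, "padding failed" );

// in the mutex, m_serving becomes :
//   padded_atomic_uint m_serving[NUM_WAYS];
// and you spin on m_serving[way].value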

(*1 standard caveat here - this only needs to be acquire, but then you need other mechanisms to protect your mutex, so beware)

Note that "m_lock_holder" really doesn't belong in the mutex structure; it's an annoyance that I have to put it there; it should just be held on the stack until the unlock. If you use some kind of "guard" class to wrap your mutex lock/unlock it would be more appropriate to store this in the guard. (in fact it is probably good class design to make lock & unlock take the "guard" as a parameter, because that allows more flexibility).
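For example, here's a sketch of that guard idea; it assumes a hypothetical variant of the mutex where lock() returns the ticket and unlock() takes it as a parameter (those signatures are my invention, not the code above) :

struct ticket_guard
{
    ticket_mutex_ways & m_mutex;
    unsigned int m_my_ticket; // lives on the locker's stack frame, not in the mutex

    explicit ticket_guard(ticket_mutex_ways & m) : m_mutex(m)
    {
        m_my_ticket = m_mutex.lock(); // hypothetical : lock() returns the ticket
    }
    ~ticket_guard()
    {
        m_mutex.unlock(m_my_ticket); // hypothetical : unlock() takes the ticket
    }
};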

This is pretty cool; I think it's about as fast as you can get if you don't care about your mutex being rather large. One nice thing about it is that you don't need to know your number of CPUs. There are a lot of similar algorithms that break unless NUM_WAYS is >= the number of threads. (for example, you can do basically the same thing but just use a bool in each way to indicate locked or not, and that works fine as long as num threads is <= num ways; that would be "Anderson's array-based queue lock" BTW - sketched below). With Atman's algorithm you can even choose to make NUM_WAYS smaller than your thread count; it stays correct, and it stays fast as long as the number of simultaneous contenders is at most the number of ways.
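For reference, a minimal sketch of that bool-per-way variant (Anderson's array-based queue lock), again in plain C++11 atomics; note it is only correct while the number of threads that can contend is <= NUM_SLOTS :

#include <atomic>

struct anderson_lock
{
    enum { NUM_SLOTS = 16 }; // must be >= max number of contending threads

    std::atomic<unsigned int> m_next_slot;
    // as before, each flag should really be padded to its own cache line :
    std::atomic<bool> m_go[NUM_SLOTS];

    anderson_lock() : m_next_slot(0)
    {
        m_go[0].store( true );
        for(int i=1;i<NUM_SLOTS;i++)
            m_go[i].store( false );
    }

    unsigned int lock()
    {
        unsigned int slot = m_next_slot.fetch_add(1,std::memory_order_acq_rel) % NUM_SLOTS;
        while ( ! m_go[slot].load(std::memory_order_acquire) )
            ; // spin ; backoff elided
        m_go[slot].store( false, std::memory_order_relaxed ); // re-arm my slot for later reuse
        return slot;
    }

    void unlock(unsigned int slot)
    {
        m_go[(slot+1) % NUM_SLOTS].store( true, std::memory_order_release );
    }
};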
