7/18/2011

07-18-11 - MCS list-based lock

We did CLH before, but MCS really is better, so let's go through it.

MCS is another list/node based lock from the ancient days. The nodes are now linked forwards from head to tail. The node at the head of the list currently owns the mutex, and the tail is where the next thread trying to enter the lock appends itself.

Unlike CLH, as soon as you leave the mutex in MCS, your node is no longer in use in the chain, so you can free it right away. This also means you can just keep your node on the stack, and no allocations or any goofy gymnastics are needed.

Like CLH, MCS has the nice performance advantage of spinning only on a local variable (pad the node to a cache line to get the full benefit of this).

The way it works is quite simple. Each node has a "gate" flag which starts out "blocked". You spin waiting for your own gate to open. The previous lock-holder points at you via his "next" pointer ; when he unlocks he sets "next->gate" to unlocked, and that allows you to run.

The code is :


// mcs on stack

struct mcs_node
{
    std::atomic<mcs_node *> next;
    std::atomic<int> gate;
    
    mcs_node()
    {
        next($).store(0);
        gate($).store(0);
    }
};

struct mcs_mutex
{
public:
    // tail is null when lock is not held
    std::atomic<mcs_node *> m_tail;

    mcs_mutex()
    {
        m_tail($).store( NULL );
    }
    ~mcs_mutex()
    {
        ASSERT( m_tail($).load() == NULL );
    }
    
    class guard
    {
    public:
        mcs_mutex * m_t;
        mcs_node    m_node; // node held on the stack

        guard(mcs_mutex * t) : m_t(t) { t->lock(this); }
        ~guard() { m_t->unlock(this); }
    };
    
    void lock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        // set up my node :
        // not published yet so relaxed :
        me->next($).store(NULL, std::mo_relaxed );
        me->gate($).store(1, std::mo_relaxed );
    
        // publish my node as the new tail :
        mcs_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        if ( pred != NULL )
        {
            // (*1) race here
            // unlock of pred can see me in the tail before I fill next
            
            // publish me to previous lock-holder :
            pred->next($).store(me, std::mo_release );

            // (*2) pred not touched any more       

            // now this is the spin -
            // wait on predecessor setting my flag -
            rl::linear_backoff bo;
            while ( me->gate($).load(std::mo_acquire) )
            {
                bo.yield($);
            }
        }
    }
    
    void unlock(guard * I)
    {
        mcs_node * me = &(I->m_node);
        
        mcs_node * next = me->next($).load(std::mo_acquire);
        if ( next == NULL )
        {
            mcs_node * tail_was_me = me;
            if ( m_tail($).compare_exchange_strong( tail_was_me,NULL,std::mo_acq_rel) )
            {
                // got null in tail, mutex is unlocked
                return;
            }
            
            // (*1) catch the race :
            rl::linear_backoff bo;
            for(;;)
            {
                next = me->next($).load(std::mo_acquire);
                if ( next != NULL )
                    break;
                bo.yield($);
            }
        }

        // (*2) - store to next must be done,
        //  so no locker can be viewing my node any more        

        // let next guy in :
        next->gate($).store( 0, std::mo_release );
    }
};

There are two subtle "moments" (in Dmitry's terminology) which have been marked in the code.

The main one is (*1) : this is exactly analogous to the "unlocking in progress" state we talked about with the list-event mutex I designed earlier ; here it's a "locking in progress". That state holds while "m_tail" has been changed but "->next" has not yet been filled. Because m_tail is exchanged first (it is the atomic sync point for the lock), your node is published before pred->next is set, so the linked list can be broken into pieces and invalid during this phase. But the unlocker can easily detect it and spin until the phase passes.
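
To make (*1) concrete, here's one possible interleaving (T0 = current holder unlocking, T1 = new thread locking), traced against the code above :

// T1: pred = m_tail.exchange(me)        // tail now points at T1's node, pred = T0's node
// T0: next = me->next.load()            // still NULL - T1 hasn't linked in yet
// T0: CAS(m_tail, me, NULL) fails       // tail is T1's node, not T0's
// T0: spin reloading me->next ...       // wait out the "locking in progress" phase
// T1: pred->next.store(me)              // link is now complete
// T0: sees next != NULL                 // exits the spin
// T0: next->gate.store(0)               // hand the lock to T1
// T1: sees gate == 0, enters the mutex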

The other important point is (*2) - this is what allows you to keep the node on the stack. Another thread can still be pointing at and writing to your node (the next locker writes your "->next"), but by the time you get to (*2) you know it must be done with you.
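
For example, typical usage just puts the guard in a scope ; the node lives in that stack frame and is safe to reclaim the moment unlock() returns (names here are just for illustration) :

mcs_mutex g_mutex;

void do_work()
{
    mcs_mutex::guard g( &g_mutex ); // constructor calls lock() ; m_node lives in this stack frame

    // ... critical section ...

}   // ~guard calls unlock() ; past (*2) no other thread touches g.m_node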

So it has the same FIFO, SCHED_OTHER, etc. complaints as the previous mutexes. But it also has a lot of advantages. The mutex itself is tiny, just one pointer. The fast path is one exchange and one CAS ; that's not the best, but it's okay. The real advantage, though, is its flexibility.

You can change the spin to a Wait() quite trivially. (just store an Event in the node instead of the "gate" flag)
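
Something like this sketch, in plain C++20 rather than the $ / mo_ notation above, with std::binary_semaphore standing in for an OS Event (names made up, not tested) :

#include <atomic>
#include <semaphore>

struct mcs_node_w
{
    std::atomic<mcs_node_w *> next{ nullptr };
    std::binary_semaphore     gate{ 0 };  // starts blocked ; released by my predecessor
};

struct mcs_mutex_w
{
    std::atomic<mcs_node_w *> m_tail{ nullptr };

    void lock(mcs_node_w * me)
    {
        me->next.store( nullptr, std::memory_order_relaxed );

        // publish my node as the new tail :
        mcs_node_w * pred = m_tail.exchange( me, std::memory_order_acq_rel );
        if ( pred != nullptr )
        {
            // publish me to the previous lock-holder, then Wait() instead of spinning :
            pred->next.store( me, std::memory_order_release );
            me->gate.acquire();
        }
    }

    void unlock(mcs_node_w * me)
    {
        mcs_node_w * next = me->next.load( std::memory_order_acquire );
        if ( next == nullptr )
        {
            mcs_node_w * tail_was_me = me;
            if ( m_tail.compare_exchange_strong( tail_was_me, nullptr, std::memory_order_acq_rel ) )
                return; // no waiter, mutex is now free

            // (*1) again : wait for the in-progress locker to link in
            while ( (next = me->next.load( std::memory_order_acquire )) == nullptr ) { }
        }

        // wake the next guy (fine even if he hasn't called acquire yet)
        next->gate.release();
    }
};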

You can spin and try to CAS m_tail to acquire the lock a bit before waiting if you like.
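
For instance, a hypothetical try_lock : the lock is unheld exactly when m_tail is NULL, so trying to acquire is just a CAS of NULL -> me, and you can spin on that a few times before falling back to the queued lock(). (Same $ / mo_ shorthand as above ; using it through the guard would need a guard variant that doesn't lock in its constructor.) Sketch, not tested :

    bool try_lock(guard * I)
    {
        mcs_node * me = &(I->m_node);

        // set up my node as if I might become the lock holder :
        me->next($).store(NULL, std::mo_relaxed );
        me->gate($).store(1, std::mo_relaxed );

        // lock is unheld iff m_tail is NULL ; try to swap myself in :
        mcs_node * expected = NULL;
        return m_tail($).compare_exchange_strong( expected, me, std::mo_acq_rel );
    }

    void spin_then_lock(guard * I, int spin_count)
    {
        for(int i=0;i<spin_count;i++)
        {
            if ( try_lock(I) )
                return;
            // could pause/yield here between attempts
        }
        lock(I); // give up spinning, join the queue and wait normally
    }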

You can combine spinners and waiters! That's unusual. You can use the exact same mutex as a spinlock or a Wait lock. I'm not sure that's ever a good idea, but it's interesting. (One use for this : if you fail to get a handle to wait on, you can just spin, and your app degrades somewhat gracefully.)

Cool!
