07-17-11 - CLH list-based lock

The multi-way ticket lock we just did is very similar to some classic spin locks. I found this nice page : scalable synchronization pseudocode (and its parent page at cs.rochester) ; similar material is covered, with nice drawings, in Mutual Exclusion: Classical Algorithms for Locks (PDF).

I wanted to see how the classic MCS queue lock compares to my per-thread-mpsc lock ; the answer is : not much. The classic queue locks are really closer kin to the multi-way ticket lock, and I'll try to show that now. The MCS lock is probably better known, but the CLH lock is simpler, so I'll deal with that one.

The idea of these locks is to avoid the heavy cache contention inherent to the basic single-variable gate locks. To solve that, the idea is to use a distributed gate; basically one gate variable per waiter, and it's the responsibility of the unlocker to open the gate for the next waiter. So there has to be some kind of linked list so that the unlocker can find the next waiter. And these locks will be inherently FIFO and SCHED_OTHER and all that. (these are really only appropriate for kernels or kernel-like environments)
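For contrast, here's roughly what a basic single-variable gate lock looks like (a hypothetical sketch in plain C++11 atomics, not the Relacy-style notation used in the code below) ; every waiter spins on the same word, so every unlock bounces that one cache line around all the waiters :

#include <atomic>

// hypothetical single-gate test-and-set spinlock, just for contrast :
struct tas_spinlock
{
    std::atomic<int> gate{0};

    void lock()
    {
        // every contender spins on the same variable :
        while ( gate.exchange(1, std::memory_order_acquire) != 0 )
        {
            // spin ; a real version would pause/backoff here
        }
    }
    void unlock()
    {
        gate.store(0, std::memory_order_release);
    }
};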

The CLH algorithm is usually described as a linked list, with the "head" of the list being the node that currently has access to the mutex, and the "tail" being the variable held in the lock struct. When new waiters come in, they tack onto the tail, thus it's FIFO.

There's a node for each waiter, and each node contains the gate for the guy after me :


struct clh_node
{
    // list is linked from tail backwards :
    std::atomic<clh_node *> prev;
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node()
    {
        prev($).store(0);
        succ_must_wait($).store(0);
    }
};

We also need a way of providing a node per-thread - *per-lock*! This is different from my event-queue-mutex, which just needs a node *per-thread*; the reason is that the nodes in CLH keep getting used even after you unlock, so you can't just reuse your own. However, you can free some node when you unlock - just not necessarily the one you passed in. So anyhoo, we need some struct to pass in this node for us; here it is :

struct ThreadNode
{
    std::atomic<clh_node *> pNode;
    
    ThreadNode()
    {
        pNode($).store( new clh_node );
    }
    ~ThreadNode()
    {
        // note that the pNode I delete might not be the one I created
        //  so don't try to hold it by value
        clh_node * n = pNode($).exchange(0, std::mo_relaxed);
        delete n;
    }
};

This could be in TLS, or it could be in the mutex::guard, or whatever.

Okay, now that we have our helpers we can write the code. When the mutex is held, the tail node will have succ_must_wait = 1. When you take the lock you stick yourself on the tail and then wait on your predecessor; to unlock the mutex you just set succ_must_wait = 0 on yourself, and that allows the guy after you to go :


struct clh_mutex
{
public:
    // m_tail points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_tail;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_tail($).store( new clh_node );
    }
    ~clh_mutex()
    {
        clh_node * n = m_tail($).exchange(0);
        delete n;
    }
    
    void lock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        //me->prev($).store(0, std::mo_relaxed );
        clh_node * pred = m_tail($).exchange(me, std::mo_acq_rel);
        me->prev($).store(pred, std::mo_relaxed );
        
        // wait on predecessor's flag -
        //  this is why pred can't free himself
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
    }
    
    void unlock(ThreadNode * I)
    {
        clh_node * me = I->pNode($).load(std::mo_relaxed);
        
        clh_node * pred = me->prev($).load(std::mo_relaxed);
        me->succ_must_wait($).store( 0, std::mo_release );
        // take pred's node :
        //  this leaves my node allocated, since succ is still looking at it
        I->pNode($).store( pred, std::mo_relaxed );
    }

};

Okay, I think this is reasonably self-explanatory. BTW the reason the classical locks are the way they are is often to avoid test-and-set ops, which the machines of the day didn't have or which were very expensive; here we use only one exchange, and the rest is just loads and stores.
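For what it's worth, a hypothetical usage sketch (using the clh_mutex and ThreadNode above; the ThreadNode is safe to reuse across acquires because unlock() leaves a usable, different, node in pNode) :

clh_mutex g_mutex;

void worker()
{
    // this thread's node holder for g_mutex ;
    // could live in TLS or inside a scoped guard instead
    ThreadNode my_node;

    for(int i=0;i<100;i++)
    {
        g_mutex.lock(&my_node);
        // ... critical section ...
        g_mutex.unlock(&my_node);
    }
}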

That matches the classical algorithm description, but it's a lot more expensive than necessary. The first thing you might notice is that we don't actually need to store the linked list at all. All we need to do is get "pred" from lock to unlock. So you can either store it in the mutex struct, or put it in the "guard" (ThreadNode in this case); I think putting it in the guard is better, but I'm going to put it in the mutex right now because it's more analogous to our next step :


struct clh_node
{
    // should the guy after me wait ?
    std::atomic<int> succ_must_wait;
    
    clh_node() { succ_must_wait($).store(0); }
};

struct clh_mutex
{
public:
    // m_lock points to the tail of the waiter list all the time
    std::atomic<clh_node *> m_lock;
    std::atomic<clh_node *> m_lock_pred;
    std::atomic<clh_node *> m_lock_holder;

    clh_mutex()
    {
        // make an initial dummy node - must have succ_must_wait = 0
        m_lock($).store( new clh_node );
        m_lock_pred($).store( 0 );
        m_lock_holder($).store( 0 );
    }
    ~clh_mutex()
    {
        clh_node * n = m_lock($).exchange(0);
        delete n;
    }

    clh_node * alloc_slot()
    {
        return new clh_node;
    }
    void free_slot(clh_node * p)
    {
        delete p;
    }
    
    void lock()
    {
        clh_node * me = alloc_slot();
    
        me->succ_must_wait($).store( 1, std::mo_relaxed );
        clh_node * pred = m_lock($).exchange(me, std::mo_acq_rel);
        
        rl::linear_backoff bo;
        while ( pred->succ_must_wait($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        m_lock_holder($).store(me, std::mo_relaxed );
        m_lock_pred($).store(pred, std::mo_relaxed );
    }
    
    void unlock()
    {
        clh_node * me = m_lock_holder($).load(std::mo_relaxed);
        
        clh_node * pred = m_lock_pred($).load(std::mo_relaxed);
        
        me->succ_must_wait($).store( 0, std::mo_release );

        free_slot( pred );
    }

};

And rather than pass in the nodes, I just bit the bullet and allocated them. But now the obvious thing to do is to make alloc_slot and free_slot just take & return nodes from an array. But then "me" is just stepping a pointer through an array, so our "linked list" should just be a sequence of adjacent elements in an array :

struct clh_mutex
{
public:
    // m_lock is the tail ticket counter ; (ticket & (NUM_WAYS-1)) indexes the gate array
    #define NUM_WAYS    16
    // should be cache line sized objects :
    std::atomic<int> succ_must_wait[NUM_WAYS];
    std::atomic<int> m_lock;
    VAR_T(int) m_lock_pred;

    clh_mutex()
    {
        // initial dummy slot must have succ_must_wait = 0 ; all the rest start closed
        m_lock($).store(0);
        succ_must_wait[0]($).store(0);
        for(int i=1;i<NUM_WAYS;i++)
        {
            succ_must_wait[i]($).store(1);
        }
        m_lock_pred($) = 0;
    }
    ~clh_mutex()
    {
    }

    void lock()
    {   
        int pred = m_lock($).fetch_add(1, std::mo_acq_rel);
        pred &= (NUM_WAYS-1);
        
        rl::linear_backoff bo;
        while ( succ_must_wait[pred]($).load(std::mo_acquire) )
        {
            bo.yield($);
        }
        
        // m_lock_pred just remembers my index until unlock
        //  could be a local
        m_lock_pred($) = pred;
    }
    
    void unlock()
    {
        int pred = m_lock_pred($);
        int me = (pred+1)&(NUM_WAYS-1);
        
        // recycle this slot :
        succ_must_wait[pred]($).store(1, std::mo_relaxed);
        
        // free my lock :
        succ_must_wait[me]($).store( 0, std::mo_release );
    }

};

(as usual, m_lock_pred doesn't really belong as a member variable in the lock).
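For instance, here's a hypothetical sketch of that in plain C++11 atomics, where lock() just returns the slot index and unlock() takes it back (same algorithm, no m_lock_pred member) :

#include <atomic>

struct clh_array_mutex
{
    enum { NUM_WAYS = 16 };
    // should be cache-line-sized/padded in real code :
    std::atomic<int> succ_must_wait[NUM_WAYS];
    std::atomic<int> m_lock;    // tail ticket counter

    clh_array_mutex() : m_lock(0)
    {
        succ_must_wait[0].store(0);         // first slot starts open
        for(int i=1;i<NUM_WAYS;i++)
            succ_must_wait[i].store(1);     // the rest start closed
    }

    int lock()
    {
        int pred = m_lock.fetch_add(1, std::memory_order_acq_rel) & (NUM_WAYS-1);
        while ( succ_must_wait[pred].load(std::memory_order_acquire) )
        {
            // spin ; real code would back off / yield here
        }
        return pred;    // caller keeps this until unlock
    }

    void unlock(int pred)
    {
        int me = (pred+1)&(NUM_WAYS-1);
        // re-close the slot I spun on so it can be reused :
        succ_must_wait[pred].store(1, std::memory_order_relaxed);
        // open my gate ; the next waiter is spinning on this :
        succ_must_wait[me].store(0, std::memory_order_release);
    }
};

(The returned index could equally live in a guard object.)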

But this is exactly "Anderson's array-based queue lock" that we mentioned at the end of the ticket-lock post, and it's also just a CLH lock with the nodes stuck in an array. This suffers from the big problem that you must have at least as many array entries as threads that will ever touch the lock, or it doesn't work : if more than NUM_WAYS waiters pile up, the ticket wraps around and two waiters wind up spinning on the same slot, so when that gate opens they both proceed, i.e. it doesn't actually provide mutual exclusion.

I don't think this is actually useful for anything, but there you go.
