
07-16-11 - Ticket FIFO Mutex

The Linux kernel internally uses a FIFO spinlock that they call a "ticket lock". The ticket (or "bakery") algorithm is quite a common pattern, so let's take a quick look.

The analogy is the easiest way to understand it. There's an atomic ticket machine; when you walk into the shop you grab a ticket (and the machine increments itself). On the wall is a "now serving" sign that counts up as people turn in their tickets.

This can be implemented most obviously using two ints :


struct ticket_mutex2
{
    // (*0)
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;

    ticket_mutex2()
    {
        m_ticket($).store( 0 );
        m_serving($).store( 0 );
    }
    ~ticket_mutex2()
    {
    }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);
    
        rl::linear_backoff bo;
        for(;;)
        {
            unsigned int cur = m_serving($).load(std::mo_acquire);
            if ( me == cur )
                return;
        
            bo.yield($);
            
            // (*1)
        }
    }

    void unlock()
    {
        // (*2)
        // my ticket must match m_serving
        // (*3)
        //m_serving($).fetch_add(1,std::mo_release);
        unsigned int cur = m_serving($).load(std::mo_relaxed);
        m_serving($).store(cur+1,std::mo_release);
    }
};

*0 : obviously you could make the two counters 16-bit halves and mush them into one int (Linux on x86 used to make them bytes mushed into one word), but it's actually a better demonstration of the algorithm to keep them separate, because it's a weaker constraint. Lockfree algorithms always continue to work if you mush variables together into larger atomic pieces, but rarely continue to work if you split them into smaller independent atomic pieces. So when you're trying to show the fundamental requirements of an algorithm you should use the minimum mushing-together required.

(BTW I don't remotely claim that any of the things I've posted have the minimum synchronization constraints required by the algorithm, but that is always the goal).
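
To make *0 concrete, here's a minimal sketch of the mushed-together variant (mine, not from the post; standard std::memory_order names instead of the shorthand above) :


struct ticket_mutex_packed
{
    // both counters in one 32-bit atomic :
    // ticket in the low 16 bits, "now serving" in the high 16 bits
    std::atomic<unsigned int> m_state{0};

    void lock()
    {
        // grab a ticket ; caveat : the +1 carries into the serving word
        // if the ticket count wraps past 0xFFFF, which is why Linux does
        // a sub-word xadd (or see fetch_add_low_word later in this post)
        unsigned int me = m_state.fetch_add(1,std::memory_order_acq_rel) & 0xFFFF;

        while ( (m_state.load(std::memory_order_acquire)>>16) != me )
        {
            // spin ; a real version would pause/yield here
        }
    }

    void unlock()
    {
        // the plain load & store trick (see *3 below) is no longer legal
        // here, because other threads are concurrently RMW-ing the ticket
        // half of the same atomic word - mushing forces the stronger op :
        m_state.fetch_add(1u<<16,std::memory_order_release);
    }
};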

*1 : you might be tempted to put a Wait here using an eventcount or something, but you can't. The problem is that if multiple threads go to sleep here, only the one thread that holds the next ticket will be able to take the lock. So if you use a generic waitset, you might wake the wrong thread; it won't be able to get in, and you will deadlock. More on this in a moment.

*2 : m_serving is actually protected by the mutex : it is only ever modified by the mutex holder. The actual synchronization for acquiring the lock is the release store to m_serving in unlock pairing with the acquire load of m_serving in lock. Since only the holder writes m_serving, when you get the lock you could store your ticket id as a member in the lock struct, and at unlock it will be equal to m_serving (see the sketch after note *3).

*3 : you can of course use an atomic increment on m_serving, but because of *2 it's not necessary, and a plain load / add / store is cheaper on some architectures (and, as per *0, it's a weaker constraint, so we prefer to demonstrate that the weaker version is correct here).
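
To illustrate *2 and *3 together, here's a sketch (mine, hypothetical) of ticket_mutex2 with the holder's ticket kept in a plain member, so unlock can assert the invariant and bump m_serving with a plain load & store :


struct ticket_mutex2_owner
{
    std::atomic<unsigned int> m_ticket;
    std::atomic<unsigned int> m_serving;
    unsigned int m_owner; // plain member ; only ever touched by the lock holder

    ticket_mutex2_owner() { m_ticket($).store(0); m_serving($).store(0); }

    void lock()
    {
        unsigned int me = m_ticket($).fetch_add(1,std::mo_acq_rel);

        rl::linear_backoff bo;
        while ( m_serving($).load(std::mo_acquire) != me )
            bo.yield($);

        m_owner = me; // we hold the lock now
    }

    void unlock()
    {
        // *2 : only the holder modifies m_serving, so it must equal our ticket :
        RL_ASSERT( m_owner == m_serving($).load(std::mo_relaxed) );
        // *3 : plain load & store, no atomic RMW needed :
        m_serving($).store(m_owner+1,std::mo_release);
    }
};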

Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair), which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread that is next in line gets swapped out, then no currently running thread can take the lock, and we don't have any wakeup mechanism to get that sleeping thread running so we can make progress. This is okay in the kernel because the kernel controls which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock.

So if we want to turn this into a FIFO lock that works in user space, we have to have a sleep/wakeup mechanism.

I don't think this is actually an awesome way to write your own FIFO lock, but it's a nice demonstration of the usefulness of NT's Keyed Events, so I'm going to do that.

You need to get the secret functions :


#include <windows.h>

// look up an undocumented ntdll entry point by name and call it
// (a real version would cache the function pointer instead of calling
// LoadLibrary/GetProcAddress on every call) :
template <typename ret,typename T1,typename T2,typename T3,typename T4>
ret CallNtInternal(const char * funcName,T1 arg1,T2 arg2,T3 arg3,T4 arg4)
{
    typedef ret NTAPI t_func(T1,T2,T3,T4);

    t_func * pFunc = (t_func*) GetProcAddress( LoadLibrary( TEXT("ntdll.dll") ), funcName );
    ASSERT_RELEASE( pFunc != NULL );

    return (*pFunc) (arg1,arg2,arg3,arg4);
}

#define MAKE_NTCALL_4(ret,func,type1,type2,type3,type4) ret func(type1 arg1,type2 arg2,type3 arg3,type4 arg4) { return CallNtInternal<ret>(#func,arg1,arg2,arg3,arg4); }

MAKE_NTCALL_4( LONG,NtCreateKeyedEvent,OUT PHANDLE, IN ACCESS_MASK, IN PVOID, IN ULONG );
MAKE_NTCALL_4( LONG,NtReleaseKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER ); 
MAKE_NTCALL_4( LONG,NtWaitForKeyedEvent,IN HANDLE, IN PVOID, IN BOOLEAN, IN PLARGE_INTEGER );

and then you can make the lock :


struct ticket_mutex2_keyed
{
    std::atomic<unsigned int> m_state;
    // ticket is bottom word
    // now serving is top word

    HANDLE  m_keyedEvent;

    // keyed event must have bottom bit off :
    enum { WAITKEY_SHIFT = 1 };

    ticket_mutex2_keyed()
    {
        m_state($).store( 0 );
        NtCreateKeyedEvent(&m_keyedEvent,EVENT_ALL_ACCESS,NULL,0);
    }
    ~ticket_mutex2_keyed()
    {
        CloseHandle(m_keyedEvent);
    }

    void lock()
    {
        // grab a ticket and inc :
        unsigned int prev = fetch_add_low_word(m_state($),1);
    
        // if ticket matches now serving I have the lock :
        if ( top_word_matches_bottom(prev) )
            return;
    
        // wait on my ticket :
        unsigned int ticket = prev&0xFFFF;
        intptr_t waitKey = (ticket<<WAITKEY_SHIFT);
        NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
    }

    void unlock()
    {
        // inc now serving :
        unsigned int prev = m_state($).fetch_add((1<<16),std::mo_release);

        // get a local copy of the "now serving" that I published :
        prev += (1<<16);

        // if lock was not made open to new entries :       
        if ( ! top_word_matches_bottom(prev) )
        {
            // wake up the one after me in the sequence :
            unsigned int next = (prev>>16);
            intptr_t waitKey = (next<<WAITKEY_SHIFT);
            NtReleaseKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
        }
    }
};

Note that we have had to mush our two state variables together now, because previously unlock only touched the "now serving" counter, but now it also has to check it against the ticket counter to see if anyone is waiting.

Also note that we are taking advantage of the fact that NtReleaseKeyedEvent is blocking. If the Release happens before the Wait, the signal is not lost - the releasing thread blocks until a waiter shows up with the same key.
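
As a standalone illustration of that rendezvous (a hypothetical demo, mine, not part of the lock) :


void keyed_event_rendezvous_demo()
{
    HANDLE ev;
    NtCreateKeyedEvent(&ev,EVENT_ALL_ACCESS,NULL,0);

    PVOID key = (PVOID)0x1234; // any value with the bottom bit clear

    std::thread releaser( [&]{
        // typically runs first : blocks in here until the waiter below arrives
        NtReleaseKeyedEvent(ev,key,FALSE,NULL);
    } );

    Sleep(100); // make it very likely the release happens "too early"
    NtWaitForKeyedEvent(ev,key,FALSE,NULL); // pairs with the release ; not lost

    releaser.join();
    CloseHandle(ev);
}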

Exercise for the reader : make it possible for lock to spin a while before going into the wait.
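
Here's a sketch of one answer (mine, hedged). The subtlety is that once a locker takes the slow path, the unlock that hands it the lock will always call NtReleaseKeyedEvent on its key - unlock decides purely from the counters, not from whether anyone actually blocked. So even if your turn comes up while you spin, you must still call NtWaitForKeyedEvent to consume that release, or the unlocker would block forever. The spin only shortens the time spent blocked; skipping the kernel call entirely would need extra state that tells unlock whether the next in line is really asleep.


void lock_with_spin()
{
    unsigned int prev = fetch_add_low_word(m_state($),1);

    if ( top_word_matches_bottom(prev) )
        return;

    unsigned int ticket = prev&0xFFFF;

    // spin a while hoping our turn comes up :
    for ( int spins=0; spins<1000; spins++ )
    {
        unsigned int cur = m_state($).load(std::mo_acquire);
        if ( (cur>>16) == ticket )
            break; // our turn - but we still must rendezvous below
    }

    // always consume the release that unlock will (or already did) issue for us :
    intptr_t waitKey = (ticket<<WAITKEY_SHIFT);
    NtWaitForKeyedEvent(m_keyedEvent,(PVOID)(waitKey),FALSE,NULL);
}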

I made use of these self-explanatory helpers :


// true if the high 16 bits of x equal the low 16 bits :
bool top_word_matches_bottom( unsigned int x )
{
    unsigned int t = _lrotl(x,16); // swap the two halves
    return t == x;
}

// atomically add inc to the low 16 bits only ; the low word wraps within
// 16 bits instead of carrying into the high word :
unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
{
    unsigned int old = state($).load(std::mo_relaxed);
    while ( ! state($).compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::mo_acq_rel) ) { }
    return old;
}

which do what they do.

Obviously on Linux you could use futex, but there are too many platforms that have neither KeyedEvent nor futex, which makes relying on either one not very attractive.
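
For reference, a hedged sketch of the futex version (mine, not from the post). futex has no way to wake one specific ticket holder, so unlock has to wake everybody and let the losers go back to sleep - the thundering herd that keyed events avoid :


#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <climits>

struct ticket_mutex_futex
{
    // same packed layout : ticket in the low 16 bits, serving in the high 16
    std::atomic<unsigned int> m_state{0};

    // fetch_add_low_word from above, with the Relacy markers stripped :
    static unsigned int fetch_add_low_word(std::atomic<unsigned int> & state,int inc)
    {
        unsigned int old = state.load(std::memory_order_relaxed);
        while ( ! state.compare_exchange_weak(old,((old+inc)&0xFFFF) | (old & 0xFFFF0000),std::memory_order_acq_rel) ) { }
        return old;
    }

    void lock()
    {
        unsigned int prev = fetch_add_low_word(m_state,1);
        unsigned int ticket = prev&0xFFFF;
        for(;;)
        {
            unsigned int cur = m_state.load(std::memory_order_acquire);
            if ( (cur>>16) == ticket )
                return;
            // sleep until m_state changes from the value we just saw
            // (assumes std::atomic<unsigned int> is layout-compatible with
            // the raw 32-bit word futex operates on, true on Linux ABIs) :
            syscall(SYS_futex,(unsigned int *)&m_state,FUTEX_WAIT,cur,NULL,NULL,0);
        }
    }

    void unlock()
    {
        m_state.fetch_add(1u<<16,std::memory_order_release);
        // can't target the next ticket, so wake all waiters ; they re-check :
        syscall(SYS_futex,(unsigned int *)&m_state,FUTEX_WAKE,INT_MAX,NULL,NULL,0);
    }
};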

Some links :

Time-Published Queue-Based Spin Locks
Ticket spinlocks [LWN.net]
spinlocks XXXKSE What to do
Linux x86 ticket spinlock
git.kernel.org - linux kernel git / torvalds / linux-2.6.git / commit
futex(2) - Linux manual page

2 comments:

Anonymous said...

Very nice, but... after fixing the std::memory_order constants to make it compile, it turns out that, ironically, the presumably much faster ticket mutex performs about the same as a simple W32 API critical section with a concurrency of 1 and 2, and is roughly 8 times *slower* with a concurrency of 4 (on a 4-core Intel).

cbloom said...

"Okay, this is a very cheap lock in terms of the number of expensive atomics required, and it's FIFO (fair) which is nice in some cases, but it simply cannot be used outside of a kernel environment. The reason is that if the thread who is next in line get swapped out, then no currently running threads can get the lock, and we don't have any wakeup mechanism to get that sleeping thread to take the lock so we can make progress. This is okay in the kernel because the kernel is controlling which threads are awake or asleep, so obviously it won't put a thread to sleep that is currently spinning trying to get the lock. "
