cbloom rants: 07-09-11 - TLS for Win32

7/09/2011


So as noted in the previous discussion of TLS, there's this annoying problem that the compiler's built-in TLS is broken for DLLs in Win32 (pre-Vista).

The real annoyance as a library writer is that even if you compile a .lib, somebody might want to use your lib in a DLL, and you can't know that in advance. (I guess you could build a separate version of your lib for people who want to put it in a DLL.) Even if you do make a DLL version, it's annoying for the client to have to hook you up to DLL_PROCESS_ATTACH to set up your TLS (if you want to use the MS-recommended way of doing TLS in DLLs). It just doesn't work very well for modular code components. The result is that if you are writing code that is supposed to always work on Win32, you have to do your own TLS system.

(The same is true for Xenon XEXs and maybe PS3 PRXs, though I'm not completely sure about that. I'm not aware of any other platforms that are broken like this, but there probably are some.)

Anyway, so you want TLS but you can't use the compiler's built-in "__thread" mechanism. You can do something like this :


#define TLSVAR_USE_CINIT
//#define TLSVAR_USE_COMPILER_TLS

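// needs <windows.h> for TlsAlloc / TlsFree / TlsGetValue / TlsSetValue ;
// uint32, LoadRelaxed and AtomicCMPX32 are helpers that are not shown in this post

// T has to be castable to void *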
template <typename T>
struct TLSVar
{
public:

    // shared between threads :
    uint32 m_tls_index;
        
    // AllocIndex is thread-safe
    // it does wait-free speculative singleton construction
    static uint32 AllocIndex(volatile uint32 * pSharedIndex)
    {
        uint32 index = LoadRelaxed(pSharedIndex);
        if ( index == TLS_OUT_OF_INDEXES )
        {
            index = TlsAlloc();
            // store my index :
            uint32 oldVal = AtomicCMPX32(pSharedIndex,TLS_OUT_OF_INDEXES,index);
            if ( oldVal != TLS_OUT_OF_INDEXES )
            {
                // already one in there
                TlsFree(index);
                index = oldVal;
            }
        }
        return index;
    }

    #ifdef TLSVAR_USE_CINIT
    TLSVar() : m_tls_index(TLS_OUT_OF_INDEXES)
    {
        AllocIndex(&m_tls_index);
    }
    #endif
    
    T & Ref()
    {
        #ifndef TLSVAR_USE_CINIT
        AllocIndex(&m_tls_index);
        #endif
        
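        // initial value in TLS slot :
        //  this has to be done once per thread
        LPVOID tls = TlsGetValue(m_tls_index);
        if ( tls == NULL )
        {
            // value-initialize so built-in types start at zero,
            //  matching the "= (type)0" of the compiler-TLS variant
            T * pT = new T();
            tls = (LPVOID)pT;
            TlsSetValue(m_tls_index,tls);
        }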
        
        return *((T *) tls);
    }
    
    operator T & ()
    {
        return Ref();
    }
    
    void operator = (const T t)
    {
        Ref() = t;
    }
    
};

#ifdef TLSVAR_USE_COMPILER_TLS

#ifdef _MSC_VER
#define TLS_VAR(type,name)  __declspec(thread) type name = (type)0;
#else
#define TLS_VAR(type,name)  __thread type name = (type)0;
#endif

#else // TLSVAR_USE_COMPILER_TLS

#ifdef TLSVAR_USE_CINIT
#define TLS_VAR(type,name) TLSVar<type> name;
#else
// use static initializer, not cinit :
#define TLS_VAR(type,name) TLSVar<type> name = { TLS_OUT_OF_INDEXES };
#endif

#endif // TLSVAR_USE_COMPILER_TLS

A few notes :

I made it able to work with cinit or without. The cinit version is somewhat preferable. I'm not sure if cinit always works on all platforms with modular code loading, so I made it optional.

AllocIndex uses the preferred modern way of instantiating shared singletons. It is "wait free", which means all threads always make progress in bounded time. In the case of contention there is an unnecessary alloc and free, which is unlikely and usually not a big deal. Whenever an extra alloc/free is not a big deal, this is the best way to do a singleton. If the extra alloc/free is a big deal, then blocking (a lock around the construction) is preferred.
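
For readers without the helper functions used above, here is a rough sketch of the same speculative-construction pattern written with std::atomic. Everything named here is hypothetical: FakeSlotAlloc and FakeSlotFree stand in for TlsAlloc and TlsFree so that it runs on any platform, kNoIndex plays the role of TLS_OUT_OF_INDEXES, and the load uses acquire ordering rather than the relaxed load in the listing.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-ins for TlsAlloc / TlsFree so the sketch runs anywhere.
static const uint32_t kNoIndex = 0xFFFFFFFFu;  // plays the role of TLS_OUT_OF_INDEXES
static std::atomic<uint32_t> g_nextSlot{0};
static std::atomic<int>      g_liveSlots{0};

static uint32_t FakeSlotAlloc()      { g_liveSlots.fetch_add(1); return g_nextSlot.fetch_add(1); }
static void     FakeSlotFree(uint32_t) { g_liveSlots.fetch_sub(1); }

// Speculative construction: allocate first, then try to publish with one CAS.
// No thread ever waits on another; a loser just frees its extra slot.
uint32_t GetOrAllocIndex(std::atomic<uint32_t> & shared)
{
    uint32_t index = shared.load(std::memory_order_acquire);
    if ( index == kNoIndex )
    {
        uint32_t mine = FakeSlotAlloc();
        uint32_t expected = kNoIndex;
        if ( shared.compare_exchange_strong(expected, mine,
                std::memory_order_acq_rel, std::memory_order_acquire) )
        {
            index = mine;        // we won the race, our slot is the shared one
        }
        else
        {
            FakeSlotFree(mine);  // somebody else published first
            index = expected;    // on failure, expected holds the winner's value
        }
    }
    assert( index != kNoIndex );
    return index;
}

// Race several threads on one shared index; returns true if all agree.
bool RaceForIndex(int numThreads)
{
    std::atomic<uint32_t> shared{kNoIndex};
    std::vector<uint32_t> seen(numThreads, kNoIndex);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back([&shared, &seen, i] { seen[i] = GetOrAllocIndex(shared); });
    for (auto & t : threads) t.join();
    for (uint32_t s : seen)
        if ( s != seen[0] ) return false;
    return seen[0] != kNoIndex;
}
```

Calling RaceForIndex(8) once should leave exactly one live slot no matter how many threads lost the race, which is the "unnecessary alloc and free" cost described above.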

Some platforms have a small limit on the number of TLS slots. If you use the compiler __thread mechanism, all your TLS variables get put together in a struct that goes in one TLS slot. If you can't use that, then it's probably best to do the same thing by hand - make a struct that contains everything you want to be thread-local and then just use a single slot for the struct. Unfortunately this is ugly for software engineering as many disparate systems might want to use TLS and they all have to share a globally visible struct def.
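
A minimal sketch of the one-struct-in-one-slot idea, under some assumptions: ThreadBlock and its fields are made up for illustration, the slot is allocated by an explicit init call before any worker threads start (rather than with the AllocIndex trick), and the small SlotAlloc / SlotGet / SlotSet wrappers are hypothetical names that map to the Win32 Tls functions or to pthread keys elsewhere.

```cpp
#include <cassert>
#include <thread>

#ifdef _WIN32
#include <windows.h>
typedef DWORD SlotKey;
static SlotKey SlotAlloc()                  { return TlsAlloc(); }
static void *  SlotGet(SlotKey k)           { return TlsGetValue(k); }
static void    SlotSet(SlotKey k, void * p) { TlsSetValue(k, p); }
#else
#include <pthread.h>
typedef pthread_key_t SlotKey;
static SlotKey SlotAlloc()                  { pthread_key_t k; pthread_key_create(&k, nullptr); return k; }
static void *  SlotGet(SlotKey k)           { return pthread_getspecific(k); }
static void    SlotSet(SlotKey k, void * p) { pthread_setspecific(k, p); }
#endif

// Hypothetical: every system's thread-local data lives in this one struct,
// so the whole program consumes a single TLS slot.
struct ThreadBlock
{
    int    allocCount;
    int    lastError;
    void * scratch;
};

static SlotKey g_blockSlot;

// Call once, before any worker thread touches the block.
void ThreadBlock_Init() { g_blockSlot = SlotAlloc(); }

// Returns the calling thread's block, creating it on first use.
ThreadBlock & ThreadBlock_Get()
{
    void * p = SlotGet(g_blockSlot);
    if ( p == nullptr )
    {
        p = new ThreadBlock();   // zeroed; never freed in this sketch
        SlotSet(g_blockSlot, p);
    }
    return *static_cast<ThreadBlock *>(p);
}

// Demo: main and a worker each get their own copy of the struct.
int DemoThreadBlock()
{
    ThreadBlock_Init();
    ThreadBlock_Get().allocCount = 1;
    int seenByWorker = -1;
    std::thread t([&seenByWorker] {
        seenByWorker = ThreadBlock_Get().allocCount;  // fresh block for this thread
        ThreadBlock_Get().allocCount = 99;            // does not affect main's copy
    });
    t.join();
    assert( seenByWorker == 0 );
    return ThreadBlock_Get().allocCount;
}
```

The downside mentioned above is visible here: anything that wants a thread-local has to add a field to ThreadBlock, so the struct definition ends up shared by otherwise unrelated systems.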

Handling freeing at thread shutdown is an annoyance. The pthreads tls mechanism lets you register a function callback for each tls slot which can do freeing at thread shutdown. I'm sure there's some way to get a thread-shutdown callback in Windows. Personally I prefer to use a model where all my threads live for the lifetime of the app (there are no short-lifetime threads), so I just don't give a shit about cleaning up the TLS, but that may not be acceptable to everyone, so you will have to deal with this.
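
For reference, the pthreads callback mentioned above looks roughly like this. It is only a sketch of the pthreads side, not a Win32 solution; ThreadInt, FreeThreadInt and DemoCleanup are hypothetical names, and the counter exists only so the cleanup can be observed.

```cpp
#include <pthread.h>
#include <atomic>
#include <cassert>
#include <thread>

static std::atomic<int> g_destroyed{0};
static pthread_key_t    g_key;

// pthreads calls this at thread exit, but only for threads whose
// value in the slot is non-null.
static void FreeThreadInt(void * p)
{
    delete static_cast<int *>(p);
    g_destroyed.fetch_add(1);
}

// Per-thread int, allocated lazily on first use.
int & ThreadInt()
{
    void * p = pthread_getspecific(g_key);
    if ( p == nullptr )
    {
        p = new int(0);
        pthread_setspecific(g_key, p);
    }
    return *static_cast<int *>(p);
}

// Runs numThreads short-lived threads and returns how many
// per-thread values were freed by the destructor callback.
int DemoCleanup(int numThreads)
{
    int rc = pthread_key_create(&g_key, FreeThreadInt);
    assert( rc == 0 );
    (void) rc;
    for (int i = 0; i < numThreads; ++i)
    {
        std::thread t([] { ThreadInt() = 7; });
        t.join();   // the callback has run by the time join returns
    }
    return g_destroyed.load();
}
```

Each worker touches the slot once, so the callback fires once per worker; a thread that never touches the slot costs nothing at exit.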

7 comments:

Joerg said...: SUPER *cough* UX probably beats it...
For a low thread count using the thread-ID in a hashmap was fastest.; July 10, 2011 at 12:42 PM
Anthony Williams said...: You could always just use boost::thread_specific_ptr, which handles these issues for you (or steal the code for your own implementation).; July 11, 2011 at 12:17 AM
cbloom said...: "You could always just use boost::thread_specific_ptr, which handles these issues for you (or steal the code for your own implementation)."

Well, I had a look, and it just doesn't look very good.

They appear to use a linear linked list search to find a given key/data pair in their one TLS slot.

Accessing a TLS var should not be orders of magnitude slower than accessing a local var.

If you've bitten the bullet and are using boost::thread it's okay, but as a source of code to grab it's pretty gross.; July 11, 2011 at 8:45 AM
Anthony Williams said...: Yes, boost TLS uses a linked list in order to avoid overuse of TLS slots. If you don't like that aspect, it's easy to change.

Anyway, the specific issues that I was thinking of were the cleanup issues. You've said you don't care, since your threads run until the end of the process, but for those that do care, boost takes care of it for you. The code to invoke callbacks on thread exit is in tss_dll.cpp and tss_pe.cpp.; July 13, 2011 at 11:52 PM
cbloom said...: " Anyway, the specific issues that I was thinking of were the cleanup issues. You've said you don't care, since your threads run until the end of the process, but for those that do care, boost takes care of it for you. The code to invoke callbacks on thread exit is in tss_dll.cpp and tss_pe.cpp. "

Yeah. They use a thread shim to do cleanups. That's okay and very easy to write yourself, but it means it only works for threads that are created by boost::thread.

So I can't just add TLS to any thread using boost, it's the typical boost thing where you have to buy into the whole system for it to work right.

Anyhoo, yeah, people should look at Boost. But I think the vast majority of people in games are not Boost lovers.; July 14, 2011 at 9:54 AM
Anthony Williams said...: "They use a thread shim to do cleanups. .... So I can't just add TLS to any thread using boost"

That's not true. The code in tss_pe.cpp hooks the Win32 thread-exit callbacks, so it is called when ANY thread exits. This is precisely so that you CAN use boost TLS stuff without using boost::thread to manage your threads.; July 15, 2011 at 9:39 AM
cbloom said...: Hmm.. well, I've only glanced at it so maybe you can teach me how it works.

I see some wacky stuff in src\win32\tss_pe.cpp files but it's not at all clear how it works. It seems like maybe it's latching into MS's __declspec(thread) TLS functionality.

It also looks like there is a tss_pe and a tss_dll and you have to choose the right one. That spoils the whole point, which is that you want to be able to build .lib code that works whether it is in a DLL or not. If you know at compile time whether you are in a DLL or not, then the whole problem is completely trivial and I have no idea why they're using such complicated mechanisms (you can just use MS's __declspec(thread) if you know you're in a LIB and avoid this whole mess!).

To be clear, the only reason that anyone would ever want to use this funny TLSVar nonsense is because you can't use the compiler-provided __thread mechanism, which is better in every way than these hand-cooked systems.

I *do* see that they manually call the thread exit callbacks with a shim for boost threads ; in src\win32\thread.cpp
"thread_start_function" is a shim that runs the thread func then does the cleanup, which is why I thought that was the only mechanism.; July 15, 2011 at 10:29 AM
