2/23/2013

02-23-13 - Threading Patterns - Wake Polling

Something I've written about a lot but never given a solid name to.

When a thread is waiting on some condition, your goal should be to only wake it up if that condition is actually true - that is, the thread really gets to run. From worst to best :

1. Wakeup condition polling. This is the worst and is very common. You're essentially just using the thread wakeup to say "hey your condition *might* be set, check it yourself". The suspect code looks something like :


while( ! condition )
{
    Wait(event);
}

These threads can waste a ton of cycles just waking up, checking their condition, then going back to sleep.

One of the common ways to get nasty wake-polling is when you are trying to just wake one thread, but you have to do a broadcast due to the possibility of a missed wakeup (as in the naive semaphore from waitset ).

Of course any usage of cond_var is a wake-poll loop. I really don't like cond_var as an API or a programming pattern. It encourages you to write wakee-side condition checks. Whenever possible, waker-side condition checks are better. (See previous notes on cond vars such as : In general, you should prefer to use the CV to mean "this condition is set" , not "hey wakeup and check your condition").

(ADDENDUM : in fact I dislike cond_var so much I wrote a proposal on an alternative cond_var api ).

Now it's worth breaking this into two sub-categories :

1.A. Wake-polling when it is extremely likely that you get to run immediately.

This is super standard and is not that bad. At root, what's happening here is that under normal conditions, the wakeup means the condition is true and you get to run. The loop is only needed to catch the race where someone stole your wakeup.

For example, the way Linux implements semaphore on futex is a classic wake-poll. The core loop is :


    for(;;)
    {
        if ( try_wait() ) break;

        futex_wait( & sem->value, 0 ); // wait if sem value == 0
    }

If there's no contention, you wake from the wait and get to try_wait (dec the count) and proceed. The only time you have to loop is if someone else raced in and dec'ed the count before you. (see also in that same post a discussion of why you actually *want* that race to happen, for performance reasons).

The reason this is okay is because the futex semaphore only has to do a wake 1 when it signals. If it had to do a broadcast, this would be a bad loop. (and note that the reason it can get away with just a wake 1 is the special nature of the futex wait, which ensures that the single thread signal actually goes to someone who needs it!) (see : cbloom rants 08-01-11 - Double checked wait ).

1.B. Wake-polling when it is unlikely that you get to run.

This is the really bad one.

As I've noted previously ( cbloom rants 07-26-11 - Implementing Event WFMO ) this is a common way for people to implement WFMO (WaitForMultipleObjects). The crap implementation basically looks like this :


while ( any events in array[] not set )
{
    wait on an unset event in array[]
}

What this does is : any time one of the events in the set triggers, it wakes up every waiter that has that event in its array; each waiter checks its array, finds the set still incomplete, and goes back to sleep.

Obviously this is terrible, it causes bad "thread thrashing" - tons of wakeups and immediate sleeps just to get one thread to eventually run.

2. "Direct Handoff" - minimal wakes. This is the ideal; you only wake a thread when you absolutely know it gets to run.

When only a single thread is waiting on the condition, this is pretty easy, because there's no issue of "stolen wakeups". With multiple threads waiting, this can be tricky.

The only way to really robustly ensure that you have direct handoff is by making the wakeup ensure the condition.

At the low level, you want threading primitives that don't give you unnecessary wakeups. eg. we don't like the pthreads cond_var, whose wait is effectively :

    condvar.wait();
    mutex.lock();
i.e. two separate steps, which means you can wake from the condvar and immediately fail to get the mutex and go back to sleep. Prefer a single call :
    condvar.wait_then_lock(mutex);
which only wakes you when you get a cv signal *and* can acquire the mutex.

At the high level, the main thing you should be doing is *waker* side checks.

eg. to do a good WFMO you should be checking for all-events-set on the *waker* side. To do this you must create a proxy event for the set when you enter the wait, register all the events on the proxy, and then you only signal the proxy when they are all set. When one of the events is set, the thread that set it does the checking. That is, the checking is moved to the signaller. The advantage is that the signalling thread is already running.
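
Something like this sketch (my own primitives and naming - an atomic count plus a C++20 binary_semaphore standing in for the proxy event, and the proxy registered on each event rather than the reverse; this is not the implementation from the linked post) :


#include <atomic>
#include <mutex>
#include <semaphore>
#include <vector>

struct WaitAllProxy
{
    std::atomic<int>      pending{0}; // events in the set not yet counted
    std::binary_semaphore wake{0};    // the single wakeup for the waiting thread
};

struct Event
{
    std::mutex                 m;
    bool                       set = false;
    std::vector<WaitAllProxy*> proxies; // proxies registered on this event

    // returns true if the event was already set (the waiter counts it itself)
    bool RegisterOrTest( WaitAllProxy * p )
    {
        std::lock_guard<std::mutex> lock(m);
        if ( set ) return true;
        proxies.push_back(p);
        return false;
    }

    void Signal()
    {
        std::vector<WaitAllProxy*> local;
        {
            std::lock_guard<std::mutex> lock(m);
            set = true;
            local.swap(proxies);
        }
        // waker-side check : the signalling thread does the counting, and only
        // the thread that completes a waiter's set issues the one wakeup
        for ( WaitAllProxy * p : local )
            if ( p->pending.fetch_sub(1) == 1 )
                p->wake.release();
    }
};

void WaitAll( Event ** events, int count )
{
    WaitAllProxy proxy;
    proxy.pending.store(count);
    for ( int i = 0; i < count; i++ )
        if ( events[i]->RegisterOrTest(&proxy) )
            if ( proxy.pending.fetch_sub(1) == 1 )
                proxy.wake.release(); // whole set was already satisfied
    proxy.wake.acquire(); // exactly one wake, no loop, no re-check

}

The waiter never loops : whichever thread does the decrement that empties the set (a signaller, or the waiter itself during registration) issues the wakeup, so the woken thread always gets to run.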

02-23-13 - Threading APIs that would be ideal

If you were writing an OS from scratch right now, what low level threading primitives should you provide? I contend they are rather different than the norm.

1. A low-level keyed event with double-checked wait.

Futex and NT's keyed event are both pretty great, but the ideal low level wait should be double-checked. I believe it should be something like :


typedef HANDLE Waitset;
typedef HANDLE wait_handle;

Waitset CreateWaitset();
void    DestroyWaitset( Waitset ws );

wait_handle Waitset_PrepareWait( Waitset ws , U64 key );

void Waitset_CancelWait( Waitset ws , wait_handle h );
void Waitset_Wait      ( Waitset ws , wait_handle h );

void Waitset_Signal( Waitset ws , U64 key );

Now, key of course could be a pointer, but there's no reason it has to be. This is easily a superset of futex; if you want you could just have one global Waitset object, and key could be an int pointer, and you could check *ptr in between PrepareWait and Wait - that would give you futex. But you can do much more with this.
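
Concretely, the futex-style double-checked usage just described would look something like this (a sketch against the proposed API above; the atomic int, the casts, and the function names are mine) :


#include <atomic>
#include <cstdint>

// waiter : double-checked wait for "*value becomes nonzero"
void wait_for_nonzero( Waitset ws, std::atomic<int> * value )
{
    for(;;)
    {
        if ( value->load(std::memory_order_acquire) != 0 )
            return;

        wait_handle h = Waitset_PrepareWait( ws, (U64)(uintptr_t) value );

        // the double check : if the value changed while we were registering,
        // cancel instead of sleeping, so the wakeup can't be missed
        if ( value->load(std::memory_order_acquire) != 0 )
        {
            Waitset_CancelWait( ws, h );
            return;
        }

        Waitset_Wait( ws, h );
    }
}

// signaller :
void publish_and_signal( Waitset ws, std::atomic<int> * value )
{
    value->store( 1, std::memory_order_release );
    Waitset_Signal( ws, (U64)(uintptr_t) value );
}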

I prefer having a "waitset" object to put the waits on (like KeyedEvent), not just making it global/static (like futex). The advantage is 1. efficiency and 2. multiple meanings for a single "key". It's more efficient because you can have different waitsets for different uses, which makes each one cover fewer waits, which makes all the lookups faster. (that is, rather than 100 global waits pending, maybe you have 10 on 10 different waitsets). The other advantage is that you can reuse the same value for key without it confusing the system. You could have one Waitset where key is a pointer, and another where key is an internal handle number, etc.

2. A proper cond_var with waker-side condition checking.

First of all, a decent cond_var API combines a lot of the disjoint junk in the posix API. It should include the mutex, because that allows for vastly more efficient implementation :


    class condition_var
    {
    public:
        void lock();
        void unlock();
    
        // the below are always called with lock held :

        void unlock_wait_lock();
        
        void signal_unlock();
        void broadcast_unlock();

    private:
        ...
    };

The basic usage of this cv is like :

    cv.lock();

    while( ! condition )
    {
        cv.unlock_wait_lock();
    }

    .. do stuff with condition true ..

    cv.unlock();

A good implementation should do the compound ops (signal_unlock, etc) atomically. But I wouldn't require that because it's too hard.

But that's just background. What you really want is to put the condition check in the API. It should be :


        void wait_lock( [] { wake condition } );

The spec of the API is that "wake condition" is some code that will be run with the mutex locked, and when the function exits you will own the mutex and the condition is true. Then client usage is like :

    cv.wait_lock( condition );

    .. do stuff with condition true ..

    cv.unlock();

which allows for much more efficient implementation. The wake condition of each waiter on the list can be evaluated easily inside signal_unlock(), because that's always called with the mutex held.
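
A sketch of how that can work (my own internals : std::function for the predicate and one binary_semaphore per waiter; only wait_lock and signal_unlock are shown, and a real implementation would want direct mutex handoff rather than the re-lock) :


#include <functional>
#include <list>
#include <mutex>
#include <semaphore>

class condition_var
{
public:
    void lock()   { m_mutex.lock(); }
    void unlock() { m_mutex.unlock(); }

    // returns with the mutex held and the condition true
    void wait_lock( std::function<bool()> condition )
    {
        m_mutex.lock();
        while ( ! condition() )
        {
            Waiter w;
            w.condition = &condition;
            m_waiters.push_back( &w );
            m_mutex.unlock();
            w.wake.acquire(); // sleep until a waker says our condition holds
            m_mutex.lock();   // (direct handoff would hand us the mutex instead)
        }
    }

    // always called with the mutex held
    void signal_unlock()
    {
        // waker-side check : evaluate waiter conditions under the mutex and
        // only wake a waiter whose condition is actually true
        for ( auto it = m_waiters.begin(); it != m_waiters.end(); ++it )
        {
            Waiter * w = *it;
            if ( (*w->condition)() )
            {
                m_waiters.erase( it );
                m_mutex.unlock();
                w->wake.release();
                return;
            }
        }
        m_mutex.unlock(); // nobody is ready to run, so no wasted wakeups
    }

private:
    struct Waiter
    {
        std::function<bool()> * condition = nullptr;
        std::binary_semaphore   wake{0};
    };

    std::mutex         m_mutex;
    std::list<Waiter*> m_waiters;
};

The point is just that the predicate runs on the signalling thread, which already holds the mutex and is already running, so a waiter is only woken when it really gets to proceed.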

02-23-13 - Threading - Reasoning Behind Coroutine Centric Design

cbloom rants 12-21-12 - Coroutine-centric Architecture is a proposed architecture.

Why do I think it should be that way? Let's revisit some points.

1. Main thread should be a worker and all workers should be symmetric. That is, there's only one type of thread - worker threads, and all functions are work items. There are no special-purpose threads.

The purpose of this is to minimize thread switches, and to make waits be immediate runs whenever possible.

Consider the alternative. Say you have a classic "main" thread and a worker thread. Your main thread is running along and then decides it has to Wait() on a work item. It has to sleep the thread pending a notification from the worker thread. The OS has to switch to the worker, run the job, notify, then switch you back.

With fully symmetric threads, there is no actual thread wait there. If the work item is not started, or is in a yield point of a coroutine, you simply pop it and run it immediately. (of course your main() also has to be a coroutine, so that it can be yielded out at that spot to run the work item). Symmetric threads = less thread switching.
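
The core of that wait might look like this sketch (the three-state job and the names are mine; the work queue bookkeeping is omitted, and std::this_thread::yield() is just a placeholder for the real coroutine yield) :


#include <atomic>
#include <functional>
#include <thread>

struct Job
{
    std::function<void()> work;
    std::atomic<int>      state{0}; // 0 = not started, 1 = running, 2 = done
};

void RunJob( Job * j )
{
    j->work();
    j->state.store( 2, std::memory_order_release );
}

void WaitOnJob( Job * j )
{
    // try to claim the job; if nobody has started it, the "wait" is just
    // running it right here on this worker - no thread switch at all
    int expected = 0;
    if ( j->state.compare_exchange_strong( expected, 1 ) )
    {
        RunJob( j );
        return;
    }

    // someone else is running it; in the coroutine system this is a deep
    // yield back to the scheduler, not a blocking OS wait
    while ( j->state.load( std::memory_order_acquire ) != 2 )
        std::this_thread::yield(); // placeholder for the coroutine yield
}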

There are other advantages. One is that you're less affected by the OS starving one of your threads. When your threads are not symmetric, if one is starved (and is the bottleneck) it can ruin your throughput; one crucial job or IO can't run and then all the other threads back up. With symmetric threads, someone else grabs that job and off you go.

Symmetric threads are self-balancing. Any time you decide "we have 2 threads for graphics and 1 for compute" you are assuming you know your load exactly, and you can only be wrong. Symmetric threads maximize the utilization of the cpu. (Note that for cache locality you might want to have a system that *prefers* to keep the same type of job on the same thread, but that's only a soft preference and it will run other jobs if there are none of the right type).

Symmetric threads scale cleanly down to 1. This is a big one that I think is important. Even just for debugging purposes, you want to be able to run your system non-threaded. With asymmetric threads you have to have a custom "non-threaded" pathway, which leads to bugs and means you aren't testing the same threaded pathway. The symmetric thread system scales down to 1 thread using the same code as always - when you wait on a job, if it hasn't been started it's just run immediately.

It's also much easier to have deadlocks in asymmetric thread systems. If an IO job waits on a graphics job, and a graphics job waits on an IO job, you're in a tricky situation; of course you shouldn't deadlock as long as there are no circular dependencies, but if one of those threads is processing in FIFO order you can get in trouble. It's just better to have a system where that issue doesn't even arise.

2. Deep yield.

Obviously if you want to write real software, you can't be returning out to the root level of the coroutine every time you want to yield.

In the full coroutine-centric architecture, all the OS waits (mutex locks, etc) should be coroutine yields. The only way to do that is if they can call yield() internally and it's a full stack-saving deep yield.

Of course you should be able to spawn more coroutines from inside your coroutine, and wait on them (with that wait being a yield). That is, aside from the outer branch-merge, each internal operation should be able to do its own branch-merge, and yield its thread to its sub-items.

3. Everything GC.

This is just the only reasonable way to write code in this system. It gives you a race-free way to ensure that object lifetimes exceed their usage.

The simple string crash from my last post is just so easy to do. The problem is that without GC you inevitably try to be "clever" and "efficient" (really "dumb" and "pointless") about your lifetime management. That is, you'll write things like :


void func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

Wait(h);
}

which is okay, because it waits on the async op inside the lifetime of "name". But of course a week later you change this function to :

Handle func1()
{
char name[256];
.. file name ..

Handle h = StartJob( LoadAndDecompress, name );

...

return h;
}

with the wait done externally, and now it's a crash. Manual lifetime management in heavily-threaded code is just not reasonable.
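
For contrast, a sketch of the same function with shared ownership standing in for the GC (std::shared_ptr here; StartJob taking a lambda is my assumption, not the API in the snippet above) :


#include <memory>
#include <string>

Handle func1()
{
    auto name = std::make_shared<std::string>();
    // .. build the file name into *name ..

    // the lambda shares ownership of "name"; whichever side finishes
    // last (the job or the caller) frees it
    Handle h = StartJob( [name] { LoadAndDecompress( name->c_str() ); } );

    return h; // fine now : "name" outlives the job no matter where the wait happens
}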

The other compelling reason is that you want to be able to have "dangling" coroutines, that is you don't want to have to wait on them and clean them up on the outside, you just fire them off and they clean themselves up when they finish. This requires some kind of ref-counted or GC'ed ownership of all objects.

4. A thread per core.

With all your "threading" as coroutines and all your waits as "yields", you no longer need threads to share the cpu time, so you just make one thread per core and leave it there.

I wanted to note an exception to this - OS waits that cannot be converted to yields, such as blocking IO. In this case you still need to do a true OS Wait that would block a thread. This would stop your entire worker from running, so that's not nice.

The solution is to have a separate pool of threads for running the small set of OS functions that do internal thread waits. That is, you convert :


ReadFile( args )

->

yield RunOnThreadPool( ReadFile, args );

This separate pool of threads is in addition to the one per core (or it could just be all one pool, and you make new threads as needed to ensure that #cores of them are running).
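
A sketch of that wrapping, with std::async standing in for the separate blocking-call pool (the file-reading body and the names are mine) :


#include <cstdio>
#include <future>
#include <string>
#include <vector>

std::future< std::vector<char> > ReadFileAsync( const char * filename )
{
    // run the blocking call on a thread that is allowed to block,
    // so no worker core ever stalls inside the OS wait;
    // note the filename is copied, not captured by pointer
    return std::async( std::launch::async, [ name = std::string(filename) ]
    {
        std::vector<char> data;
        if ( FILE * f = fopen( name.c_str(), "rb" ) )
        {
            fseek( f, 0, SEEK_END );
            long size = ftell( f );
            fseek( f, 0, SEEK_SET );
            if ( size > 0 )
            {
                data.resize( (size_t) size );
                size_t got = fread( data.data(), 1, data.size(), f );
                data.resize( got ); // in case of a short read
            }
            fclose( f );
        }
        return data;
    } );
}

// usage ; .get() here stands in for the coroutine yield :
//   auto fut = ReadFileAsync( "somefile.bin" );
//   ... keep doing other work ...
//   std::vector<char> bytes = fut.get();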

2/18/2013

02-18-13 - The Myth of Future Value

I believe that most people (aka me) grossly overvalue future rewards when weighing the merits of various options.

I've been thinking about this a lot over the last few days, and have come to it simultaneously from several different angles.

For the past month or so I've been going over my finances, reviewing my spending, because I'm not happy with the amount I'm saving, and I'm trying to figure out where the money is leaking. Obviously there are big expenses like cars and vacations, but those I've budgeted for; they're not the leak (*) (**). There's still a large general money leakage that I want to track down. It turns out a lot of it is buying stuff for the house or for productivity, stuff that on its own I can justify, but overall adds up to a big waste.

A lot of that waste is stuff that I tell myself will "pay off someday". Like I need some rope for around the house; hey look it's a much better deal if I buy it in a 500 foot spool. I'll use it eventually so that's the better buy. Or, I need to set a bolt in concrete; sure a hammer drill is expensive, but I'll use it the rest of my life, so it will be a good value some day (better than renting one for this one job). etc. Lots of stuff where the idea is that in the long term it will be a good value.

Now I certainly haven't hit the "long term" yet, but I can already see the flaw in that logic. There are lots of reasons why that imagined long term value might never come. I might never wind up using the stuff. It might get damaged over time from sitting, or flood or who knows what. I also essentially pay a tax to store it, having stuff is not free. I pay a tax on it any time I move. Maybe I won't want to do DIY in the future and will just hire out the jobs and so won't use it. There are a lot of costs and uncertainty about this future value which make it much less valuable than it naively appears.

Perhaps computer stuff is an even easier example; like I sort of need a USB hub; I could live without it and just unplug stuff to make room depending on what device I want to use at the moment. You could easily convince yourself that the value is high because "even if I don't really need it now I'll use it someday". But of course there's any number of reasons why you might not use it some day.

(* = aside : expensive cars actually aren't that expensive; if you're careful about how you buy and sell, and look for cars that are on a pretty flat part of the depreciation curve, you can get a "$100k car" that actually only costs $5k a year. That's not really a big expense in the scheme of things. However that also doesn't mean it's free; the big cost is the time spent buying and selling; if you actually want it to be low cost you have to spend a lot of time on the transaction to get good value, and for people like me that's excruciatingly painful; for people who like wheeling-and-dealing, they can do pretty well, getting almost free stuff that they are just holding temporarily between sales)

(** = more aside, and actually there is a spending leak that I have that's associated to cars and vacations; I, like most people (though perhaps less than average), fall prey to the sunk cost fallacy. The sunk cost fallacy is the idea that because you've spent a bunch on something, you should stick with it and spend some more. Like I've spent this much to go on vacation, I shouldn't cheap out on the dining or whatever. Or I've got an expensive car, I should buy the expensive tires. But that of course is not true. Each decision should be evaluated independently for its value; the fact that you have a large sunk cost only matters in that it changes your current situation. You don't keep chasing your flush just because you've already called some big bets (though obviously your past calls do affect the pot size which affects your current decision)).

Of course home improvements are a classic case of false future value. I'm not foolish enough to think I'll get any resale value benefit, but I do fall prey to thinking "I'll enjoy this for many years" when in fact I might not.

I was thinking about buying a really good mattress that's supposed to last 30 years vs one that will only last 5. In theory the long-life one is a much better deal, but there are any number of reasons why that might not be the case. It might not last like it's supposed to; you might pee and poo on it; you might want a different size mattress. By making an "investment" what you've done is commit yourself to something, you've removed flexibility, which is a cost.

Of course if you ever decide you want to travel the world and live in apartments again, all the buying of stuff is a big liability.

Getting away from just "accumulating stuff" now :

I've been thinking lately about my career arc. All through my young-adulthood I was carefully building up my value as a software development employee. I was improving my skills, improving my profile, networking, all that stuff, going up the career ladder. During that time I was not getting paid particularly well. I took jobs based on them being good opportunities for my larger career, not for their immediate financial reward.

The problem is that the big payoff never came. When Oddworld went under I was at the point where I could have moved on to CTO-level jobs at major studios, but I decided I didn't want to do that. The stress was ruining my life (and various other things that I've blogged about back then). The point is that this "future value" I had been building suddenly became zero. If you actually want to redeem that future value, you are locking yourself in to a path, which is a major cost you are paying, giving up flexibility in your life. And in careers there are so many factors outside your control; perhaps your specialty will become less prominent in a few years. Lots of people have done things like getting a JD only to find the law job market has dried up by the time they graduate.

Saving money in general is questionable now. The governments of the world have demonstrated that they don't care about the integrity of the world financial systems, so socking money away for the future has immense risk associated with it. (I don't put much credence in the complete currency collapse alarmists, but I do believe that a long period of negative inflation-adjusted returns is very likely). In the old days we glorified the good salaryman, who worked hard and saved some money, putting the joy of today aside to build a life for themselves and their family tomorrow.

Of course we can relate this all to poker, in old skool cbloom-rants style.

One of the first big realizations I had on my own as I was getting better and moving beyond book TAG play is that implied odds are massively overvalued by most players. "implied odds" is the term used for the imagined future value that you will get if you hit a big hand. Like if I call with a flush draw, it might be a bad value based on the immediate odds, but if I hit I'll make some more money, which makes the call worth it. The problem is that there are a wide variety of reasons why you might not get paid off even if your flush comes (scare cards, or your opponent never had a strong calling hand to begin with). Or your flush might come and he might have a better flush (negative implied odds). If you realistically weight all those undesirable outcomes, the result is that the true effect of implied odds is very small. eg. on the turn you have a 16% chance to improve, you can call a bet for zero EV if the bet is about 20% of pot size. Adding implied odds in only helps a little; you can only call a bet that's maybe 25% of pot size; really not much more. Certainly not the 30-35% that people talk themselves into believing is correct. (and of course in no limit holdem you have to adjust for position; out of position you should consider your implied odds to be zero, chasing a draw out of position is so very bad). What I'm getting at is that the imagined future value of your current investment is far lower than you imagine.

(sort of not related but "implied odds" is also a good example of the "rationalization trap". Whenever a complicated logical exercise justifies behavior that your naughty irresponsible side secretly wants to do (like chasing draws), you should be very skeptical of that logic. Whenever you read that "a little red wine is healthy" you should be very skeptical. Whenever the result of a "study" is exactly what people want to hear, beware).

This isn't really related to the "future value" mistake, but I've been mulling another spending fallacy, which is the "value of an hour" fallacy. Sometimes I'll do something like buy a tool or hire a helper because it only costs $50 and saves me an hour of work; my hour is worth more than $50, so that's a good deal, right? I'm not so sure. I feel like that line of reasoning is just a way of rationalizing more spending, but I haven't quite found the flaw in it yet.

02-18-13 - Don't write spaghetti

It never works out. I usually even warn myself about it (by writing comments to myself), but it still catches me out. I also usually give myself a todo item like "hmm this smells funny revisit it", but of course the todos just pile up in a never-ending heap, and little old ones like that get buried.


void DoLZDecompress(const char *filename,...)
{
    struct CommandInfo i;
    i.data = (void *)filename;
    // warning : passing string pointer (not copying) to another thread, make sure it's const / sticks around!
    StartJob( &i );
}

Yup, that's a crash.


void OodleIOQ_SetKickImmediate( bool kick );
/* kick state is global ; hmm should really be per-thread ; makes it a race
*/

Yup, that's a problem, which leads to the later deadlock :

void Oodle_Wait( Handle h )
{
    // @@ ? can this handle depend on un-kicked items, and hence never complete ?
    //  I used to check for that with normal deps but it's hard now with the "permits"
    ...
}

Coding crime doesn't pay. Spaghetti always gets you in the end, with its buggy staining sauce.

Whenever I have one of those "hmm this smells funny, I'm worried about the robustness of this" moments, yep, it's a problem.

One of my mortal enemies are the "don't worry about it, it'll be fine" people. No it will fucking not be fine. You know what will happen? It'll be a nasty god damn race bug, which I will wind up fixing while the "don't worry about it, it'll be fine" guy is watching lolcatz or browsing facebook.

2/16/2013

02-16-13 - The Reflection Visitor Pattern

Recent blog post by Maciej ( Practical C++ RTTI for games ) set me to thinking about the old "Reflect" visitor pattern.

"Reflect" is in my opinion clearly the best way to do member-enumeration in C++. And yet almost nobody uses it. A quick reminder : the reflection visitor pattern is that every class provides a member function named Reflect which takes a templated functor visitor and applies that visitor to all its members; something like :


class Thingy
{
type1 m_x;
type2 m_y;

template <typename functor>
void Reflect( functor visit )
{
    // (for all members)
    visit(m_x);
    visit(m_y);
}

};

With Reflect you can efficiently generate text IO, binary IO, tweak variable GUIs, etc.

(actually instead of directly calling "visit" you probably want to use a macro like #define VISIT(x) visit(x,#x))
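
For example, a sketch of that macro together with a text-writing visitor that uses the member names (WriteToText and the details are mine, not from the original posts) :


#include <cstdio>

#define VISIT(x) visit(x,#x)

struct WriteToText
{
    FILE * out;

    // explicit overloads for the basic types :
    void operator () ( int v    , const char * name ) { fprintf( out, "%s = %d\n", name, v ); }
    void operator () ( double v , const char * name ) { fprintf( out, "%s = %f\n", name, v ); }

    // anything else is assumed to be a class with a Reflect :
    template <typename visiting>
    void operator () ( visiting & v, const char * name )
    {
        fprintf( out, "%s :\n", name );
        v.Reflect( *this );
    }
};

class Thingy
{
    int    m_x = 0;
    double m_y = 0;

public:
    template <typename functor>
    void Reflect( functor visit )
    {
        VISIT(m_x);
        VISIT(m_y);
    }
};

// usage :
//   Thingy t;
//   WriteToText w = { stdout };
//   t.Reflect( w );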

A typical visitor is something like a "ReadFromText" functor. You specialize ReadFromText for the basic types (int, float), and for any type that doesn't have a specialization, you assume it's a class and call Reflect on it. That is, the fallback specialization for every visitor should be :


struct ReadFromText
{
    template <typename visiting>
    void operator () ( visiting & v )
    {
        v.Reflect( *this );
    }
};

The standard alternative is to use some macros to mark up your variables and create a walkable set of extra data on the side. That is much worse in many ways, I contend. You have to maintain a whole type ID system, you have to have virtuals for each type of class IO (note that the Reflect pattern uses no virtuals). The Reflect method lets you use the compiler to create specializations, and get decent error messages when you try to use new visitors or new types that don't have the correct handlers.

Perhaps the best thing about the Reflect system is that it's code, not data. That means you can add arbitrary special case code directly where it's needed, rather than trying to make the macro-cvar system handle everything.

Of course you can go farther and auto-generate your Reflect function, but in my experience manual maintenance is really not a bad problem. See previous notes :

cbloom rants 04-11-07 - 1 - Reflection
cbloom rants 03-13-09 - Automatic Prefs
cbloom rants 05-05-09 - AutoReflect

Now, despite being pro-Reflect I thought I would look at some of the drawbacks.

1. Everything in headers. This is the standard C++ problem. If you truly want to be able to Reflect any class with any visitor, everything has to be in headers. That's annoying enough that in practice in a large code base you probably want to restrict to just a few types of visitor (perhaps just BinIO,TextIO), and provide non-template accessors for those.

This is a transformation that the compiler could do for you if C++ was actually well designed and friendly to programmers (grumble grumble). That is, we have something like

template <typename functor>
void Reflect( functor visit );
but we don't want to eat all that pain, so we tell the compiler which types can actually ever visit us :
void Reflect( TextIO & visit );
void Reflect( BinIO & visit );
and then you can put all the details in the body. Since C++ won't do it for you, you have to do this by hand, and it's annoying boiler-plate, but could be made easier with macros or autogen.

2. No virtual templates in C++. To call the derived-class implementation of Reflect you need to get down there in some ugly way. If you are specializing to just a few possible visitors (as above), then you can just make those virtual functions and it's no problem. Otherwise you need a derived-class dispatcher (see cblib and previous discussions).
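
Here's a sketch of that boiler-plate for a fixed visitor set, which deals with both #1 and #2 (TextIO/BinIO are stand-ins for whatever concrete visitors you settle on; the macro is mine) :


#include <cstdio>

struct TextIO { void operator () ( int & v ) { printf( "%d\n", v ); } };
struct BinIO  { void operator () ( int & v ) { fwrite( &v, sizeof(v), 1, stdout ); } };

class Reflectable
{
public:
    virtual ~Reflectable() { }
    // the only visitors we ever allow, so these can be virtual :
    virtual void Reflect( TextIO & visit ) = 0;
    virtual void Reflect( BinIO  & visit ) = 0;
};

// stamps out the virtual thunks that forward to the real templated version;
// these (and the body of ReflectT) can live in the .cpp
#define IMPLEMENT_REFLECT(classname) \
    void classname::Reflect( TextIO & visit ) { ReflectT(visit); } \
    void classname::Reflect( BinIO  & visit ) { ReflectT(visit); }

class Thingy : public Reflectable
{
    int m_x = 0;
public:
    virtual void Reflect( TextIO & visit ) override;
    virtual void Reflect( BinIO  & visit ) override;

    template <typename functor>
    void ReflectT( functor & visit )
    {
        visit(m_x);
    }
};

IMPLEMENT_REFLECT(Thingy)

You lose open-ended visitors, but you get virtual dispatch to the derived Reflect and you can move the bodies out of the header.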

3. Versioning. First of all, versioning in this system is not really any better or worse than versioning in any other system. I've always found automatic versioning systems to be more trouble than they're worth. The fundamental issue is that you want to be able to incrementally do the version transition (you should still be able to load old versions during development), so the objects need to know how to load old versions and convert them to new versions. So you wind up having to write custom code to adapt the old variables to the new, stuff like :


if ( version == 1 )
{
    // used to have member m_angle
    double m_angle;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
else
{
    visitor(m_angleCos);
}

Now, you can of course do this without explicit version numbers, which is my preference for simple changes. eg. when I have some text prefs and decide I want to remove some values and add new ones, I can just leave code in to handle both ways for a while :

{

#ifndef FINAL
if ( visitor.IsRead() )
{
    double m_angle = 0;
    visitor(m_angle);
    m_angleCos = cos(m_angle);
}
#endif

visitor(m_angleCos);

}

where I'm using the assumption that my IO visitor is a NOP on variables that aren't in the stream. (eg. when loading an old stream, m_angleCos won't be found and so the value from m_angle will be loaded, and when loading a new stream the initial filling from m_angle will be replaced by the later load from m_angleCos).

Anyway, the need for conversions like this has always put me off automatic versioning. But that also means that you can't use the auto-gen'ed reflection. I suspect that in large real-world code, you would wind up doing lots of little special case hacks which would prevent use of the simple auto-gen'ed reflection.

4. Note that macro-markup and Reflect() both could provide extra data, such as min & max value ranges, version numbers, etc. So that's not a reason to prefer one or the other.

5. Reflect() can be abused to call the visitor on values that are on the stack or otherwise not actually data members. Mostly that's a big advantage, it lets you do conversions, and also serialize in a more human-friendly format (for text or tweakers) (eg. you might store a quaternion, but expose it to tweak/text prefs as euler angles) (*).

But, in theory with a macro-markup cvar method, you could use that for layout info of your objects, which would allow you to do more efficient binary IO (eg. by identifying big blocks of data that can be read in binary without any conversions).

(* = whenever you expose a converted version, you should also store the original form in binary so that write-then-read is a guaranteed nop ; this is of course true even for just floating point numbers that aren't printed to all their places, which is something I've talked about before).

I think this potential advantage of the cvar method is not a real advantage. Doing super-efficient binary IO should be as close to this :


void * data = Load( one file );
GameWorld * world = (GameWorld *) data;

as possible. That's going to require a whole different pathway for IO that's separate from the cvar/reflect pathway, so there's no need to consider that as part of the pro/con.

6. The End. I've never used the Reflect() pattern in the real world of a large production codebase, so I don't know how it would really fare. I'd like to try it.
