
08-01-11 - A game threading model

Some random ideas.

There is no "main" thread at all, just a lot of jobs. (There is a "main job" of sorts, which runs once per frame and kicks off the other jobs needed to complete that frame.)

Run 1 worker thread per core; all workers just run "jobs" and are completely interchangeable. This is a big advantage for many reasons; for example, if one worker gets swapped out (or some outside process takes over that CPU), the other workers just take over for it, so there is never a stall on a specific thread that happens to be swapped out. You also don't have to switch threads just to run some job, you can run it directly on yourself. (caveat : one issue is the lost worker problem, which we have mentioned before and which needs more attention).
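
In code, a worker pool like that can be as simple as a shared job queue and N identical loops. A very rough sketch (JobQueue, Job and the rest are made-up names; a real system would want lock-free queues, work stealing, etc.) :

    // sketch : one shared queue of jobs, one identical worker per core
    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>

    using Job = std::function<void()>;

    class JobQueue
    {
    public:
        void push(Job j)
        {
            { std::lock_guard<std::mutex> lk(m_mutex); m_jobs.push_back(std::move(j)); }
            m_cv.notify_one();
        }
        Job pop() // blocks until a job is available
        {
            std::unique_lock<std::mutex> lk(m_mutex);
            m_cv.wait(lk, [this]{ return !m_jobs.empty(); });
            Job j = std::move(m_jobs.front());
            m_jobs.pop_front();
            return j;
        }
    private:
        std::mutex m_mutex;
        std::condition_variable m_cv;
        std::deque<Job> m_jobs;
    };

    void worker_loop(JobQueue & q)
    {
        for(;;)
            q.pop()();   // any worker runs any job; they're all interchangeable
    }

    void start_workers(JobQueue & q)
    {
        unsigned n = std::thread::hardware_concurrency(); // one worker per core
        for(unsigned i=0;i<n;i++)
            std::thread(worker_loop, std::ref(q)).detach();
    }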

You also need 1 thread per external device that can stall (eg. disk IO, GPU IO). If the APIs to these devices were really designed well for threading this would not be necessary - we need a thread per device simply to wrap the bad APIs and present a clean one to the workers. What makes a clean API? All device IO should just be enqueued immediately, returning a handle that you can query for results or completion. Unfortunately, real world device IO calls can stall the calling thread for a long time in unpredictable ways, so they are not truly async on almost any platform. These threads should be high priority, do almost no CPU work, and basically just act like interrupts.
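
For example, a disk IO wrapper thread might look something like this (just a sketch; blocking_disk_read stands in for whatever blocking call the platform actually gives you, and std::future plays the role of the completion handle) :

    // sketch : a disk IO thread that wraps a blocking call behind an async API
    #include <condition_variable>
    #include <future>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    struct IoRequest
    {
        std::string path;
        std::promise<std::vector<char>> result; // the "handle" the caller queries
    };

    class DiskIoThread
    {
    public:
        DiskIoThread() : m_thread(&DiskIoThread::run, this) { }

        // never blocks the caller : enqueue and hand back a future immediately
        std::future<std::vector<char>> read(std::string path)
        {
            IoRequest req { std::move(path), {} };
            std::future<std::vector<char>> fut = req.result.get_future();
            { std::lock_guard<std::mutex> lk(m_mutex); m_queue.push(std::move(req)); }
            m_cv.notify_one();
            return fut;
        }

    private:
        void run()
        {
            for(;;)
            {
                std::unique_lock<std::mutex> lk(m_mutex);
                m_cv.wait(lk, [this]{ return !m_queue.empty(); });
                IoRequest req = std::move(m_queue.front());
                m_queue.pop();
                lk.unlock();
                // only this thread ever eats the unpredictable stall :
                req.result.set_value(blocking_disk_read(req.path));
            }
        }

        // stand-in for the real blocking platform call :
        static std::vector<char> blocking_disk_read(const std::string & path)
        {
            (void)path;
            return std::vector<char>();
        }

        std::mutex m_mutex;
        std::condition_variable m_cv;
        std::queue<IoRequest> m_queue;
        std::thread m_thread; // last member, so the queue exists before run() starts
    };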

A big issue is how you manage locking game objects. I think the simplest thing conceptually is to do the locking at "game object" granularity; that may not be ideal for performance, but it's the easiest way for people to get it right.

You clearly want some kind of reader/writer lock because most objects are read many more times than they are written. In the ideal situation, each object only updates itself (it may read other objects but only writes itself), and you have full parallelism. That's not always possible; you have to handle cross-object updates and loops, eg. A writes A and also writes B, while B writes B and also writes A - the case that can cause deadlock in a naive system.

So, all game objects are referenced through a weak-reference opaque handle. To read one you do something like :

    const Object * rdlock(ObjectHandle h)
and then rely on C's const system to try to ensure that people aren't writing to objects they only have read-locked (yes, I know const is not ideal, but if you make it a part of your system and enforce it through coding convention I think this is probably okay).

In the implementation, rdlock internally increments a ref on that copy of the object, so that the version I'm reading sticks around even if a new version is swapped in by wrlock.

There are various ways to implement write-lock. In all cases I make wrlock take a local copy of the object and return you the pointer to that. That way rdlocks can continue without blocking, they just get the old state. (I assume it's okay for reads to get one-frame-old data) (see note *). wrunlock always just exchanges the local object copy into the table. rdlocks that were already in progress still hold a ref to the old data, but subsequent rdlocks and wrlocks will get the new data.
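
Something like this, as a very rough sketch - here shared_ptr plays the role of the ref count that keeps old versions alive for readers, the table mutex only guards the handle lookup, and all the names are made up (this also ignores the exclusive-lock / revision-number variants discussed below) :

    // sketch : handle -> current version; readers pin old versions via shared_ptr
    #include <memory>
    #include <mutex>
    #include <unordered_map>

    struct Object { int hp; /* ... mutable game object state ... */ };
    typedef int ObjectHandle;

    class ObjectTable
    {
    public:
        // readers get a const snapshot; it stays alive even if a writer
        // commits a new version while they're still reading
        std::shared_ptr<const Object> rdlock(ObjectHandle h)
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            return m_objects[h]; // assumes h is a valid handle
        }
        // writers get a private copy; readers are never blocked
        std::shared_ptr<Object> wrlock(ObjectHandle h)
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            return std::make_shared<Object>(*m_objects[h]);
        }
        // commit : swap the private copy in as the current version
        void wrunlock(ObjectHandle h, std::shared_ptr<Object> copy)
        {
            std::lock_guard<std::mutex> lk(m_mutex);
            m_objects[h] = std::move(copy);
        }
    private:
        std::mutex m_mutex; // guards the table lookup only, not the objects
        std::unordered_map< ObjectHandle, std::shared_ptr<const Object> > m_objects;
    };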

One idea is like this : basically semi-transactional. You want to build up a transaction and then commit it. Game object update looks something like this :

    Transaction t;
    vector<ObjectHandle> objects_needed;
    objects_needed = self; 
    for(;;)
    {
        wrlock on all objects_needed;

        .. do your update code ..
        .. update code might find it needs to write another object, then do :

        add new_object to objects_needed
        if ( ! try_wrlock( new_object ) )
            continue; // aborts the current update and will restart with new_object in the objects_needed set

        wrunlock all objects locked
        if ( unlocks committed )
            break; // update done
    }

(in an actual C++ implementation the "continue" should be a "throw", and the for(;;) should be a try/catch, because the failed lock could happen down inside some other function; also the throw could tell you which lock caused the exception).
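
Roughly like this - just a sketch of the shape of it; wrlock_all, try_wrlock, unlock_all, abort_locks and commit are declarations only, stand-ins for the real lock system :

    #include <algorithm>
    #include <stdexcept>
    #include <vector>

    typedef int ObjectHandle;

    struct LockFailed : public std::runtime_error
    {
        ObjectHandle handle; // the lock that caused the abort
        explicit LockFailed(ObjectHandle h) : std::runtime_error("lock failed"), handle(h) { }
    };

    struct Transaction { /* deferred side effects, run only on a successful commit */ };

    // stand-ins for the real lock system (declarations only in this sketch) :
    void wrlock_all(const std::vector<ObjectHandle> & objs);  // takes locks in a standardized order
    bool try_wrlock(ObjectHandle h);
    void unlock_all(const std::vector<ObjectHandle> & objs);  // the commit
    void abort_locks(const std::vector<ObjectHandle> & objs); // drop any locks taken
    void commit(Transaction & t);

    void update_object(ObjectHandle self)
    {
        Transaction t;
        std::vector<ObjectHandle> objects_needed;
        objects_needed.push_back(self);

        for(;;)
        {
            try
            {
                wrlock_all(objects_needed);

                // ... update code ...
                // if it finds it needs to write another object it does :
                //    if ( ! try_wrlock(other) ) throw LockFailed(other);

                unlock_all(objects_needed);
                commit(t); // only now is it safe to run the deferred work
                break;     // update done
            }
            catch(const LockFailed & e)
            {
                abort_locks(objects_needed);
                // remember the new object so the retry takes it in order :
                if ( std::find(objects_needed.begin(), objects_needed.end(), e.handle) == objects_needed.end() )
                    objects_needed.push_back(e.handle);
            }
        }
    }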

There are two variants here that I believe both work; I'm not sure what the tradeoffs are :

1. More mutex like. wrlock is exclusive, only one thread can lock an object at a time. wrunlock at the end of the update always proceeds unconditionally - if you got the locks you know you can just unlock them all, no problem. The issue is deadlock from different lock orders; we handle that with the try_lock - on a failed lock we abort all the locks, go back to the start of the update, and retake the locks in a standardized order.

2. More transaction like. wrlock always proceeds without blocking, multiple threads can hold wrlock at the same time. When you wrunlock you check to see that all the objects have the same revision number as when you did the wrlock, and if not then it means some other commit has come in while you were running, so you abort the unlock and retry. So there's no abort/retry at lock time, it's now at unlock time.

In this simplistic approach I believe that #1 is always better. However, #2 could be better if it checked whether the object was actually changed (if it's a common case to take a wrlock because you thought you needed it, but then not actually modify the object).
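
For reference, the commit-time check in #2 is just optimistic concurrency on a revision number; a minimal sketch (assuming each table slot carries a revision counter; names made up) :

    // sketch : per-slot revision number checked at commit time
    #include <memory>
    #include <mutex>

    struct Object { int hp; };

    struct Slot
    {
        std::shared_ptr<const Object> current;
        unsigned revision;
    };

    struct WriteTicket
    {
        std::shared_ptr<Object> copy; // private copy being edited
        unsigned revision_at_lock;    // revision we started from
    };

    std::mutex g_table_mutex;

    WriteTicket wrlock(Slot & s)
    {
        std::lock_guard<std::mutex> lk(g_table_mutex);
        WriteTicket t = { std::make_shared<Object>(*s.current), s.revision };
        return t;
    }

    // returns false if some other commit landed first; caller aborts & retries
    bool wrunlock(Slot & s, WriteTicket & t)
    {
        std::lock_guard<std::mutex> lk(g_table_mutex);
        if ( s.revision != t.revision_at_lock )
            return false; // object changed under us
        s.current = std::move(t.copy);
        s.revision++;
        return true;
    }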

Note that in both cases it helps to separate a game object's mutable portion from its "definition", ie. the things about it that will never change (maybe its mesh, some AI attributes, etc.) should be held off to the side somehow and not participate in the wrlock mechanism. This is easy to do if you're willing to accept another pointer chase, harder to do if you want it to just be different portions of the same contiguous memory block.
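
The pointer-chase version is trivial; a sketch (names made up) :

    struct GameObjectDef   // never changes after load : no lock needed
    {
        // mesh, AI attributes, etc.
        float max_speed;
    };

    struct GameObjectState // the only part that goes through the wrlock mechanism
    {
        float x, y;
        const GameObjectDef * def; // extra pointer chase to the shared, read-only def
    };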

Another issue with this is if the game object update needs to fire off things that are not strictly in the game object transaction system. For example, say it wants to start a Job to do some path finding or something. You can't fire that right away because the transaction might get aborted. So instead you put it in the "Transaction t" thing to delay it until the end of your update, and only if your unlocks succeed do the jobs and such get run.
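
The Transaction can just buffer up closures and run them on a successful commit; a sketch (start_pathfind_job is hypothetical) :

    #include <functional>
    #include <vector>

    struct Transaction
    {
        std::vector< std::function<void()> > deferred;

        void defer(std::function<void()> f) { deferred.push_back(std::move(f)); }

        // called only after all the wrunlocks commit; never on an aborted attempt
        void commit() { for(size_t i=0;i<deferred.size();i++) deferred[i](); deferred.clear(); }
    };

    // inside a game object update :
    //   t.defer( []{ start_pathfind_job(/*...*/); } ); // runs only if the update commits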

(* = I believe it's okay to read one-frame-old data. Note that in a normal sequential game object update loop, where you just do :

    for each object
        object->update();

each object is reading a mix of old and new data: if it reads an item earlier in the list than itself, it reads new data; if it reads an item after itself, it reads old data. Thus whether it gets old or new data is a "race" anyway, and your game must be okay with that. Any time you absolutely must read the most recent data, you can always do a wrlock instead of a rdlock.

You can also address this in the normal way we do in games, which is to separate objects into a few groups and update them in chunks like "phase 1", then "phase 2", etc.; objects within the same phase can't rely on their temporal order, but objects in a later phase do know that they see the latest version of the earlier phase. This is the standard way to make sure you don't have one-frame-latency issues. (A small sketch of this follows the note.)

*).
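
In job terms the phase idea is just a barrier between groups of updates; a sketch (kick_job and wait_all are hypothetical job-system hooks) :

    #include <functional>
    #include <vector>

    struct GameObject { void update(); };

    // hypothetical job-system hooks :
    void kick_job(std::function<void()> job); // run on some worker
    void wait_all();                          // barrier : wait for all kicked jobs

    const int kNumPhases = 3;
    std::vector<GameObject *> objects_in_phase[kNumPhases];

    void run_frame()
    {
        for(int phase=0; phase<kNumPhases; phase++)
        {
            // updates within a phase run in parallel and may see a mix of
            // old/new data from their own phase ...
            for(size_t i=0; i<objects_in_phase[phase].size(); i++)
            {
                GameObject * go = objects_in_phase[phase][i];
                kick_job( [go]{ go->update(); } );
            }
            // ... but the barrier guarantees later phases see committed results
            wait_all();
        }
    }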

The big issue with all this is how to ensure that you are writing correct code. The rules are :

1. rdlock returns a const * ; never cast away const

2. game object updates must only mutate data in game objects - they must not mutate global state or anything outside of the limited transaction system. This is hard to enforce; one way might be to make it absolutely clear with a function naming convention which functions are okay to call from inside object updates and which are not.

For checking this, you could set a TLS flag like "in_go_update" when you are in the for {} loop; then functions that you know are not safe in the GO loop can just do ASSERT( ! in_go_update ); which provides a nice bit of safety. (A sketch of this is below, after the rules.)

3. anything you want to do in a game object update which is not just mutating some GO variables needs to be put into the Transaction buffer so it can be delayed until the commit goes through. Delayed transaction stuff cannot fail; it doesn't get to participate in the retry/abort, so it must not require multiple mutexes that could deadlock. eg. it should pretty much always just be Job creations or destructions that are just pushes/pops from queues.
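
The TLS guard from rule 2 is tiny; a sketch (assumes C++11 thread_local; names made up) :

    #include <cassert>

    thread_local bool in_go_update = false;

    struct GoUpdateScope // set the flag for the duration of an update
    {
        GoUpdateScope()  { in_go_update = true;  }
        ~GoUpdateScope() { in_go_update = false; }
    };

    void some_function_not_safe_in_go_update()
    {
        assert( ! in_go_update ); // catches calls from inside a GO update
        // ... mutate global state, take other locks, etc. ...
    }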

Another issue that I haven't touched on is the issue of dependencies. A GO update could be dependent on another GO or on a Job completion. You could use the freedom of the scheduling order to reschedule GOs whose dependencies aren't done for later in the tick, rather than stalling.
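
eg. something like this, purely a sketch (dependencies_ready and run_update are hypothetical), where an object whose dependencies aren't ready just goes back on the queue for later in the tick :

    #include <deque>

    struct GoUpdateItem { int handle; };

    bool dependencies_ready(const GoUpdateItem & item); // eg. checks a Job's completion flag
    void run_update(const GoUpdateItem & item);

    void update_tick(std::deque<GoUpdateItem> & pending)
    {
        while( ! pending.empty() )
        {
            GoUpdateItem item = pending.front();
            pending.pop_front();
            if ( ! dependencies_ready(item) )
                pending.push_back(item); // retry later in the tick rather than stalling
            else
                run_update(item);
        }
        // (a real version would run other jobs or yield between retries
        //  rather than spinning when nothing is ready)
    }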
