**Rant on New Arithmetic Coders**

Can we fucking stop calling arithmetic coding "range coding" !? It's fucking arithmetic coding. "Range coding" differs from the ancient CACM arithmetic coder in exactly two ways :

1. It uses a "virtual queue" for the carry propagation, exactly as I described in my VirtQ paper . (warning : ancient!)

2. It introduces an approximation to the range update which lets it use more bits by changing the order of multiplication :

CACM : range = range_high - range_low + 1; range_high = range_low + (symhigh * range) / symtot - 1; range_low += (symlow * range) / symtot;

"range coder" : range = range_high - range_low; range_high = range_low + symhigh * (range / symtot); range_low += symlow * (range / symtot);

This small rearrangement of the math does in fact make a huge difference, but dude look, it's literally just moving where you put the parentheses.

Of course it means you save a divide, and it means range can go up to 31 bits since you never multiply by it. Getting range up to 31 bits means you can afford to shuffle out whole bytes at a time. In CACM the multiplicataion has to fit in 32 bits so your range can only be 16 bits so outputting bytes is a no go; range stays between 15 and 16 bits all the time in CACM while a "range coder" can go down to 23 bits; when it's down at 23 bits you only have 9 bits of coding precision because of the approximation of doing the divide before the multiply.

I'm sick of all these web pages that are like "LZMA uses a statistical algorithm called range coding". No, it uses a fucking entropy coder called arithmetic coding.

Okay, this was just an angry rant, but it reminds me of something interesting. I recently stumbled on some new arithmetic coders and they blew my mind for a minute until I figured out how they work. (BTW For reference, the canonical "range coder" is by Michael Schindler )

(BTW #2 it's sort of curious to me historically that I missed the range coder; I was so close to inventing it myself, but somehow got right next to it and didn't see it. All the pieces of the range coder were thought of previously by various people, including the original CACM guys, such as trying to use 32 bits, shifting out bytes, approximating the division, and the virtual queue, but on their own each of those ideas doesn't quite work or make a big difference, nobody put those ideas together and realized that together they make this "range coder" thing that's a big improvement).

The new arithmetic coders I've discovered are the fpaq0p binary range coder and "rc" the "russian range coder". FPAQ0p is by Ilia Muraviev, the guy from "encode.ru" who's done lots of good stuff; you can find it on Matt Mahoney's PAQ page . The "russian range coder" used to have a web page but it appears to have disappeared. I found the code in MRP (Minimum Rate Predictors, a decent lossless image coder).

The "fpaq0p" arithmetic encoder is brutally simple. It's what you would write if you just heard of arithmetic encoding; in fact it was my first idea for a coder ever, but it didn't work, so everybody dropped it. Well, it's back. If you remember the super basic idea of an arithmetic coder is you're specifying a code stream by writing an infinite precision number; at any time you know your desired number is in some interval, low and high, you shrink the interval down to specify the symbol, and to do it in finite precision you shift out top bits of low & high as they become the same. (See my ancient write up of basic arithcoding )

So what would you write?

range_low = 0; range_high = 0xFFFFFFFF; // 32 bits on range = range_high - range_low; range_high = range_low + symhigh * (range / symtot); range_low += symlow * (range / symtot); while( (range_low>>24) == (range_high>>24) ) { *outPtr+= = (range_low>>24); range_low <<= 8; range_high <<= 8; range_high += 0xFF; }Okay, good job, but this is badly broken. First of all, you're overflowing your 32 bit code. Second of all, it doesn't handle carry propagation, and it doesn't handle underflow. The problem with carries is that range can step up to the next bit and push a bit over code size. The problem with underflow is that we aren't handling the case of the range getting very very small but while the top bytes are different.

In particular, when range_high comes down close to range_low from above and range_low comes up close to range_high from below, but the top bytes are different, eg, they're like 0x800000AA and 0x7FFFFFBB , the top bytes are different so they don't get shifted out, but the range can get arbitrarily small. (in particular the top byte of "low" is X, the top byte of "hi" is X+1, the second byte of low is 0xFF and the second byte of hi is 0x00).

But fpaq0p works, so how does it do it?

The main trick is to only do binary coding. Then you just don't worry about underflow. With the underflow problem, your range can get as small as only 1 , with 0x01000000 and 0x00FFFFFF, but you implicitly add 1 to the top, so you still have enough range to code 2 symbols, which is what you need for your binary alphabet to be decodable. Any alphabet larger than binary is not codable there.

The other minor trick is just about where you do your "+1". The high is always implicitly +1 above what you actually store in range_high. We started it with 0xFFFFFFFF but we want an implicit FFFFF... repeating to infinity after it. That's why we shift in FF, but we also have to do it right in the range update in a way that doesn't blow the 32 bits or ever generate a carry.

So the working fpaq0p coder is :

range_low = 0; range_high = 0xFFFFFFFF; // 32 bits on range = range_high - range_low; range_mid = range_low + symp0 * (range / symtot); if ( bit ) range_low = range_mid+1; else range_high = range_mid; while( (range_low>>24) == (range_high>>24) ) { *outPtr+= = (range_low>>24); range_low <<= 8; range_high <<= 8; range_high += 0xFF; }

Which looks like it's doing the same trivial thing, but is subtly different.

Note that in the horrible underflow case your coding precision can get as bad as only 1 bit (that is, the probabilities are forced to 50/50 regardless of what they really are). Obviously that hurts coding, but it's actually not that bad, because it's statistically incredibly rare. In fact it happens something like 2^-24 of the time, since all spots on the code range are equally likely. And even when it does happen you only lose a few bits of coding efficiency.

BTW :

To do this faster : while( (range_low>>24) == (range_high>>24) ) You can do : while( ((x1 ^ x2)>>24) == 0 ) or while( ((x1 ^ x2)&0xFF000000) == 0 ) or while( (x1 ^ x2) < (1<<24) )but in practice it doesn't really matter because you're stalled out on branches anyway.

Also in a binary arithmetic coder you never actually divide by symtot. You use symtot = some fixed power of 2, and you do a probability update that keeps tot the same. My code's got this old rung/ladder thing that does different styles of adaptation with a power of 2 total. Or you can just do the super simple thing that everybody seems to do these days :

#define symbits (12) // or whatever < 14 #define symtot (1<< symbits) #define updshift (6) // or anything in {1,symbits-1} void update(int & p0,bool bit) { if ( bit ) { p0 -= p0 >> updshift; } else { p0 += (symtot - p0) >> updshift; } }

Here "updshift" controls how fast your adaptation is vs. how precise you become in the long term. This update makes no distinction between the initial rollup when you have no stats and the adapted state, which the rung/ladder does. This is an approximation of a normal n0/n1 counter with an implicit max count of (1 << updshift). Compare to an n0/n1 update :

void update(int & p0,bool bit) { int n0 = p0; int n1 = symtot - p0; int increment = symtot >> updshift; if ( bit ) { n1 += increment; } else { n0 += increment; } p0 = (n0 * symtot) / (n0 + n1); }

We can see the equivalence and approximation by looking at the bit on case :

I'll use := to mean assignment while I do maths : p0 := (n0 * symtot) / (n0 + n1); p0 := (p0 * symtot) / (symtot + incr); p0 := (p0 * symtot)*(symtot - incr) / (symtot + incr)*(symtot - incr); now approximate incr^2 is much less than symtot^2 p0 := (p0 * symtot)*(symtot - incr) / (symtot^2); p0 := (p0)*(symtot - incr) / (symtot); p0 := p0 - (p0*incr)/symtot); p0 -= p0/(symtot/incr); p0 -= p0>>updshift;

BTW this approximation goes back to the ancient stuff by Howard & Vitter on fast arithmetic coding. Of course in practice you also want to merge your modelling with arithmetic coding since they do the same random branch on bit.

Most people these days are good at designing their compressor to drive the binary arithmetic coder well. A well driven binary arithmetic coder is generally coding probabilities close to 50%. That's better because it keeps precision high and keeps you in the accurate part of the coarse probability update approximation. It's also better because it means you're calling the coder many fewer times. If you only call the arithmetic coder with probablities close to 50% it means you're only doing coding roughly as many times as the number of bits you output, which is low. You accomplish this by coding symbols that are "natural" for the problem. For example, for LZ77 offsets you code in the log2 of the offset (similarly with BWT MTF codes). etc. etc.

So I also found this range coder implementation called the "Russian Range Coder". It's extremely similar to the standard Michael Schindler coder except the renorm & bit output loop is slightly different :

"Russian Range Coder" : while((rc->low ^ (rc->low + rc->range)) < (1<<24)) { *rc->ptr++ = (U8) (rc->low >> 24); rc->range <<= 8; rc->low <<= 8; } while(rc->range < (1 << 16)) { *rc->ptr++ = (U8) (rc->low >> 24); rc->range = ((~rc->low) & 0xFFFF) << 8; rc->low <<= 8; }So what's going on here? The first part we can recognize is exactly the simple loop from fpaq0p that just outputs the top byte when it's the same. But if you just do that you're fucked for multisymbol alphabets because of underflow, so looky here, the second part handles that. It checks if the range is getting too small to resolve symbols. But normally in the virtual queue / schindler coder that would mean setting up a carry queue and such. This coder avoids that. It does it by just giving up a huge chunk of the range.

The bad case occurs when the range is small and the top bytes are off by 1. The second bytes are 0x00 and 0xFF. With a virtual queue, you would put the top byte of "low" in the virtual queue spot, and count the number of 0xFF bytes but don't write them yet. If you get a carry, you propagate the carry by writing the top of low+1, then implicitly flipping the 0xFF's to 0x00's and writing them. This is the virtual Q / schindler way.

Here we don't check for underflow and do the carry and such, we just smash down the "high" to the "low". We know the top bytes are not equal, so the next bytes must be 00 and FF, eg. we have something like 0x0B00XXXX and 0x0AFFXXXX , we just smash the high down to 0x0AFFFFFF. This bit :

rc->range = ((~rc->low) & 0xFFFF) << 8; rc->low <<= 8; is the same as : high = low | 0xFFFF; range = high - low; range <<= 8; low <<= 8;which I think makes it clearer what's happening. You just chop down to make your top bytes the same, eliminating the bad straddle situation which caused underflow. Now you can renormalize the simple way by shifting out equal top bytes and get a big enough range back.

Now obviously this is an approximation that loses some efficiency, you're throwing away on average half your range in the cases that you have this bad underflow, but again it's pretty rare. I don't know the exact probabilities, but the measured efficiency is very close to the full coder.

In practice, however, this coder is not any faster than the full correct schindler coder, so there's no point to using it. It is however, very simple and elegant once you see how it works.

ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand. It seems to offer no advantages and many disadvantages, but I would like to grok it for my own brain's happiness.

## No comments:

Post a Comment