7/03/2010

07-03-10 - Length-Limitted Huffman Codes Heuristic

In the last post if you look at the comments you can find some comparison of optimal length limitted vs. heuristic length limitted.

I thought I would describe the heuristic algorithm. It is O(N) with no additional storage (it can work in place, which goes nicely with Moffat's in place Huffman codelen builder ). Here's the algorithm :

1. Build Huffman code lengths using Moffat INPLACE. You observe some of those code lengths are > maxCodeLen. We will work only on the code lengths, and we are given the symbol counts. We are given the symbol counts in sorted order (this was already done for INPLACE; if they were not originally sorted a simple NlogN sort will make them so).

2. Set all code lengths > max to be = maxCodeLen. We now have invalid code lengths, they are not "prefix". That is, they do not satisfy the kraft inequality K <= 1 for decodability.

3. Compute the Kraft number, K = Sum { 2 ^ - L_i } ; we currently have K > 1 and want to shrink it down to K = 1 by increasing some code lengths.

4. PASS 1. Walk over the symbols in sorted order (from lowest count to highest) while K > 1. Do :


  while ( codeLen[ s ] < max && K > 1 )
  {
    codeLen[ s ] ++;

    // adjust K for change in codeLen
    K -= 2 ^ - codeLen[ s ]
  }

5. PASS 2. Walk over the symbols backwards (from highest to lowest count) while K < 1. Do :


  while ( (K + 2^-codeLen[ s ]) <= 1 )
  {
    // adjust K for change in codeLen
    K += 2 ^ - codeLen[ s ]

    codeLen[ s ] --;
  }

6. You now have a set of codelens with K = 1 and all codeLens <= max. Fini.

Okay, so what's happening here ?

There's one forward pass and one backwards pass. First we truncate the code lengths that were too long. We are now in trouble and we need to find some other code lengths that we can make longer so that we are prefix again. The best code to make longer is the one with the lowest symbol count. It doesn't matter how long the current code length is, the cost of doing L += 1 is always the symbol count. So we just walk forward from the lowest symbol count. (*).

After step 4 you have a code with K <= 1 , if it's == 1 you're done, but sometimes it is < 1 because you bumped a lower codelen than necessary and you have a bit of space in your prefix code. To take advantage of this you want to find the highest count symbol whose length you can decrease and still have a prefix code.

As noted in the previous post this can be far from optimal, but in the standard case it just doesn't matter much because these are the very rare symbols.

footnotes :

(* while it is true that the cost is independent of L, the benefit to K is not independent of L, so adjusting shorter code lens is probably better. Instead of the minimum symbol count (C) you want to minimize the cost per benefit, which is C * 2^L . So you'd have to maintain a priority queue (**).)

(** it's actually more complex than that (I just tried it). In this step you will often be overshooting K, when considering overshooting you have to consider the penalty from doing len++ in the step that does the overshoot vs. how much you can get back by doing len-- elsewhere to come back to K=1. That is, you need merge step 4 and 5 such that you create a single priority queue which consists of some plain len++ ops, and also some ops that do one len++ some number of other len--'s, and pick the best of those options which doesn't overshoot K. Keep doing ops while K > 1 and you will wind up with K = 1. ).

Actually I wonder if this is a way to reconcile Huffman code building with Package-Merge ?

What would the correct priority queue op be for the (**) footnote ?

Say you're considering some op that does a len++ somewhere and overshoots K. You need compensate with some amount of K value to correct. Say that value you need to correct is 2^L. You can either do len-- on a code of length L, or you can do it on two codes of length L+1. Or one of length L+1 and two of length L+2.

Yep, I see it. Construct a priority queue for each length L. In the queue are symbols of code length L, and also pairs of items of length L+1 (an item is either a symbol or a pair). To correct K by 2^L you pick the best item from the L'th queue.

But rather than mess with this making an initial K and then doing corrections, you can just start with all L = 0 and K = N and then consider doing L++ on each code, that is, so you start by taking the best items from the L = 1 list. Which is just the package-merge algorithm !

Note that seeing this equivalence relies on some properties of the package-merge algorithm that aren't obvious. When you are pulling nodes at the final list (the L = 1 list), you can either pick a symbol; picking a symbol means its length was 0 and you are making it 1. That means that symbol was never picked before. (this is true because a coin i is never picked in an earlier list before it is made active in the final list). Or, if you don't pick a symbol you can pick a pair from the next L list. This corresponds to doing L++ on those code lengths. The key thing is : if a tree item has child i at level L, then child i already occurs L times as a raw symbol. This must be true because the cost of the tree item containing child i is > the cost of child i itself, so at all those levels child i would have chosen before the tree item.

For example :


L=3:   A  B

L=2:   A  B  {AB}  C

L=1:   A  B  {AB}  C  {AB|C}

At the point where we select {AB} in the L=1 list, A and B must already have occured once so their length is already 1. So {AB} means change both their lengths from 1 to 2; this adds them to the active set on the 2 list.

No comments:

old rants