9/24/2012

09-24-12 - LZ String Matcher Decision Tree

Revisiting this to clarify a bit on the question of "I want to do X , which string matcher should I use?"

Starting from the clearest cases to the least clear :

There are two types of parsing, I will call them Optimal and Greedy. What I really mean is Optimal = find matches at every position in the file, and Greedy = find matches only at some positions, and then take big steps over the match. (Lazy parses and such are in the Greedy category).

There are three types of windowing : 1. not windowed (eg. infinite window or we don't really care about windowing), 2. fixed window; eg. you send offsets in 16 bits so you have a 64k window for matches, 3. multi-window or "cascade" ; eg. you send offsets up to 128 in 1 byte , up to 16384 in 2 bytes, etc. and you want to find the best match in each window.
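To make the cascade case concrete, here is a minimal sketch of how an offset maps to a window in the cascade. The 1-byte and 2-byte limits are the ones from the example above; the 3-byte limit of 2^21 and the names `CASCADE_LIMITS` and `offset_bytes` are made up purely for illustration.

```python
# Upper offset limit for each cascade level; index 0 costs 1 byte, index 1
# costs 2 bytes, etc. The third limit is an arbitrary assumption.
CASCADE_LIMITS = [128, 16384, 1 << 21]

def offset_bytes(offset):
    """Return how many bytes this offset costs, or None if it is out of range."""
    if offset < 1:
        return None
    for nbytes, limit in enumerate(CASCADE_LIMITS, start=1):
        if offset <= limit:
            return nbytes
    return None
```

The point of wanting the best match in each window is that a longer match at a far offset costs more bytes to send than a shorter match nearby, so the parser needs to see both candidates to choose.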

There are two scan cases : 1. Incremental scan; that is, we're matching against the same buffer we are parsing; matches cannot come from in front of our current parse position, 2. Separate scan - the match source buffer is independent from the current parse point, eg. this is the case for precondition dictionaries, or just part of the file that's well before the current parse point.

1. Optimal Parsing , No window, Incremental scan : Suffix Trie is the clear winner here. Suffix Trie is only a super clear winner when you are parsing and matching at the same time; adding strings and finding matches are exactly the same work, so you double your time taken if they are done separately. That is, you must be scanning forward, adding strings and getting matches. Suffix Trie can be extended to Cascaded Windowing in an approximate way, by walking to parents in the tree, but doing it exactly breaks the O(N) of the Suffix Trie.

2. Optimal Parsing, No window or single window, Separate Scan : Suffix Array is pretty great here. Separate scan means you can just take the whole buffer you want to match against and suffix sort it.

(BTW this is a general point that I don't think most people get - any time you are not doing incremental update, a sort is a superb search structure. For example it's very rarely best to use a hash table when you are doing separate scan, you should just have a sorted list, possibly with jump-ins)
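A rough sketch of the "sort is a search structure" idea for separate scan: sort the suffixes of the source buffer once, then binary search each query and check the two sorted neighbors. This is not how a real suffix sorter works (the sort here is a naive one that is fine only for tiny inputs), and the function names are just for illustration.

```python
def build_suffix_array(source):
    # Naive sort of all suffix start positions; a real implementation
    # would use a proper suffix sort instead of comparing whole slices.
    return sorted(range(len(source)), key=lambda i: source[i:])

def common_prefix_len(a, b):
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    return n

def longest_match(source, sa, query):
    """Return (length, position) of the longest prefix of query found in source."""
    # Binary search for where query would be inserted among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if source[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The longest common prefix is always with one of the two sorted neighbors.
    best = (0, -1)
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = common_prefix_len(source[sa[idx]:], query)
            if n > best[0]:
                best = (n, sa[idx])
    return best
```

Usage would be something like `sa = build_suffix_array(dictionary)` once, then `longest_match(dictionary, sa, data[pos:])` at each parse position.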

3. Optimal Parsing, Windowed or Cascaded, Incremental or Separate Scan : there's not an awesome solution for this. One method I use is cascades of suffix arrays. I wrote in the past about how to use Suffix Array Searcher with Incremental Scan (you have to exclude positions ahead of you), and also how to extend it to Windowing. But those methods get slow if the percentage of matches allowed gets low; eg. if you have a 1 GB buffer and a 64k window, you get a slowdown proportional to (1GB/64k). To address this I use chunks of suffix array; eg. for a 64k window you might cut the file into 256k chunks and sort each one, then you only have to search in a chunk that's reasonably close to the size of your window. For cascaded windows, you might need multiple levels of chunk size. This is all okay and it has good O(N) performance (eg. no degeneracy disasters), but it's rather complex and not awesome.
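The arithmetic behind the chunking, as a small sketch. The helper names are hypothetical, and this only shows which chunks need searching, not the search itself; candidates found in a chunk still have to be filtered to those behind the parse position and inside the window.

```python
KB = 1024

def chunks_to_search(pos, window, chunk_size):
    """Indices of the sorted chunks that overlap the window [pos - window, pos)."""
    if pos <= 0:
        return []
    start = max(0, pos - window)
    first = start // chunk_size
    last = (pos - 1) // chunk_size
    return list(range(first, last + 1))

def scan_ratio(searched_bytes, window):
    """How much larger the searched region is than the window of allowed matches."""
    return searched_bytes / window
```

With one suffix array over a 1 GB buffer and a 64k window the ratio is 16384; with 256k chunks a 64k window never touches more than two chunks, so the ratio is at most 8.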

Another option for this case is just to use something like Hash->Links and accept its drawbacks. A more complex option is to use a hybrid; eg. for cascaded windows you might use Hash->Links for the small windows, and then Suffix Array for medium size windows, and Suffix Trie for infinite window. For very small windows (4k or less) hash->links (or even just a "cache table") is very good, so it can be a nice supplement to a matcher like Suffix Trie that is not great at cascaded windows.

Addendum : "Suffix Array Sets" is definitely the best solution for this.

4. Greedy Parsing : Here SuffixArray and SuffixTrie both are not awesome, because they are essentially doing all the work of an optimal-style parse (eg. string matching at every position), which is a big waste of time if you only need the greedy matches.

Hash-Link is comparable to the best matcher that I know of for greedy parsing. Yann's MMC is generally a bit faster (or finds better matches at the same speed) but is basically in the same class. The pseudo-binary-tree thing used in LZMA (and I believe it's the same thing that was used in the original PkZip that was patented) is not awesome; sometimes it's slightly faster than hash-link, sometimes slightly slower. All of these can be windowed relatively easily.

Hash-Link extends very naturally to cascaded windows : because you are always visiting links in order from lowest offset to highest, you can easily find exact best matches in each window of the cascade as you go.
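A minimal sketch of that chain walk. To keep it short, a Python dict keyed on the first `MIN_MATCH` bytes stands in for the real hash table (so there are no collisions to deal with), and the class name, the two-level cascade, and the `max_steps` limit are all assumptions for illustration.

```python
MIN_MATCH = 3
CASCADE_LIMITS = [128, 16384]  # max offset for the 1-byte and 2-byte windows

def match_len(data, cand, pos):
    n = 0
    while pos + n < len(data) and data[cand + n] == data[pos + n]:
        n += 1
    return n

class HashLinkMatcher:
    def __init__(self, data):
        self.data = data
        self.head = {}                  # key -> most recent position with that key
        self.prev = [-1] * len(data)    # position -> previous position with same key

    def insert(self, pos):
        if pos + MIN_MATCH > len(self.data):
            return
        key = self.data[pos:pos + MIN_MATCH]
        self.prev[pos] = self.head.get(key, -1)
        self.head[key] = pos

    def best_per_window(self, pos, max_steps=64):
        """Return one (length, offset) per cascade window; (0, 0) means no match."""
        best = [(0, 0)] * len(CASCADE_LIMITS)
        if pos + MIN_MATCH > len(self.data):
            return best
        cand = self.head.get(self.data[pos:pos + MIN_MATCH], -1)
        window = 0
        steps = 0
        while cand >= 0 and steps < max_steps:
            offset = pos - cand
            # Offsets only grow along the chain, so the window index only moves up.
            while window < len(CASCADE_LIMITS) and offset > CASCADE_LIMITS[window]:
                window += 1
            if window == len(CASCADE_LIMITS):
                break
            length = match_len(self.data, cand, pos)
            if length > best[window][0]:   # strict: lowest offset wins ties
                best[window] = (length, offset)
            cand = self.prev[cand]
            steps += 1
        return best
```

The matches are only exact if the chain is walked to the end; with a step limit like `max_steps` you get the usual Hash-Link approximation.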

With Greedy Parsing you don't have to worry about degeneracies quite so much, because when you find a very long match you are just going to take it and step over it. (That is, with optimal parse if you find a 32k long match, then at the next step you will find a 32k-1 match, etc., which is a bad N^2 (or N^3) thing if you aren't super careful, eg. by using a SuffixTrie with a correct "follows" implementation.) However, with lazy parsing you can still hit a mild form of this degeneracy, but you can avoid that pretty easily by just not doing the lazy eval if your first match length is long enough (over 1024 or whatever).
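A sketch of a one-step lazy parse with that cutoff. The brute-force `find_longest_match` is only a stand-in for a real matcher like Hash-Link, and the token format and the name `LAZY_SKIP_LEN` are invented for this example.

```python
MIN_MATCH = 3
LAZY_SKIP_LEN = 1024  # matches this long are taken without trying the lazy step

def find_longest_match(data, pos):
    """Brute-force stand-in for a real matcher: returns (length, offset)."""
    best_len, best_off = 0, 0
    for cand in range(pos):
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_off = n, pos - cand
    return best_len, best_off

def lazy_parse(data, lazy_skip_len=LAZY_SKIP_LEN):
    tokens = []
    pos = 0
    while pos < len(data):
        length, offset = find_longest_match(data, pos)
        if length < MIN_MATCH:
            tokens.append(("lit", data[pos]))
            pos += 1
            continue
        # Lazy eval: only look one position ahead when the match is not
        # already long; this is what avoids the mild degeneracy.
        if length < lazy_skip_len and pos + 1 < len(data):
            next_len, _ = find_longest_match(data, pos + 1)
            if next_len > length:
                tokens.append(("lit", data[pos]))
                pos += 1
                continue  # the better match is found again on the next loop
        tokens.append(("match", length, offset))
        pos += length  # step over the whole match
    return tokens

def decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, length, offset = tok
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)
```

On a long run like `b"a" * 3000` this emits one literal and one 2999-long match, and the matcher is only called twice, because the parse steps over the match instead of re-matching inside it.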

(BTW I'm pretty sure it's possible to do a Suffix Trie with lazy/incremental update for Greedy Parsing; the result should be similar to MMC but provide exact best matches without any degenerate bad cases; it's rather complex and I figure that if I want perfect matching I generally also want Optimal Parsing, so the space of perfect matching + greedy parsing is not that important)

Previous posts on string matching :

cbloom rants 06-17-10 - Suffix Array Neighboring Pair Match Lens
cbloom rants 09-23-11 - Morphing Matching Chain
cbloom rants 09-24-11 - Suffix Tries 1
cbloom rants 09-24-11 - Suffix Tries 2
cbloom rants 09-25-11 - More on LZ String Matching
cbloom rants 09-26-11 - Tiny Suffix Note
cbloom rants 09-27-11 - String Match Stress Test
cbloom rants 09-28-11 - Algorithm - Next Index with Lower Value
cbloom rants 09-28-11 - String Matching with Suffix Arrays
cbloom rants 09-29-11 - Suffix Tries 3 - On Follows with Path Compression
cbloom rants 09-30-11 - Don't use memset to zero
cbloom rants 09-30-11 - String Match Results Part 1
cbloom rants 09-30-11 - String Match Results Part 2
cbloom rants 09-30-11 - String Match Results Part 2b
cbloom rants 09-30-11 - String Match Results Part 3
cbloom rants 09-30-11 - String Match Results Part 4
cbloom rants 09-30-11 - String Match Results Part 5 + Conclusion
cbloom rants 10-01-11 - More Reliable Timing on Windows
cbloom rants 10-01-11 - String Match Results Part 6
cbloom rants 10-02-11 - How to walk binary interval tree
cbloom rants 10-03-11 - Amortized Hashing
cbloom rants 10-18-11 - StringMatchTest - Hash 1b
cbloom rants 11-02-11 - StringMatchTest Release
