3/24/2017

JPEG2

JPEG2 proposal / rough principles :

1. Simple simple simple. The decoder should be implementable in ~5000 lines as a single file stb.h style header. Keep it simple!

2. It should be losslessly transcodable from JPEG, a la packJPG/Lepton. That is, JPEG1 should be contained as a subset (this just means having an 8x8 DCT mode and JPEG's quantization matrices). You could have other block modes in JPEG2 that simply aren't used when you transcode JPEG. You replace the entropy-coded back-end with JPEG2's and should get about a 20% file size reduction.

IMO this is crucial for rolling out a new format; nobody should ever be lossily re-encoding existing JPEGs and thereby introducing new error.

3. Reasonably fast to decode. Slower than JPEG1 by maybe 2X is okay, but not by 10X. eg. JPEG-ANS is okay, JPEG-Ari is probably not okay. Also think about parallelism and GPU decoding for huge images (100 MP). Keeping decoding local is important (eg. each 32x32 block or so should be independently decodable).

4. Decent quality encoding without crazy optimizing encoders. The straightforward encode without big R-D optimizing searches should still beat JPEG.

5. Support for per-block Q, so that sophisticated encoders can do bit rate allocation.

6. Support alpha, HDR. Make a clean definition of color space and gamma. But *don't* go crazy with supporting ICC profiles and lots of bit depths and so on. This needs to be the smallest possible feature set. You don't want to get into the all-too-common situation where the format is too complex and nobody actually supports it right in practice, so you wind up with a "spec standard" and a "de-facto standard" because real decoders don't handle lots of the optional modes correctly.

7. Support larger blocks & non-square blocks; certainly 16x16, maybe 32x32? Things like 16x8, etc. This is important for increasingly large images.

Most of all keep it simple, keep it close to JPEG, because JPEG actually works and basically everything else in lossy image compression doesn't.

Anything that's not just DCT + quantize + entropy is IMO a big mistake, very suspicious and likely to be vaporware in the sense that you can make it look good on paper but it won't work well in reality.
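
For concreteness, here's a rough sketch of that per-block core: a plain float 8x8 forward DCT, then divide by a quant matrix and round. (Illustrative only; a real codec would use a fast integer transform, and the quant matrix would come from the file, scaled by per-block Q.)

// Illustrative 8x8 forward DCT + quantize (float, direct form; a real
// codec would use a fast integer transform). The int16 coefficients are
// what then get handed to the entropy back-end.

#include <math.h>
#include <stdint.h>

#define PI_F 3.14159265358979f

static void fdct8x8(const float in[8][8], float out[8][8])
{
    for (int v = 0; v < 8; v++)
    for (int u = 0; u < 8; u++)
    {
        float sum = 0.0f;
        for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += in[y][x]
                 * cosf((2*x + 1) * u * PI_F / 16.0f)
                 * cosf((2*y + 1) * v * PI_F / 16.0f);

        float cu = (u == 0) ? 0.70710678f : 1.0f;  // 1/sqrt(2) for the DC basis
        float cv = (v == 0) ? 0.70710678f : 1.0f;
        out[v][u] = 0.25f * cu * cv * sum;
    }
}

// Divide each coefficient by its quantizer step and round; the quant
// matrix comes from the file, and per-block Q would just scale it.
static void quantize8x8(const float dct[8][8], const uint16_t quant[8][8],
                        int16_t coeff[8][8])
{
    for (int v = 0; v < 8; v++)
    for (int u = 0; u < 8; u++)
        coeff[v][u] = (int16_t)lrintf(dct[v][u] / (float)quant[v][u]);
}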


ADD :

I have in the past posted many times about how plain old baseline JPEG + decent back-end entropy (eg. packJPG/Lepton) is surprisingly competitive with almost every modern image codec.

That's actually quite surprising.

The issue is that baseline JPEG is doing *zero* R-D optimization. Even if you use something like mozjpeg, which does a bit of R-D optimization, it's doing it for the *wrong* rate model (it assumes baseline JPEG entropy coding, not the packJPG back-end I then actually use).

It's well known that doing R-D optimization correctly (with the right rate model) provides absolutely enormous wins in lossy compression, so the fact that baseline JPEG + packJPG without any R-D at all can perform so well is really an indictment of everything it beats. This tells us there is a lot of room for easy improvement.
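
To be clear about terms: by R-D optimization I just mean making encoding decisions by minimizing a Lagrangian cost J = D + lambda*R, where the rate R is measured under the entropy coder you actually ship. A tiny sketch of the kind of per-coefficient decision involved (the bit costs are inputs precisely because they have to come from the real back-end):

// Per-coefficient R-D decision: keep the quantized value or force it to
// zero, whichever minimizes J = D + lambda * R. The bit costs are inputs
// here because the whole point is that they must come from the entropy
// back-end you actually ship, not from baseline JPEG's Huffman tables.

static int rd_keep_or_zero(float orig,       // original (unquantized) DCT coefficient
                           int   q,          // quantized value, ~ round(orig/step)
                           float step,       // quantizer step size
                           float lambda,     // rate-distortion tradeoff
                           float bits_keep,  // estimated bits to code q
                           float bits_zero)  // estimated bits to code 0
{
    float d_keep = (orig - q * step) * (orig - q * step);  // squared error if kept
    float d_zero = orig * orig;                            // squared error if zeroed

    float j_keep = d_keep + lambda * bits_keep;
    float j_zero = d_zero + lambda * bits_zero;

    return (j_zero < j_keep) ? 0 : q;
}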

3/08/2017

Kraken Perf with Simultaneous Threaded Decodes

I had a report from a customer of poor Kraken decode performance on PS4 when using 10 simultaneous threads for decoding, and it occurred to me I had never tested that thoroughly. (Their issue turned out to be something else; see end of post).

There is reason to be concerned about running a lot of Kraken (or Mermaid/Selkie) decodes simultaneously. On most modern systems, like the PS4, the cores share caches, and perhaps share memory buses or TLBs. That means that while you have N* the compute performance, you may get cache conflicts and could wind up bottlenecking on some part of the memory subsystem. (Generally we don't run into bandwidth bottlenecks, but there are lots of other limited resources, like queue sizes, etc.)

Anyhoo, onto the testing -

I ran N threaded decodes of the same file. The buffers are copied for each thread so they can't share any cache for input or output buffers. Caches are wiped before each run. I then wait for all N decodes to finish and time that.
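
The harness is roughly shaped like this (a simplified sketch, not the actual test code; decode_one is a trivial stand-in for the Kraken decompress call):

// Sketch of the test harness (simplified). Each thread gets its own
// private copies of the input and output buffers; all N decodes are
// launched together and timed as a group. Assumes n <= 64.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef struct {
    const unsigned char *comp; size_t comp_len;   // private compressed copy
    unsigned char       *raw;  size_t raw_len;    // private output buffer
} job_t;

// Stand-in for the real decode; just touches both buffers so the harness
// shape is clear.
static void decode_one(const unsigned char *comp, size_t comp_len,
                       unsigned char *raw, size_t raw_len)
{
    size_t n = raw_len < comp_len ? raw_len : comp_len;
    memcpy(raw, comp, n);
}

static void *worker(void *arg)
{
    job_t *j = (job_t *)arg;
    decode_one(j->comp, j->comp_len, j->raw, j->raw_len);
    return NULL;
}

// Run n simultaneous decodes; print & return the total wall-clock seconds.
static double time_n_decodes(int n, const unsigned char *comp,
                             size_t comp_len, size_t raw_len)
{
    pthread_t th[64];
    job_t     job[64];

    for (int i = 0; i < n; i++) {
        job[i].comp = memcpy(malloc(comp_len), comp, comp_len);
        job[i].comp_len = comp_len;
        job[i].raw = malloc(raw_len);
        job[i].raw_len = raw_len;
    }
    // (caches are wiped here in the real test)

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++) pthread_create(&th[i], NULL, worker, &job[i]);
    for (int i = 0; i < n; i++) pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%d, %.4f, %.4f\n", n, total, total / n);  // N, total, time per decode

    for (int i = 0; i < n; i++) { free((void *)job[i].comp); free(job[i].raw); }
    return total;
}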

The graphs show total time for all N decodes, and time per decode (total/N).

If you had infinite compute resources, then "total time" (orange) would be a flat line. Any number of threads would take the same total time, it would not change.

Once you hit the limits of the system, the "time per" (blue) should be constant, and total time should go up linearly. (Actually not quite: when the number of decodes isn't a multiple of the core count, the threads don't all complete at the same time, so you get wasted idle time; see the jump on lappy from 4-6 threads and then how flat it is from 6-8, and likewise on PS4 the jump from 6-8 threads and then how flat it is from 9-12.) If you have the threads to spare, then you can maximize throughput by minimizing "time per".


Conclusion :

No problem with lots of simultaneous Kraken decodes. Even when heavily over-subscribed, there's no major perf inversion due to overloading cache or memory subsystems.

Kraken on PS4 has near-perfect threading up to 6 threads (total time goes from 0.0099 to 0.0111); on lappy it's not as good, but it still provides benefit to the time per decode up to 4 threads (total time from 0.0060 to 0.0095).

It's a surprise to me that the PS4 scales so well despite the first 4 cores sharing cache & memory bus. It's also a surprise that lappy scales less well; I thought it would be near perfect on the first 4 cores, but maybe that's just Windows not giving me the whole machine? That was backward from my expectation.
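
To put a rough number on that scaling: speedup vs. doing the decodes serially is N*total(1)/total(N), and parallel efficiency is that divided by N. From the PS4 raw data below, 6 threads gives 6*0.0099/0.0111 = ~5.4X, or about 89% efficiency. A quick sketch:

// Speedup & parallel efficiency from the (N, total time) pairs;
// data is the PS4 lzt24 column from the raw data below.

#include <stdio.h>

int main(void)
{
    static const double total[16] = {   // total seconds for N = 1..16 decodes
        0.0099, 0.0102, 0.0104, 0.0106, 0.0110, 0.0111, 0.0147, 0.0180,
        0.0204, 0.0214, 0.0217, 0.0220, 0.0257, 0.0297, 0.0310, 0.0319
    };

    for (int i = 0; i < 16; i++) {
        int    n       = i + 1;
        double speedup = n * total[0] / total[i];  // throughput gain vs serial
        printf("%2d threads : %.2fX speedup, %3.0f%% efficiency\n",
               n, speedup, 100.0 * speedup / n);
    }
    return 0;
}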


Charts :

Kraken on PS4 (6 cores; 4 cores per 2MB L2) :

lzt24 :

lzt99:

Almost perfect threading from 1-6 cores (total time constant) even with large binary file.

webster:

webster is a large text file that uses a lot of long-distance matches (offset > 1M). Text files have a very different character from binary files like the lzt's. We can see that the large hot memory region used by webster does put some stress on the shared L2; there's falloff in perf from 1-4 cores.

webster Selkie :

Selkie is much faster than Kraken (2.75X faster on webster on PS4), so all else being equal it should be affected a lot more by thread contention hurting memory latency. But Selkie has some unique cleverness that makes it immune to this; threading even on webster is near perfect from 1-6 cores.


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kB L2, 8 MB L3) (+4 hypercores) no turbo :

lzt24 :

lzt99 :

webster :

Similar to PS4, lappy has almost perfect threading on binary files from 1-4 cores. On webster there is falloff in perf due to the stress that webster's large hot memory region puts on the shared cache.


Kraken on my laptop (4 cores) (Core i7 Q820) (4x256 kB L2, 8 MB L3) (+4 hypercores) WITH TURBO :

lzt24 :

lzt99 :

I initially mistakenly posted lappy timings with turbo enabled. I usually turn it off for perf testing on my laptop so that timings are more reliable. Actually, I think it's interesting to look at how the perf falloff is different with turbo.

Without turbo, total time is constant on lzt24 and lzt99 from 1-4 cores, but with turbo it steadily falls off, because adding more cores causes the laptop to reduce its clock rate. Despite that, there's still a solid gain in throughput (the blue "time per" is going down despite the clock rate also going down).


raw data : (lzt24) : columns are N threads, total time, time per decode (total/N)


lappy : no turbo : (times *1000, ie. milliseconds)
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475
5,  14.6867,   2.9373
6,  16.6759,   2.7793
7,  19.1105,   2.7301
8,  20.1687,   2.5211
9,  23.6391,   2.6266
10,  25.9279,   2.5928
11,  27.7395,   2.5218
12,  27.6459,   2.3038
13,  30.7935,   2.3687
14,  31.8541,   2.2753
15,  33.7883,   2.2526
16,  34.8252,   2.1766

lappy : with turbo :
1,   0.0060,   0.0060
2,   0.0070,   0.0035
3,   0.0087,   0.0029
4,   0.0095,   0.0024 <- 4
5,   0.0133,   0.0027
6,   0.0170,   0.0028
7,   0.0175,   0.0025
8,   0.0193,   0.0024 <- 8
9,   0.0228,   0.0025
10,   0.0252,   0.0025
11,   0.0262,   0.0024
12,   0.0278,   0.0023 <- 12
13,   0.0318,   0.0024
14,   0.0310,   0.0022
15,   0.0325,   0.0022
16,   0.0346,   0.0022 <- 16

PS4 :
1,   0.0099,   0.0099
2,   0.0102,   0.0051
3,   0.0104,   0.0035
4,   0.0106,   0.0027
5,   0.0110,   0.0022
6,   0.0111,   0.0018 <- min
7,   0.0147,   0.0021
8,   0.0180,   0.0022
9,   0.0204,   0.0023
10,   0.0214,   0.0021
11,   0.0217,   0.0020
12,   0.0220,   0.0018 <- same min again
13,   0.0257,   0.0020
14,   0.0297,   0.0021
15,   0.0310,   0.0021
16,   0.0319,   0.0020



comparing just lappy turbo to no-turbo (both *1000) :

lappy : no turbo :
1,   9.1360,   9.1360
2,   9.5523,   4.7761
3,   9.7850,   3.2617
4,  10.1901,   2.5475

lappy : with turbo :
1,   6.0,   6.0
2,   7.0,   3.5
3,   8.7,   2.9
4,   9.5,   2.4

You can see that with only 1 core, turbo is about 1.5X faster than no turbo (9.13/6.0). With 4 cores they are getting close to the same speed (10.2 vs 9.5); the turbo has almost completely clocked down.


The customer's actual issue was decoding into write-combined graphics memory. This is an absolute killer for decoder perf because Kraken (like any LZ decoder) needs to read back the buffers it writes.
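
To see why it's such a killer, remember that the heart of any LZ decode is a match copy whose source is the output buffer itself; if that buffer is uncached write-combined memory, every one of those read-backs goes out to memory. A generic sketch of the inner loop (not Kraken's actual code):

// Generic LZ match copy: the source of the copy is earlier *output*, so
// the decoder reads back bytes it just wrote. If "out" is uncached
// write-combined memory, each of those reads is a slow trip to memory
// instead of an L1 hit.

#include <stddef.h>
#include <stdint.h>

static void lz_copy_match(uint8_t *out, size_t pos, size_t offset, size_t len)
{
    uint8_t       *dst = out + pos;
    const uint8_t *src = out + pos - offset;  // points back into the output buffer
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];  // byte-by-byte so overlapping matches (offset < len) work
}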

On the PS4 I think the best way to decode to graphics memory (garlic) is to allocate the memory as writeback onion, do the decompress, then change it to wb_garlic with sceKernelBatchMap (which will cause a CPU cache flush). Several of these changes can be batched together; eg. for level loading you only need to do it once at the end of all the resource decoding, not once per resource.
