Written: 2017-12-29

xz and multi-threading

What you (don't) need to know

I often use xz with the -T nn flag on my workstation when I have the CPU cores and RAM to spare it. I was under the assumption that I was paying for the gains in speed with something else—probably a worse deflation ratio. This has always been a longstanding assumption of mine, and it was time to put it to the test recently, when someone on IRC asked me if my assumption was correct.

Test I

I decided to start with a classic test for compression: attempting to compress pseudo-random data, and measuring the increase in file size. My motivation for this is that it is one extreme of the scale (the other being consistent data, all zeros for example). Extremes tend to show up edge cases.

I generated 8 GiB of pseudo-random data in tmpfs to help reduce the effect of any fragmentation problems increasing I/O overhead:

$ dd if=/dev/urandom of=/tmp/prandom bs=8G count=1 iflag=fullblock
2+0 records in
2+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 37.7439 s, 228 MB/s

Then, I compressed this file a number of separate times, with:

$ cd /tmp/
$ for thread in 1 2 4 6 8 10 16 24 32; do \
    xz -9kv -T"$thread" prandom           \
    rm prandom.xz                         \
done

Here are the results:

 1: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   2.1 MiB/s    1:06:06
 2: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   4.5 MiB/s      30:13
 4: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   8.7 MiB/s      15:46
 6: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    13 MiB/s      10:49
 8: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    16 MiB/s       8:39
10: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    18 MiB/s       7:23
16: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    25 MiB/s       5:22
24: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    27 MiB/s       5:04
32: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    34 MiB/s       4:04

With a 100 MiB block size:

 1: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   2.4 MiB/s      56:19
 2: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   5.0 MiB/s      27:24
 4: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000   9.7 MiB/s      14:04
 6: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    14 MiB/s       9:40
 8: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    18 MiB/s       7:36
10: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    21 MiB/s       6:23
16: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    31 MiB/s       4:26
24: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    41 MiB/s       3:21
32: 100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    47 MiB/s       2:54

Diminishing returns when throwing more than a few threads at the job. This is because xz will share an input file among the threads by splitting it into chunks of data of a certain block size. This means that if the file size and block size are such that the total number of blocks is less than the number of threads we told it to use, it will simply use less threads.

Sure enough, reducing the block size to 100 MiB gives us a faster result. 100 MiB blocks would imply 81920 blocks, which should be enough to saturate a 32-thread machine. However, it's still not scaling as it "should". With 32 threads, we would ideally¹ be seeing a 32-fold reduction in time over the single-threaded run, so about 2 minutes.

What happens if we reduce the block size to something silly: 10 MiB.

$ xz -9kv --block-size=10486760 -T32 prandom
prandom (1/1)
  100 %   8,192.4 MiB / 8,192.0 MiB = 1.000    81 MiB/s       1:41

There's our (roughly) 2 minute guesstimate figure. We achieved the ideal performance, woohoo! But wait. What? This means that we just nearly doubled the throughput to 81 MiB/s, and didn't have any more inflation than normal. What sorcery is going on here? Let's try an even smaller block size.

$ xz -9kv --block-size=1048676 -T32 prandom
prandom (1/1)
  100 %   8,192.7 MiB / 8,192.0 MiB = 1.000    83 MiB/s       1:38

Ahh there it is, and extra ~300 KiB of bloat added to the archive over and above our normal inflation. The changes in inflation were present earlier, but simply too small for the output from xz.

Weird. Let's try some other types of data.

Blank Space

Let's make some. Eight whatsits of it:

$ dd if=/dev/zero of=zeros bs=8G count=1 iflag=fullblock
1+0 records in
1+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 4.70468 s, 1.8 GB/s

Now compress it like we did earlier:

 1: 100 %   1,220.3 KiB / 8,192.0 MiB = 0.000    40 MiB/s       3:27
 2: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000    82 MiB/s       1:40
 4: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   156 MiB/s       0:52
 6: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   227 MiB/s       0:36
 8: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   286 MiB/s       0:28
10: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   335 MiB/s       0:24
16: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   527 MiB/s       0:15
24: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   596 MiB/s       0:13
32: 100 %   1,225.0 KiB / 8,192.0 MiB = 0.000   610 MiB/s       0:13

Much quicker on this run, and just look at those insignificant numbers. It does still look like the same behaviour from block size is biting us in the bum after the 4 thread mark. Let's retry with a 100 MiB block size:

 1: 100 %   1,228.7 KiB / 8,192.0 MiB = 0.000    40 MiB/s       3:23
 2: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000    82 MiB/s       1:39
 4: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   165 MiB/s       0:49
 6: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   241 MiB/s       0:33
 8: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   303 MiB/s       0:27
10: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   365 MiB/s       0:22
16: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   526 MiB/s       0:15
24: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   630 MiB/s       0:13
32: 100 %   1,229.3 KiB / 8,192.0 MiB = 0.000   677 MiB/s       0:12

Yikes! Much closer to our ideal, but still a ways off. The peak throughput is closer to the "ideal" 1.2 GiB/s, so perhaps with a larger input file it would show its speed better. Oh well. Looks like we gained just under 5 KiB when transitioning from single-threaded to multi-threaded. This might as well be zero.

This leaves the question down to less extreme data. Something that is commonly compressed with xz.

Big Booty Binaries

Let's get a bunch of built pacman packages, unxz them, tar them together, then try and compress this (roughly) 8 GiB file of all sorts - manpages, executables, configuration files, shared libraries, kernel images. All sorts of ~~stuff~~ junk.

Let's compress:

 1: 100 %   1,608.7 MiB / 8,070.5 MiB = 0.199   2.7 MiB/s      50:18
 2: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203   5.3 MiB/s      25:33
 4: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    10 MiB/s      12:49
 6: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    15 MiB/s       9:07
 8: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    18 MiB/s       7:23
10: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    22 MiB/s       6:02
16: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    32 MiB/s       4:14
24: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    36 MiB/s       3:44
32: 100 %   1,638.2 MiB / 8,070.5 MiB = 0.203    39 MiB/s       3:26

Let's retry the run again with a 100 MiB block size:

 1: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205   2.7 MiB/s      49:47
 2: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205   5.5 MiB/s      24:38
 4: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    11 MiB/s      12:31
 6: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    15 MiB/s       8:41
 8: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    20 MiB/s       6:46
10: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    24 MiB/s       5:40
16: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    34 MiB/s       3:59
24: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    41 MiB/s       3:18
32: 100 %   1,651.2 MiB / 8,070.5 MiB = 0.205    46 MiB/s       2:54

Not bad, it appears to be a trend that going from single-threaded operations to any operation with more than one thread results in a very slight increase in the compressed file size, but this does not appear to scale up or down with the number of threads. That is to say, if you want to multi-thread a compression to optimise primarily for speed, you do not need to feel guilty that throwing, say, 8 threads at it rather than 2 is giving any worse a compression ratio.

TL;DR

Turns out the gains will most likely always far outweigh the (nearly immeasurable) decrease in compression ratio.

Footnotes

I know, this isn't the perfect world and things don't quite scale linearly.