Quick Python zlib vs bz2 benchmark

// http://log.bthomson.com/2011/01/quick-python-gzip-vs-bz2-benchmark.html

I use the zlib module a lot on Google App Engine; often the tiny CPU time for decompression is a good tradeoff to save disk space. I was curious how bz2 compares, so I ran this short benchmark.
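As a refresher, the zlib API is one call in each direction; a minimal round-trip sketch (the repeated sample text here is just a stand-in for the book used in the benchmark below):

```python
import zlib

# Highly repetitive sample text stands in for the plaintext book.
data = b"the quick brown fox jumps over the lazy dog\n" * 1000

# compress() takes an optional level: 0 (store only) through 9 (best); 6 is the default.
packed = zlib.compress(data, 6)

# decompress() needs no level; the compressed stream is self-describing.
assert zlib.decompress(packed) == data
print(len(data), len(packed), round(len(data) / len(packed), 3))
```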

The test file was this plaintext book, a highly compressible source. Columns are: level, time, bytes uncompressed, bytes compressed, ratio.

% ./bench.zsh
  
zlib compress
0   6.98ms 640599 640700 1.000
1  21.22ms 640599 274195 2.336
2  25.08ms 640599 261638 2.448
3  34.24ms 640599 249649 2.566
4  36.41ms 640599 241500 2.653
5  54.24ms 640599 232545 2.755
6  77.22ms 640599 228621 2.802
7  87.94ms 640599 228032 2.809
8 112.49ms 640599 227622 2.814
9 113.03ms 640599 227622 2.814
  
zlib decompress
0   1.54ms
1   6.39ms
2   6.13ms
3   6.02ms
4   6.22ms
5   5.96ms
6   5.94ms
7   5.90ms
8   5.89ms
9   5.94ms
  
bz2 compress
1 105.30ms 640599 196752 3.256
2 103.42ms 640599 186082 3.443
3 105.40ms 640599 180905 3.541
4 104.95ms 640599 177642 3.606
5 113.12ms 640599 176232 3.635
6 110.45ms 640599 173153 3.700
7 113.06ms 640599 169634 3.776
8 110.27ms 640599 169634 3.776
9 111.43ms 640599 169634 3.776
  
bz2 decompress
1  36.40ms
2  35.79ms
3  36.35ms
4  36.81ms
5  41.18ms
6  44.86ms
7  48.96ms
8  48.45ms
9  47.95ms

Conclusion: probably not worth it. bz2 at level=4 takes about six times longer to decompress than zlib at level=9 (roughly 37ms vs 6ms) for only a modest improvement in the compression ratio, from 2.8 to 3.6.
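The figures in that conclusion can be recomputed directly from the table above; a quick sanity check using the level-9 zlib and level-4 bz2 rows:

```python
# Numbers taken from the benchmark output above.
zlib9_decompress_ms = 5.94
bz2_4_decompress_ms = 36.81

# Decompression slowdown of bz2 level 4 relative to zlib level 9.
print(round(bz2_4_decompress_ms / zlib9_decompress_ms, 1))  # 6.2

# Compression ratio = uncompressed bytes / compressed bytes.
print(round(640599 / 227622, 3))  # 2.814 (zlib, level 9)
print(round(640599 / 177642, 3))  # 3.606 (bz2, level 4)
```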

Interestingly, for write-heavy workloads bz2 may actually be the better choice, since its compression time is not much worse than zlib at level=9.

I think it's better not to use the timeit module for this kind of benchmark, since in typical usage you will just be compressing/decompressing some given data once. If the operations speed up on repeat runs due to caching (and they do), that doesn't reflect typical usage. Starting a new Python process for each test seems to reduce cache effects.

Anyway, here is the code.

import zlib
import bz2
import time
import sys

# argv: <level> <use_zlib: 1|0> <decompress: 1|0>
level = int(sys.argv[1])
mod = zlib if int(sys.argv[2]) else bz2
is_decompress = int(sys.argv[3])

with open("pg4238.txt") as f:
    data = f.read()

# For decompression runs, prepare the compressed input outside the timed region.
if is_decompress:
    c_data = mod.compress(data, level)

t = time.time()
if is_decompress:
    data = mod.decompress(c_data)
else:
    c_data = mod.compress(data, level)

print level, "%6.02fms" % (1000 * (time.time() - t)),
if not is_decompress:
    print len(data), len(c_data), "%.03f" % (float(len(data)) / len(c_data))
#!/usr/bin/zsh
echo 'zlib compress'
for level in {0..9}; do python bench.py $level 1 0; done
echo '\nzlib decompress'
for level in {0..9}; do python bench.py $level 1 1; done
echo '\nbz2 compress'
for level in {1..9}; do python bench.py $level 0 0; done
echo '\nbz2 decompress'
for level in {1..9}; do python bench.py $level 0 1; done
posted @ 2012-04-24 17:59 dtozg