Quick Python zlib vs bz2 benchmark

// http://log.bthomson.com/2011/01/quick-python-gzip-vs-bz2-benchmark.html

I use the zlib module a lot on Google App Engine; often the tiny CPU time for decompression is a good tradeoff to save disk space. I was curious how bz2 compares, so I ran this short benchmark.
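As a refresher, the zlib API is one call in each direction; a minimal round-trip sketch (the repeated sample text here is just a stand-in for the book used in the benchmark below):

```python
import zlib

# Highly repetitive sample text stands in for the plaintext book.
data = b"the quick brown fox jumps over the lazy dog\n" * 1000

# compress() takes an optional level: 0 (store only) through 9 (best); 6 is the default.
packed = zlib.compress(data, 6)

# decompress() needs no level; the compressed stream is self-describing.
assert zlib.decompress(packed) == data
print(len(data), len(packed), round(len(data) / len(packed), 3))
```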

The test file was this plaintext book, a highly compressible source. Columns are: level, time, bytes uncompressed, bytes compressed, ratio.

% ./bench.zsh
  
zlib compress
0   6.98ms 640599 640700 1.000
1  21.22ms 640599 274195 2.336
2  25.08ms 640599 261638 2.448
3  34.24ms 640599 249649 2.566
4  36.41ms 640599 241500 2.653
5  54.24ms 640599 232545 2.755
6  77.22ms 640599 228621 2.802
7  87.94ms 640599 228032 2.809
8 112.49ms 640599 227622 2.814
9 113.03ms 640599 227622 2.814
  
zlib decompress
0   1.54ms
1   6.39ms
2   6.13ms
3   6.02ms
4   6.22ms
5   5.96ms
6   5.94ms
7   5.90ms
8   5.89ms
9   5.94ms
  
bz2 compress
1 105.30ms 640599 196752 3.256
2 103.42ms 640599 186082 3.443
3 105.40ms 640599 180905 3.541
4 104.95ms 640599 177642 3.606
5 113.12ms 640599 176232 3.635
6 110.45ms 640599 173153 3.700
7 113.06ms 640599 169634 3.776
8 110.27ms 640599 169634 3.776
9 111.43ms 640599 169634 3.776
  
bz2 decompress
1  36.40ms
2  35.79ms
3  36.35ms
4  36.81ms
5  41.18ms
6  44.86ms
7  48.96ms
8  48.45ms
9  47.95ms

Conclusion: probably not worth it. bz2 at level=4 takes about six times longer to decompress than zlib at level=9 (roughly 37ms vs 6ms) for only a modest improvement in the compression ratio, from 2.8 to 3.6.
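The figures in that conclusion can be recomputed directly from the table above; a quick sanity check using the level-9 zlib and level-4 bz2 rows:

```python
# Numbers taken from the benchmark output above.
zlib9_decompress_ms = 5.94
bz2_4_decompress_ms = 36.81

# Decompression slowdown of bz2 level 4 relative to zlib level 9.
print(round(bz2_4_decompress_ms / zlib9_decompress_ms, 1))  # 6.2

# Compression ratio = uncompressed bytes / compressed bytes.
print(round(640599 / 227622, 3))  # 2.814 (zlib, level 9)
print(round(640599 / 177642, 3))  # 3.606 (bz2, level 4)
```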

Interestingly, for write-heavy workloads bz2 may actually be the better choice, since its compression time is not much worse than zlib at level=9.

I think it's better not to use the timeit module for this kind of benchmark, since in typical usage you will just be compressing/decompressing some given data once. If the operations speed up on repeat runs due to caching (and they do), that doesn't reflect typical usage. Starting a new Python process for each test seems to reduce cache effects.

Anyway, here is the code.

import zlib
import bz2
import time
import sys

# argv: <level> <use_zlib: 1|0> <decompress: 1|0>
level = int(sys.argv[1])
mod = zlib if int(sys.argv[2]) else bz2
is_decompress = int(sys.argv[3])

with open("pg4238.txt") as f:
    data = f.read()

# For decompression runs, prepare the compressed input outside the timed region.
if is_decompress:
    c_data = mod.compress(data, level)

t = time.time()
if is_decompress:
    data = mod.decompress(c_data)
else:
    c_data = mod.compress(data, level)

print level, "%6.02fms" % (1000 * (time.time() - t)),
if not is_decompress:
    print len(data), len(c_data), "%.03f" % (float(len(data)) / len(c_data))
#!/usr/bin/zsh
echo 'zlib compress'
for level in {0..9}; do python bench.py $level 1 0; done
echo '\nzlib decompress'
for level in {0..9}; do python bench.py $level 1 1; done
echo '\nbz2 compress'
for level in {1..9}; do python bench.py $level 0 0; done
echo '\nbz2 decompress'
for level in {1..9}; do python bench.py $level 0 1; done
posted @ 2012-04-24 17:59 dtozg