Using the Multi-threaded Compression Tool pigz

Anyone learning Linux will come across a handful of compression tools: gzip, bzip2, zip, xz, and their decompression counterparts. For a comparison of how these tools are used, and of their compression ratios and compression times, see the earlier post "Linux中归档压缩工具学习" (Learning archive and compression tools in Linux).

So what is pigz? In short, it is gzip with parallel compression support. By default pigz spawns one compression thread per logical CPU; if the CPU count cannot be detected, it falls back to 8 threads. You can also set the thread count explicitly with -p. Note that its CPU usage is correspondingly high.

Installation

yum install pigz

Usage

 
$ pigz --help
Usage: pigz [options] [files ...]
  will compress files in place, adding the suffix '.gz'.  If no files are
  specified, stdin will be compressed to stdout.  pigz does what gzip does,
  but spreads the work over multiple processors and cores when compressing.

Options:
  -0 to -9, -11        Compression level (11 is much slower, a few % better)
  --fast, --best       Compression levels 1 and 9 respectively
  -b, --blocksize mmm  Set compression block size to mmmK (default 128K)
  -c, --stdout         Write all processed output to stdout (won't delete)
  -d, --decompress     Decompress the compressed input
  -f, --force          Force overwrite, compress .gz, links, and to terminal
  -F  --first          Do iterations first, before block split for -11
  -h, --help           Display a help screen and quit
  -i, --independent    Compress blocks independently for damage recovery
  -I, --iterations n   Number of iterations for -11 optimization
  -k, --keep           Do not delete original file after processing
  -K, --zip            Compress to PKWare zip (.zip) single entry format
  -l, --list           List the contents of the compressed input
  -L, --license        Display the pigz license and quit
  -M, --maxsplits n    Maximum number of split blocks for -11
  -n, --no-name        Do not store or restore file name in/from header
  -N, --name           Store/restore file name and mod time in/from header
  -O  --oneblock       Do not split into smaller blocks for -11
  -p, --processes n    Allow up to n compression threads (default is the
                       number of online processors, or 8 if unknown)
  -q, --quiet          Print no messages, even on error
  -r, --recursive      Process the contents of all subdirectories
  -R, --rsyncable      Input-determined block locations for rsync
  -S, --suffix .sss    Use suffix .sss instead of .gz (for compression)
  -t, --test           Test the integrity of the compressed input
  -T, --no-time        Do not store or restore mod time in/from header
  -v, --verbose        Provide more verbose output
  -V  --version        Show the version of pigz
  -z, --zlib           Compress to zlib (.zz) instead of gzip format
  --                   All arguments after "--" are treated as files

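As a quick tour of the most common options above (-k to keep the original, -t to test integrity, -d/-c to decompress to stdout), here is a minimal sketch on a throwaway file. It assumes pigz is installed; since pigz writes standard gzip streams, plain gzip is used as a stand-in when pigz is absent (gzip just lacks the multi-threading).

```shell
# Pick pigz if available; gzip produces the identical stream format.
GZ=$(command -v pigz || command -v gzip)

printf 'hello pigz\n' > demo.txt

"$GZ" -k demo.txt          # compress to demo.txt.gz, keeping the original (-k)
"$GZ" -t demo.txt.gz       # test the integrity of the compressed file
"$GZ" -dc demo.txt.gz      # decompress to stdout, leaving the archive intact
```

With pigz you would typically add `-p n` to any of these to control the thread count.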
Size of the original directory:

 
[20:30 root@hulab /DataBase/Human/hg19]$ du -h
8.1G    ./refgenome
1.4G    ./encode_anno
4.2G    ./hg19_index/hg19
8.1G    ./hg19_index
18G .

Next we compress the hg19_index directory with gzip and then with pigz at several thread counts, and compare the run times.

 
### Compress with gzip (single-threaded)
[20:30 root@hulab /DataBase/Human/hg19]$ time tar -czvf index.tar.gz hg19_index/
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    5m28.824s
user    5m3.866s
sys 0m35.314s
### Compress with pigz, 4 threads
[20:36 root@hulab /DataBase/Human/hg19]$ ls
encode_anno  hg19_index  index.tar.gz  refgenome
[20:38 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 4 > index_p4.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    1m18.236s
user    5m22.578s
sys 0m35.933s
### Compress with pigz, 8 threads
[20:42 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 8 > index_p8.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m42.670s
user    5m48.527s
sys 0m28.240s
### Compress with pigz, 16 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 16 > index_p16.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m23.643s
user    6m24.054s
sys 0m24.923s
### Compress with pigz, 32 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 32 > index_p32.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m17.523s
user    7m27.479s
sys 0m29.283s

### Decompress
[21:00 root@hulab /DataBase/Human/hg19]$ time pigz -p 8 -d index_p8.tar.gz

real    0m27.717s
user    0m30.070s
sys 0m22.515s
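Note that `pigz -d index_p8.tar.gz` above only strips the gzip layer, leaving `index_p8.tar` on disk. To decompress and unpack in a single pipeline, feed pigz output straight into tar. A minimal self-contained sketch (a small `src` demo directory stands in for the real archive, and gzip stands in for pigz when pigz is not installed, since the stream formats are identical):

```shell
# Pick pigz if available; gzip reads/writes the same format.
GZ=$(command -v pigz || command -v gzip)

# Build a small demo archive.
mkdir -p src && printf 'data\n' > src/a.txt
tar -cf - src | "$GZ" > src.tar.gz

# Decompress and extract in one step.
mkdir -p out
"$GZ" -dc src.tar.gz | tar -xf - -C out

# With GNU tar you can equivalently let tar drive the compressor:
#   tar -I pigz -xf src.tar.gz        # or --use-compress-program=pigz
```

The same `-I pigz` trick works for creation (`tar -I pigz -cf archive.tar.gz dir/`), which avoids the explicit pipe used in the benchmarks above.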

 

Comparison of compression times:

Program   Threads   Time
gzip      1         5m28.824s
pigz      4         1m18.236s
pigz      8         0m42.670s
pigz      16        0m23.643s
pigz      32        0m17.523s

As the table shows, multi-threaded compression with pigz shortens compression time dramatically: going from single-threaded gzip to 4-thread pigz cut the wall-clock time by roughly a factor of four, while adding further threads brings diminishing returns.
Although pigz greatly reduces run time, it does so at the cost of CPU: the user time grows from about 5 minutes with gzip to about 7.5 minutes at 32 threads. On machines where CPU is already contended, avoid very high thread counts; 4 or 8 threads is generally a good choice.
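When choosing a thread count, it helps to know how many online processors the machine has, since that is also pigz's default for -p. It can be inspected with standard tools:

```shell
# Number of online processors, two equivalent ways:
nproc
getconf _NPROCESSORS_ONLN
```

Passing `-p` with a value at or below this number (or below it, on shared machines) keeps pigz from monopolizing the CPUs.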

posted on 2020-05-23 12:18 by aixer95