Using pigz, a Multithreaded Compression Tool
Anyone learning Linux meets a handful of compression tools: gzip, bzip2, zip, and xz, plus their companion decompressors. For how these tools are used and how they compare in compression ratio and compression time, see the earlier post on Linux archiving and compression tools (Linux中归档压缩工具学习).
So what is pigz? In short, it is gzip with parallel compression. By default pigz uses one thread per logical CPU; if it cannot detect the CPU count, it falls back to 8 threads. You can also set the thread count explicitly with -p. Note that its CPU usage is correspondingly high.
Installation
```shell
yum install pigz
```
Usage
```
$ pigz --help
Usage: pigz [options] [files ...]
  will compress files in place, adding the suffix '.gz'. If no files are
  specified, stdin will be compressed to stdout. pigz does what gzip does,
  but spreads the work over multiple processors and cores when compressing.

Options:
  -0 to -9, -11        Compression level (11 is much slower, a few % better)
  --fast, --best       Compression levels 1 and 9 respectively
  -b, --blocksize mmm  Set compression block size to mmmK (default 128K)
  -c, --stdout         Write all processed output to stdout (won't delete)
  -d, --decompress     Decompress the compressed input
  -f, --force          Force overwrite, compress .gz, links, and to terminal
  -F  --first          Do iterations first, before block split for -11
  -h, --help           Display a help screen and quit
  -i, --independent    Compress blocks independently for damage recovery
  -I, --iterations n   Number of iterations for -11 optimization
  -k, --keep           Do not delete original file after processing
  -K, --zip            Compress to PKWare zip (.zip) single entry format
  -l, --list           List the contents of the compressed input
  -L, --license        Display the pigz license and quit
  -M, --maxsplits n    Maximum number of split blocks for -11
  -n, --no-name        Do not store or restore file name in/from header
  -N, --name           Store/restore file name and mod time in/from header
  -O  --oneblock       Do not split into smaller blocks for -11
  -p, --processes n    Allow up to n compression threads (default is the
                       number of online processors, or 8 if unknown)
  -q, --quiet          Print no messages, even on error
  -r, --recursive      Process the contents of all subdirectories
  -R, --rsyncable      Input-determined block locations for rsync
  -S, --suffix .sss    Use suffix .sss instead of .gz (for compression)
  -t, --test           Test the integrity of the compressed input
  -T, --no-time        Do not store or restore mod time in/from header
  -v, --verbose        Provide more verbose output
  -V  --version        Show the version of pigz
  -z, --zlib           Compress to zlib (.zz) instead of gzip format
  --                   All arguments after "--" are treated as files
```
Original directory sizes:
```
[20:30 root@hulab /DataBase/Human/hg19]$ du -h
8.1G    ./refgenome
1.4G    ./encode_anno
4.2G    ./hg19_index/hg19
8.1G    ./hg19_index
18G     .
```
Next, we compress the hg19_index directory with gzip and with pigz at several thread counts, and compare the run times.
```shell
### Compress with gzip (single thread)
[20:30 root@hulab /DataBase/Human/hg19]$ time tar -czvf index.tar.gz hg19_index/
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    5m28.824s
user    5m3.866s
sys     0m35.314s

### Compress with pigz, 4 threads
[20:36 root@hulab /DataBase/Human/hg19]$ ls
encode_anno  hg19_index  index.tar.gz  refgenome
[20:38 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 4 > index_p4.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    1m18.236s
user    5m22.578s
sys     0m35.933s

### Compress with pigz, 8 threads
[20:42 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 8 > index_p8.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m42.670s
user    5m48.527s
sys     0m28.240s

### Compress with pigz, 16 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 16 > index_p16.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m23.643s
user    6m24.054s
sys     0m24.923s

### Compress with pigz, 32 threads
[20:43 root@hulab /DataBase/Human/hg19]$ time tar -cvf - hg19_index/ | pigz -p 32 > index_p32.tar.gz
hg19_index/
hg19_index/hg19.tar.gz
hg19_index/hg19/
hg19_index/hg19/genome.8.ht2
hg19_index/hg19/genome.5.ht2
hg19_index/hg19/genome.7.ht2
hg19_index/hg19/genome.6.ht2
hg19_index/hg19/genome.4.ht2
hg19_index/hg19/make_hg19.sh
hg19_index/hg19/genome.3.ht2
hg19_index/hg19/genome.1.ht2
hg19_index/hg19/genome.2.ht2

real    0m17.523s
user    7m27.479s
sys     0m29.283s

### Decompress
[21:00 root@hulab /DataBase/Human/hg19]$ time pigz -p 8 -d index_p8.tar.gz

real    0m27.717s
user    0m30.070s
sys     0m22.515s
```
Comparison of compression times:
| Program | Threads | Time |
|---|---|---|
| gzip | 1 | 5m28.824s |
| pigz | 4 | 1m18.236s |
| pigz | 8 | 0m42.670s |
| pigz | 16 | 0m23.643s |
| pigz | 32 | 0m17.523s |
As the table shows, multithreaded compression with pigz shortens compression time dramatically: going from single-threaded gzip to 4-thread pigz cuts the time by a factor of about 4 (5m29s to 1m18s), while each further doubling of the thread count brings progressively smaller gains.
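The speedups implied by the table can be checked with a little arithmetic (the times are copied from the runs above):

```shell
# gzip baseline: 5m28.824s -> seconds
baseline=$(awk 'BEGIN { print 5*60 + 28.824 }')

# pigz runs, each as "threads minutes seconds"
for run in "4 1 18.236" "8 0 42.670" "16 0 23.643" "32 0 17.523"; do
  set -- $run
  awk -v b="$baseline" -v p="$1" -v m="$2" -v s="$3" \
      'BEGIN { printf "pigz -p %-2s: %.1fx faster than gzip\n", p, b/(m*60+s) }'
done
```

This prints roughly 4.2x for 4 threads but only about 18.8x for 32 threads, confirming the diminishing returns.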
Although pigz greatly shortens wall-clock time, it does so at the cost of CPU: note that total CPU time (user) actually grows as threads are added. On machines where CPU is already contended, avoid very high thread counts; 4 or 8 threads is usually a reasonable choice.
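One way to follow that advice in a script is to let the thread count track the machine but cap it. The cap of 8 here is just the rule of thumb above, not a pigz requirement:

```shell
# Use the online CPU count, but never more than 8 threads
CPUS=$(nproc)
THREADS=$(( CPUS < 8 ? CPUS : 8 ))
echo "compressing with $THREADS pigz threads"

# e.g.: tar -cf - hg19_index/ | pigz -p "$THREADS" > index.tar.gz
```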