ZSTD Notes

Related material

Testing compression ratios with different dictionary sizes

Sample set: 102 MB (107,155,190 bytes); sample count: 173,842
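These corpus statistics can be reproduced with something like the following (a sketch, assuming GNU find and awk):

find request/request -type f -printf '%s\n' \
  | awk '{ n++; s += $1 } END { printf "files=%d bytes=%d avg=%.0f\n", n, s, s/n }'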

Compression ratio without a dictionary

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress request/request/* --output-dir-flat req-c

Even with the multithreading flags set, CPU usage is still only about 12.4%! IO sits at 0.
Setting a memory limit doesn't raise usage either (as expected: -M caps memory, it doesn't reserve it); memory stays flat at 16.8 MB with no fluctuation.
Reading the input took roughly 20+ minutes.
Once the progress counter appeared, CPU actually dropped to about 4%, memory rose to 33.8 MB, and IO hovered around 1.2 MB.
Started at 10:16:07, finished at 10:43:46; total compression time 00:27:39.
173842 files compressed : 60.55% ( 102 MiB => 61.9 MiB)
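Integrity of the compressed output can be spot-checked with --test (documented in the CLI help at the end of these notes):

zstd --test req-c/*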

Try training at ZSTD's minimum dictionary size, 256 bytes

zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic

CPU usage only about 12.4%, very little IO
Start time: 11:31:43
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (102 MB); training on 0 MB only...
Trying 82 different sets of parameters
d=6
Total number of training samples is 1 and is invalid. Failed to initialize context
dictionary training failed : Src size is incorrect
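A plausible reading of the warning above: -M1024 is parsed as a 1024-byte cap on training data, which rounds down to 0 MB, leaving zstd a single empty sample. The next run simply drops the flag; if a cap is actually wanted, a size suffix should avoid the clamp (untested sketch):

zstd --train --train-cover --maxdict=256 -M1024MB request/request/* -o req.256.dic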

zstd --verbose --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress --train --train-cover --maxdict=256 request/request/* -o req.256.dic

CPU usage only about 12.4%, very little IO
Start time: 11:56:32
End time: 14:07:38
Training time: 02:11:06
Training ratio: 107155190/256 = 418,574×
k=146
d=6
steps=40
split=100

zstd --verbose --ultra -22 -T0 --auto-threads=logical --progress -D req.256.dic --output-dir-flat req-c-256 request/request/*

173842 files compressed : 49.99% ( 102 MiB => 51.1 MiB)
Start time: 14:24:46
End time: 15:02:44
Compression time: 00:37:58
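Note that files compressed with -D can only be decompressed with the same dictionary (-D works in both directions per the help below). A sketch, with a hypothetical file name:

zstd -d -D req.256.dic req-c-256/sample.zst -o sample.out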

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.256.dic -o req.256.c.dic

Compressed dictionary size = 101.56% ( 256 B => 260 B, req.256.c.dic)
Savings efficiency = (100-49.99)/256×1024 = 200.04, i.e. each KB of dictionary buys a ~200% reduction in the compression ratio
Improvement over no dictionary = 60.55-49.99 = 10.56%
Improvement efficiency = (60.55-49.99)/256×1024 = 42.24, i.e. a 42.24% improvement per KB of dictionary
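The per-KB efficiency metrics used throughout these notes can be recomputed with a one-liner; for this run:

# (ratio_without_dict - ratio_with_dict) / dict_bytes * 1024
awk 'BEGIN { printf "%.2f\n", (60.55 - 49.99) / 256 * 1024 }'   # => 42.24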

Average sample size: 107155190/173842 = 616 bytes

zstd --verbose --train --train-cover --maxdict=616 request/request/* -o req.616.dic

CPU usage only about 12.4%, very little IO
Start time: 18:51:07
First training message appeared at: 19:12:56 (reading took ≈22 minutes)
Training time = 03:56:51
Total training time = 04:18:40
Training ratio: ≈173,842× (= the sample count, since the dictionary size equals the average sample size)
k=242
d=6
steps=40
split=100

zstd -D req.616.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616

173842 files compressed : 37.78% ( 102 MiB => 38.6 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616.dic -o req.616.c.dic

Compressed dictionary size = 89.12% ( 616 B => 549 B, req.616.c.dic)
Savings efficiency = (100-37.78)/549×1024 = 116.05, i.e. a ~116% compression-ratio reduction per KB of dictionary
Improvement over no dictionary = 60.55-37.78 = 22.77%
Improvement efficiency = (60.55-37.78)/549×1024 = 42.47, i.e. a ~42% improvement per KB of dictionary

Improvement over the 256-byte dictionary = 49.99-37.78 = 12.21%
Improvement efficiency over the 256-byte dictionary = (49.99-37.78)/(549-256)×1024 = 42.67, i.e. a ~42% improvement per KB

Set the dictionary size to 10× the average sample size: 6166 bytes = 6.02 KB

zstd --verbose --train --train-cover --maxdict=6166 request/request/* -o req.6166.dic

CPU usage only about 12.4%, very little IO
Start time: 19:10:41
First training message appeared at: 19:33:20 (reading took ≈22 minutes)
Total training time = 03:59:06
Training ratio: 107155190/6166 = 17,378×
k=1250
d=8
steps=40
split=100

zstd -D req.6166.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6166

173842 files compressed : 21.22% ( 102 MiB => 21.7 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.dic -o req.6166.c.dic

Compressed dictionary size = 43.01% ( 6.02 KiB => 2.59 KiB, req.6166.c.dic)
Savings efficiency = (100-21.22)/2652×1024 = 30.42, i.e. a ~30% compression-ratio reduction per KB of dictionary
Dictionary growth vs. the average-size dictionary = 2652/549 = 4.83×
Efficiency vs. the average-size dictionary = 30.419/116.053 = 26.2% of baseline, i.e. a 73.8% drop
Improvement over no dictionary = 60.55-21.22 = 39.33%
Improvement efficiency = (60.55-21.22)/2652×1024 = 15.19, i.e. a ~15% improvement per KB of dictionary

Improvement over the 256-byte dictionary = 49.99-21.22 = 28.77%
Improvement efficiency over the 256-byte dictionary = (49.99-21.22)/(2652-256)×1024 = 12.30, i.e. a ~12% improvement per KB

Manually deleting the visible strings at the tail of the dictionary, then testing compression again

Original uncompressed dictionary: 6166 bytes; compressed: 2652 bytes (after compression no meaningful strings remain visible)
After deleting the visible strings: 151 bytes (a hex editor makes the cut precise)
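The same truncation can be scripted instead of hex-edited, assuming the readable strings occupy everything past byte 151 (untested sketch; req.6166.YeThin.dic is the file name used below):

head -c 151 req.6166.dic > req.6166.YeThin.dic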
Compression with the truncated dictionary does not fail at startup, but eventually aborts with a memory error:
zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*

Compression start time: 12:11:13
zstd: error 11 : Allocation error : not enough memory

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log --progress -D req.6166.YeThin.dic --output-dir-flat req-c-6166.YeThin ../request/request/*

Dropping the -M1024 parameter doesn't help either:
zstd: error 11 : Allocation error : not enough memory

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6166.YeThin.dic -o req.6166.YeThin.c.dic

Compressed dictionary size = 108.61% ( 151 B => 164 B, req.6166.YeThin.c.dic)

Set the dictionary size to 100× the average sample size: 61666 bytes = 60.22 KB

zstd --verbose --train --train-cover --maxdict=61666 request/request/* -o req.61666.dic

CPU usage only about 12.4%, very little IO
Start time: 19:35:56
First training message appeared at: 19:57:20 (reading took ≈22 minutes)
Total training time = 03:44:13
Training ratio: 107155190/61666 = 1,737×
k=1970
d=6
steps=40
split=100

zstd -D req.61666.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-61666

173842 files compressed : 18.49% ( 102 MiB => 18.9 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.61666.dic -o req.61666.c.dic

Compressed dictionary size = 24.78% ( 60.2 KiB => 14.9 KiB, req.61666.c.dic)
Savings efficiency = (100-18.49)/15278×1024 = 5.46, i.e. a ~5% compression-ratio reduction per KB of dictionary
Dictionary growth vs. the average-size dictionary = 15278/549 = 27.83×
Efficiency vs. the average-size dictionary = 5.463/116.053 = 4.7% of baseline, i.e. a 95.3% drop
Improvement over no dictionary = 60.55-18.49 = 42.06%
Improvement efficiency = (60.55-18.49)/15278×1024 = 2.82, i.e. a ~2% improvement per KB of dictionary

Improvement over the 256-byte dictionary = 49.99-18.49 = 31.5%
Improvement efficiency over the 256-byte dictionary = (49.99-18.49)/(15278-256)×1024 = 2.15, i.e. a ~2% improvement per KB

Set the dictionary size to 1,000× the average sample size: 616000 bytes ≈ 601.6 KB

zstd --verbose --train --train-cover --maxdict=616000 request/request/* -o req.616000.dic

CPU usage only about 12.4%, very little IO
Start time: 18:59:18
First training message appeared at: 19:21:37 (reading took ≈22 minutes)
Total training time = 03:57:46
Training ratio: ≈173×
k=1778
d=8
steps=40
split=100

zstd -D req.616000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-616000

173842 files compressed : 16.00% ( 102 MiB => 16.4 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.616000.dic -o req.616000.c.dic

Compressed dictionary size = 19.99% ( 602 KiB => 120 KiB, req.616000.c.dic)
Savings efficiency = (100-16.00)/123151×1024 = 0.698, i.e. a ~0.7% compression-ratio reduction per KB of dictionary
Dictionary growth vs. the average-size dictionary = 123151/549 = 224.32×
Efficiency vs. the average-size dictionary = 0.698/116.053 = 0.6% of baseline, i.e. a 99.4% drop
Improvement over no dictionary = 60.55-16.00 = 44.55%
Improvement efficiency = (60.55-16.00)/123151×1024 = 0.37, i.e. a 0.37% improvement per KB of dictionary

Improvement over the 256-byte dictionary = 49.99-16.00 = 33.99%
Improvement efficiency over the 256-byte dictionary = (49.99-16.00)/(123151-256)×1024 = 0.283, i.e. a 0.283% improvement per KB

Set the dictionary size to 10,000× the average sample size: 6160000 bytes = 5.87 MB

zstd --verbose --train --train-cover --maxdict=6160000 request/request/* -o req.6160000.dic

CPU usage only about 12.4%, very little IO
Start time: 18:57:15
First training message appeared at: 19:19:35 (reading took ≈22 minutes)
Total training time = 03:58:28
Training ratio: ≈17.3×
k=1922
d=8
steps=40
split=100

zstd -D req.6160000.dic --ultra -22 --progress request/request/* --output-dir-flat req-c-6160000

173842 files compressed : 10.64% ( 102 MiB => 10.9 MiB)

zstd --ultra -22 -T0 --auto-threads=logical --trace noc.log -M1024 --progress req.6160000.dic -o req.6160000.c.dic

Compressed dictionary size = 15.15% ( 5.87 MiB => 912 KiB, req.6160000.c.dic)
Savings efficiency = (100-10.64)/933457×1024 = 0.098, i.e. a 0.098% compression-ratio reduction per KB of dictionary
Dictionary growth vs. the average-size dictionary = 933457/549 = 1700.29×
Efficiency vs. the average-size dictionary = 0.098/116.053 = 0.1% of baseline, i.e. a 99.9% drop
Improvement over no dictionary = 60.55-10.64 = 49.91%
Improvement efficiency = (60.55-10.64)/933457×1024 = 0.055, i.e. a 0.055% improvement per KB of dictionary

Improvement over the 256-byte dictionary = 49.99-10.64 = 39.35%
Improvement efficiency over the 256-byte dictionary = (49.99-10.64)/(933457-256)×1024 = 0.043, i.e. a 0.043% improvement per KB


Sample set: 107 KB (110,148 bytes); sample count: 306

Average sample size: 110148/306 = 359.96 bytes

Every training run produces a dictionary of exactly 23788 bytes; any --maxdict above that yields the identical result!

zstd --verbose --train --train-cover --maxdict=110148 req/* -o req.110148.dic
zstd --verbose --train --train-cover --maxdict=110141 req/* -o req.110141.dic
zstd --verbose --train --train-cover --maxdict=108KB req/* -o req.108KB.dic
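That the three runs produce byte-identical dictionaries can be confirmed by checksum (assuming md5sum is available):

md5sum req.110148.dic req.110141.dic req.108KB.dic
# all three digests should match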

Dictionary size = 23788
Ratio to the average sample size = 23788/360 = 66×
Training ratio: 110148/23788 = 4.63×

zstd --ultra -22 --progress req.108KB.dic -o req.108KB.c.dic

req.108KB.dic : 50.25% ( 23.2 KiB => 11.7 KiB, req.108KB.c.dic)
Compressed dictionary size = 11.6 KB (11,954 bytes)

zstd -D req.108KB.dic --ultra -22 --progress req/* --output-dir-flat req-c-108KB

306 files compressed : 17.13% ( 108 KiB => 18.4 KiB)

Surprisingly, retraining with the dictionary size explicitly set to 23788 compresses even better!

zstd --verbose --train --train-cover --maxdict=23788 req/* -o req.23788.dic

Dictionary size = 23788

zstd --ultra -22 --progress req.23788.dic -o req.23788.c.dic

req.23788.dic : 28.52% ( 23.2 KiB => 6.63 KiB, req.23788.c.dic)
Compressed dictionary size = 6.62 KB (6,785 bytes)

zstd -D req.23788.dic --ultra -22 --progress req/* --output-dir-flat req-c-23788

306 files compressed : 14.37% ( 108 KiB => 15.5 KiB)

Test an arbitrary dictionary size larger than the average sample: 888 bytes

zstd --verbose --train --train-cover --maxdict=888 req/* -o req.888.dic

Dictionary size = 888

zstd --ultra -22 --progress req.888.dic -o req.888.c.dic

req.888.dic : 65.99% ( 888 B => 586 B, req.888.c.dic)
Compressed dictionary size = 586 bytes

zstd -D req.888.dic --ultra -22 --progress req/* --output-dir-flat req-c-888

306 files compressed : 32.51% ( 108 KiB => 35.0 KiB)

Use the average file size, 110148/306 = 359.96 ≈ 360 bytes, as the dictionary size

zstd --verbose --train --train-cover --maxdict=360 req/* -o req.360.dic

Dictionary size = 360

zstd --ultra -22 --progress req.360.dic -o req.360.c.dic

req.360.dic : 94.44% ( 360 B => 340 B, req.360.c.dic)
Compressed dictionary size = 340 bytes

zstd -D req.360.dic --ultra -22 --progress req/* --output-dir-flat req-c-360

306 files compressed : 47.99% ( 108 KiB => 51.6 KiB)

Try ZSTD's minimum dictionary size, 256 bytes

zstd --verbose --train --train-cover --maxdict=256 req/* -o req.256.dic

Dictionary size = 256

zstd --ultra -22 --progress req.256.dic -o req.256.c.dic

req.256.dic : 102.34% ( 256 B => 262 B, req.256.c.dic)
Compressed dictionary size = 262 bytes (it actually got bigger! Presumably the .zst frame overhead outweighs any savings on 256 bytes of near-random dictionary data.)

zstd -D req.256.dic --ultra -22 --progress req/* --output-dir-flat req-c-256

306 files compressed : 58.17% ( 108 KiB => 62.6 KiB)

Compression ratio without a dictionary

zstd --ultra -22 --progress req/* --output-dir-flat req-c

306 files compressed : 74.81% ( 108 KiB => 80.5 KiB)


Lessons learned about dictionary-training parameters

The following three spellings all produce dictionaries with identical MD5s, so apparently this is not how these options are meant to be used (see the syntax sketch after the list):
--train-cover=shrink=2
--train-cover=shrink
--train-cover
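Going by the option syntax quoted later in these notes (--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]), the sub-parameters are appended comma-separated after a single equals sign. A sketch with arbitrary values:

zstd --train --train-cover=k=200,d=8,steps=40,shrink=2 req/* -o req.cover.dic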

Dictionary training cannot saturate CPU or memory

zstd --verbose -T0 --auto-threads=logical --train -M1024 --train-cover --maxdict=616 request/request/* -o req.616.dic

Even with the multithreading flags set, CPU usage is still only about 12.4%!
Setting a memory limit doesn't raise usage either: memory stays flat at 16.8 MB with no fluctuation.
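For faster training, the help below also lists a fast cover trainer; a sketch with arbitrary parameter values:

zstd --train --train-fastcover=steps=40,accel=10 --maxdict=616 request/request/* -o req.616.fast.dic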

Help notes on the parameters

An intuitive introduction to the k-nearest-neighbors algorithm (KNN) and K-D trees, with handwritten-digit recognition! - 简书 (Jianshu)

"Kd-tree" is short for K-dimension tree
KD-tree nearest-neighbor search algorithm

zstd(1) — zstd — Debian unstable — Debian Manpages

Training works if there is some correlation in a family of small data samples. The more data-specific a dictionary is, the more efficient it is (there is no universal dictionary).
Hence, deploying one dictionary per type of data will provide the greatest benefits.
Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
--train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
If split is not specified or split <= 0, then the default value of 100 is used.
If the shrink flag is not used, then the default value for shrinkDict of 0 is used.
If shrink is not specified, then the default value for shrinkDictMaxRegression of 1 is used.
Having shrink enabled takes a truncated dictionary of minimum size and doubles its size
until the compression ratio of the truncated dictionary is at most shrinkDictMaxRegression% worse than the compression ratio of the largest dictionary.

Warnings shown when the dictionary size is set poorly

! Warning : data size of samples too small for target dictionary size
! Samples should be about 100x larger than target dictionary size
Trying 5 different sets of parameters
WARNING: The maximum dictionary size 112640 is too large compared to the source size 82775!
size(source)/size(dictionary) = 0.734863, but it should be >= 10!
This may lead to a subpar dictionary!
We recommend training on sources at least 10x, and preferably 100x the size of the dictionary!
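Following the 100x guideline in these messages, an upper bound for --maxdict can be derived from the corpus size before training (sketch, GNU find assumed):

TOTAL=$(find req -type f -printf '%s\n' | awk '{ s += $1 } END { print s }')
echo "suggested --maxdict upper bound: $((TOTAL / 100)) bytes"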

Zstandard CLI help text

*** Zstandard CLI (64-bit) v1.5.4, by Yann Collet ***

Compress or decompress the INPUT file(s); reads from STDIN if INPUT is `-` or not provided.

Usage: zstd [OPTIONS...] [INPUT... | -] [-o OUTPUT]

Options:
  -o OUTPUT                     Write output to a single file, OUTPUT.
  -k, --keep                    Preserve INPUT file(s). [Default]
  --rm                          Remove INPUT file(s) after successful (de)compression.

  -#                            Desired compression level, where `#` is a number between 1 and 19;
                                lower numbers provide faster compression, higher numbers yield
                                better compression ratios. [Default: 3]

  -d, --decompress              Perform decompression.
  -D DICT                       Use DICT as the dictionary for compression or decompression.

  -f, --force                   Disable input and output checks. Allows overwriting existing files,
                                receiving input from the console, printing output to STDOUT, and
                                operating on links, block devices, etc. Unrecognized formats will be
                                passed through as-is.

  -h                            Display short usage and exit.
  -H, --help                    Display full help and exit.
  -V, --version                 Display the program version and exit.

Advanced options:
  -c, --stdout                  Write to STDOUT (even if it is a console) and keep the INPUT file(s).

  -v, --verbose                 Enable verbose output; pass multiple times to increase verbosity.
  -q, --quiet                   Suppress warnings; pass twice to suppress errors.
  --trace LOG                   Log tracing information to LOG.

  --[no-]progress               Forcibly show/hide the progress counter. NOTE: Any (de)compressed
                                output to terminal will mix with progress counter text.

  -r                            Operate recursively on directories.
  --filelist LIST               Read a list of files to operate on from LIST.
  --output-dir-flat DIR         Store processed files in DIR.
  --[no-]asyncio                Use asynchronous IO. [Default: Enabled]

  --[no-]check                  Add XXH64 integrity checksums during compression. [Default: Add, Validate]
                                If `-d` is present, ignore/validate checksums during decompression.

  --                            Treat remaining arguments after `--` as files.

Advanced compression options:
  --ultra                       Enable levels beyond 19, up to 22; requires more memory.
  --fast[=#]                    Switch to very fast compression levels. [Default: 1]
  --adapt                       Dynamically adapt compression level to I/O conditions.
  --long[=#]                    Enable long distance matching with window log #. [Default: 27]
  --patch-from=REF              Use REF as the reference point for Zstandard's diff engine.

  -T#                           Spawn # compression threads. [Default: 1; pass 0 for core count.]
  --single-thread               Share a single thread for I/O and compression (slightly different than `-T1`).
  --auto-threads={physical|logical}
                                Use physical/logical cores when using `-T0`. [Default: Physical]

  -B#                           Set job size to #. [Default: 0 (automatic)]
  --rsyncable                   Compress using a rsync-friendly method (`-B` sets block size).

  --exclude-compressed          Only compress files that are not already compressed.

  --stream-size=#               Specify size of streaming input from STDIN.
  --size-hint=#                 Optimize compression parameters for streaming input of approximately size #.
  --target-compressed-block-size=#
                                Generate compressed blocks of approximately # size.

  --no-dictID                   Don't write `dictID` into the header (dictionary compression only).
  --[no-]compress-literals      Force (un)compressed literals.
  --[no-]row-match-finder       Explicitly enable/disable the fast, row-based matchfinder for
                                the 'greedy', 'lazy', and 'lazy2' strategies.

  --format=zstd                 Compress files to the `.zst` format. [Default]
  --format=gzip                 Compress files to the `.gz` format.
  --format=xz                   Compress files to the `.xz` format.
  --format=lzma                 Compress files to the `.lzma` format.

Advanced decompression options:
  -l                            Print information about Zstandard-compressed files.
  --test                        Test compressed file integrity.
  -M#                           Set the memory usage limit to # megabytes.
  --[no-]sparse                 Enable sparse mode. [Default: Enabled for files, disabled for STDOUT.]
  --[no-]pass-through           Pass through uncompressed files as-is. [Default: Disabled]

Dictionary builder:
  --train                       Create a dictionary from a training set of files.

  --train-cover[=k=#,d=#,steps=#,split=#,shrink[=#]]
                                Use the cover algorithm (with optional arguments).
  --train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink[=#]]
                                Use the fast cover algorithm (with optional arguments).

  --train-legacy[=s=#]          Use the legacy algorithm with selectivity #. [Default: 9]
  -o NAME                       Use NAME as dictionary name. [Default: dictionary]
  --maxdict=#                   Limit dictionary to specified size #. [Default: 112640]
  --dictID=#                    Force dictionary ID to #. [Default: Random]

Benchmark options:
  -b#                           Perform benchmarking with compression level #. [Default: 3]
  -e#                           Test all compression levels up to #; starting level is `-b#`. [Default: 1]
  -i#                           Set the minimum evaluation to time # seconds. [Default: 3]
  -B#                           Cut file into independent chunks of size #. [Default: No chunking]
  -S                            Output one benchmark result per input file. [Default: Consolidated result]
  --priority=rt                 Set process priority to real-time.