/* The Redis HyperLogLog implementation is based on the following ideas:
Redis的超对数实现是基于以下的想法:
* * The use of a 64 bit hash function as proposed in [1], in order to don't
* limited to cardinalities up to 10^9, at the cost of just 1 additional
* bit per register.
使用文1推荐的64位的哈希函数,仅用对每个寄存器增加一个比特的代价,就可以突破基数个数10的9次方的限制。
* * The use of 16384 6-bit registers for a great level of accuracy, using
* a total of 12k per key.
为了获得一个比较高的精度,使用16384个6比特的寄存器,每个键值使用了12k(2的14次方 * 6 = 12k)
* * The use of the Redis string data type. No new type is introduced.
* * No attempt is made to compress the data structure as in [1]. Also the
* algorithm used is the original HyperLogLog Algorithm as in [2], with
* the only difference that a 64 bit hash function is used, so no correction
* is performed for values near 2^32 as in [1].
使用的是redis的字符串类型,没有引入新的数据类型。
没有采用文1中压缩数据结构的算法。同时使用了文2中原始的超对数算法,唯一不同的是使用了64位的哈希函数,
所以不用对类似文1中接近2^32的值做修正。
* [1] Heule, Nunkesser, Hall: HyperLogLog in Practice: Algorithmic
* Engineering of a State of The Art Cardinality Estimation Algorithm.
*
* [2] P. Flajolet, Eric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The
* analysis of a near-optimal cardinality estimation algorithm.
(这两篇文章留到后面再慢慢翻译)
* Redis uses two representations:
redis使用两种表示方式:
* 1) A "dense" representation where every entry is represented by
* a 6-bit integer.
密集表示方式 每个实体用6比特整数来表示
* 2) A "sparse" representation using run length compression suitable
* for representing HyperLogLogs with many registers set to 0 in
* a memory efficient way.
稀疏表示方式 使用行程长的压缩方式表示具有很多个0值寄存器的超级日志,这是一中非常有效的内存表示方法
*
* HLL header HLL的头部
* ===
*
* Both the dense and sparse representation have a 16 byte header as follows:
密集和稀疏的表示都拥有一个16个字节的如下头部:
* +------+---+-----+----------+
* | HYLL | E | N/U | Cardin. |
* +------+---+-----+----------+
*
* The first 4 bytes are a magic string set to the bytes "HYLL".
* "E" is one byte encoding, currently set to HLL_DENSE or
* HLL_SPARSE. N/U are three not used bytes.
开始的4个字节是一个魔法数,被设置为HYLL.
E表示一个字节的编码,当前用来设置HLL_DENSE或者HLL_SPARSE。N/U是3个还么有被使用的字节。
* The "Cardin." field is a 64 bit integer stored in little endian format
* with the latest cardinality computed that can be reused if the data
* structure was not modified since the last computation (this is useful
* because there are high probabilities that HLLADD operations don't
* modify the actual data structure and hence the approximated cardinality).
Cardin.字段 是一个小端格式保存的最近计算得到64位的集合中元素个数。
如果自从上次计算之后数据结构没有发生变化,那么可以这个数值可以重复使用。
(这个是很有用的,因为HLLADD操作大概率不会修改实际的数据结构,从而不需要修改近似的集合中元素个数)
* When the most significant bit in the most significant byte of the cached
* cardinality is set, it means that the data structure was modified and
* we can't reuse the cached value that must be recomputed.
一旦缓存基数的最重要字节的最重要的比特位被设置,
那就意味着数据结构被修改并且我们不能使用缓存的值,必须要重新计算
* Dense representation 密集表示
* ===
*
* The dense representation used by Redis is the following:
redis使用的密集表示如下
* +--------+--------+--------+------// //--+
* |11000000|22221111|33333322|55444444 .... |
* +--------+--------+--------+------// //--+
*
* The 6 bits counters are encoded one after the other starting from the
* LSB to the MSB, and using the next bytes as needed.
6个比特位从lsb到msb依次编码,根据情况使用下一个字节的比特位。
* Sparse representation 稀疏表示
* ===
*
* The sparse representation encodes registers using a run length
* encoding composed of three opcodes, two using one byte, and one using
* of two bytes. The opcodes are called ZERO, XZERO and VAL.
稀疏表示编码寄存器使用了包含3种操作符的行程长编码,其中两个操作符使用1个字节,一个操作符使用两个字节。
这三个操作符分别为 ZERO, XZERO and VAL.
* ZERO opcode is represented as 00xxxxxx. The 6-bit integer represented
* by the six bits 'xxxxxx', plus 1, means that there are N registers set
* to 0. This opcode can represent from 1 to 64 contiguous registers set
* to the value of 0.
ZERO操作符表示为00xxxxxx。6比特的整数用六个比特的'xxxxxx'表示,还要加上1,即意味着有N个寄存器被设置为0.
这个操作符可以表示从1到64个连续为0的寄存器。
* XZERO opcode is represented by two bytes 01xxxxxx yyyyyyyy. The 14-bit
* integer represented by the bits 'xxxxxx' as most significant bits and
* 'yyyyyyyy' as least significant bits, plus 1, means that there are N
* registers set to 0. This opcode can represent from 0 to 16384 contiguous
* registers set to the value of 0.
XZERO操作符用两个字节01xxxxxx yyyyyyyy来表示。14个比特的整数用高6位的比特'xxxxxx'和低8位的比特'yyyyyyyy'表示。
同时还需要加1,即意味着N个寄存器设置为0.这个操作符能够代表从0(加1的话,这里应该为1)到16384个连续为0的寄存器。
* VAL opcode is represented as 1vvvvvxx. It contains a 5-bit integer
* representing the value of a register, and a 2-bit integer representing
* the number of contiguous registers set to that value 'vvvvv'.
* To obtain the value and run length, the integers vvvvv and xx must be
* incremented by one. This opcode can represent values from 1 to 32,
* repeated from 1 to 4 times.
VAL操作符由1vvvvvxx表示,其中'vvvvv'5比特的整数表示寄存器的值,'xx'2比特的整数表示连续的值为'vvvvv'的寄存器个数。
为了获取值和行程长,整数 vvvvv 和 xx 必须加1.这个操作符表示从1到32的值,和1到4的重复次数。
* The sparse representation can't represent registers with a value greater
* than 32, however it is very unlikely that we find such a register in an
* HLL with a cardinality where the sparse representation is still more
* memory efficient than the dense representation. When this happens the
* HLL is converted to the dense representation.
稀疏表示不能表示值超过值为32的寄存器,然而,不可能在HLL算法中找到这样一个寄存器,稀疏表示任然比密集表示更加内存有效。
如果这种情况发生,那么HLL算法就改用密集表示。
* The sparse representation is purely positional. For example a sparse
* representation of an empty HLL is just: XZERO:16384.
稀疏表示纯粹是位置表示。举一个稀疏表示的例子,一个空的HLL表示,就是 XZERO:16384
* An HLL having only 3 non-zero registers at position 1000, 1020, 1021
* respectively set to 2, 3, 3, is represented by the following three
* opcodes:
一个HLL只有3个非零的寄存器,分别在位置1000, 1020, 1021,对应的值为2,3,3, 可以用如下的操作符来表示
* XZERO:1000 (Registers 0-999 are set to 0) 从0到999的寄存器都是0
* VAL:2,1 (1 register set to value 2, that is register 1000) 寄存器1000的值是2
* ZERO:19 (Registers 1001-1019 set to 0) 寄存器1001到1019的寄存器都是0
* VAL:3,2 (2 registers set to value 3, that is registers 1020,1021) 寄存器1020,1021的值都是3
* XZERO:15362 (Registers 1022-16383 set to 0) 寄存器1022到16383的值都是0
*
* In the example the sparse representation used just 7 bytes instead
* of 12k in order to represent the HLL registers. In general for low
* cardinality there is a big win in terms of space efficiency, traded
* with CPU time since the sparse representation is slower to access:
在上面的例子中,稀疏表示值使用了7个字节而不是12K表示了HLL寄存器。一般来说,对于低基数,在空间效率方面有很大优势,
因为稀疏表示访问速度比较慢,可以用CPU的时间来交换。
* The following table shows average cardinality vs bytes used, 100
* samples per cardinality (when the set was not representable because
* of registers with too big value, the dense representation size was used
* as a sample).
下面的表格展示了平均基数 和 使用字节的对比,
每基数100个样本(当集合由于值太大而无法表示时,使用密集表示大小作为样本)
* 100 267
* 200 485
* 300 678
* 400 859
* 500 1033
* 600 1205
* 700 1375
* 800 1544
* 900 1713
* 1000 1882
* 2000 3480
* 3000 4879
* 4000 6089
* 5000 7138
* 6000 8042
* 7000 8823
* 8000 9500
* 9000 10088
* 10000 10591
*
* The dense representation uses 12288 bytes, so there is a big win up to
* a cardinality of ~2000-3000. For bigger cardinalities the constant times
* involved in updating the sparse representation is not justified by the
* memory savings. The exact maximum length of the sparse representation
* when this implementation switches to the dense representation is
* configured via the define server.hll_sparse_max_bytes.
*/
密集表示法使用12288字节,因此在基数为~2000-3000的情况下有很大的优势。
对更大的基数,更新稀疏表示的常量时间不是简单的内存节省。
当实现切换到密集表示的时,准确的稀疏表示的最大长度被配置在变量 server.hll_sparse_max_bytes中
config.c中 默认的初始化配置为3000
createSizeTConfig("hll-sparse-max-bytes", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.hll_sparse_max_bytes, 3000, MEMORY_CONFIG, NULL, NULL)