Introduction to HyperLogLog

Unique items can be difficult to count. Usually this means storing every unique item and then counting what was stored. With Redis, this can be accomplished by using a set and a single command; however, for very large sets both the storage and the time cost of this approach are prohibitive. HyperLogLog provides a probabilistic alternative.

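HyperLogLog is similar to a Bloom filter internally in that it runs items through a non-cryptographic hash and updates a compact, fixed-size structure. Unlike a Bloom filter, which answers membership queries, HyperLogLog keeps a small array of registers, each recording the longest run of leading zero bits seen among the hashes routed to it. Adding an item that was already added changes nothing, so the registers reflect only distinct items. This gives a low error rate when estimating the number of unique items (the cardinality) of a set; Redis documents a standard error of 0.81%. HyperLogLog is built right into Redis.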

There are three HyperLogLog commands in Redis: PFADD, PFCOUNT and PFMERGE.
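
As a rough illustration of how the three commands fit together, here is a minimal sketch. It assumes a client object that exposes pfadd, pfcount and pfmerge the way the redis-py package does; the key names and the function name are made up for this example.

```python
def count_weekend_visitors(client):
    """Sketch only: `client` is assumed to be a redis.Redis instance
    (redis-py package) connected to a running Redis server.
    The key names below are hypothetical."""
    # PFADD: add elements to a HyperLogLog (the key is created if missing).
    client.pfadd("visitors:sat", "alice", "bob", "carol")
    client.pfadd("visitors:sun", "bob", "dave")

    # PFCOUNT: approximate number of distinct elements in one key.
    saturday = client.pfcount("visitors:sat")

    # PFMERGE: store the union of several HyperLogLogs in a destination key.
    client.pfmerge("visitors:weekend", "visitors:sat", "visitors:sun")
    weekend = client.pfcount("visitors:weekend")

    return saturday, weekend
```

With a real server this would be called as count_weekend_visitors(redis.Redis()) after import redis. The returned values are estimates, although for a handful of elements like this they are normally exact.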

What is HyperLogLog? | Redisson [The text below differs slightly from the original: where the original says "bytes" it should say "bits"]

Interested readers can check out the original HyperLogLog research paper.

The basic concept of HyperLogLog is fairly simple to understand. If you observe an element in the set that consists of n bits, this increases the likelihood that the cardinality of the set is 2 to the power of n. For example, suppose we sample a set of integers and get the value 6, which is represented as the number 110 in binary. Since this number's representation has 3 bits, this makes it likely that the set's cardinality is 8 (i.e. 2 to the power of 3). With information this limited, that guess is simply the best of a poor lot.
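
The single-observation guess described above can be written down in a few lines. This is only a sketch of the simplified intuition, not of HyperLogLog itself, and the function name is made up for illustration.

```python
def naive_cardinality_guess(samples):
    """Guess 2**n, where n is the bit length of the longest sampled value.
    Hypothetical helper illustrating the intuition only."""
    longest = max(value.bit_length() for value in samples)
    return 2 ** longest

# 6 is 110 in binary (3 bits), so the guess is 2**3.
print(naive_cardinality_guess([6]))  # 8
```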

The cardinality of a set is a measure of the "number of elements" of the set. For example, the set A={2,4,6} contains 3 elements, and therefore A has a cardinality of 3. The term comes from "cardinal number" (one, two, three), which expresses quantity, as opposed to the ordinal numbers first, second, third.

Of course, observing just a single element will probably not lead to a good estimate on its own. Instead, the HyperLogLog algorithm splits the set into many smaller subsets and keeps one running observation per subset, refining it as more elements arrive. Finally, it combines the per-subset estimates with a mean (HyperLogLog uses the harmonic mean, which dampens the effect of outliers) to come up with a final estimate for the set's cardinality.
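
To make the "split into subsets, then average" idea concrete, here is a minimal sketch of a HyperLogLog-style counter. It makes several assumptions that are not in the text above: SHA-1 truncated to 64 bits as the hash, 2^12 registers by default, the usual bias constant, and a linear-counting fallback for small counts. It is an illustration, not Redis's implementation.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch (assumed parameters; use p >= 7)."""

    def __init__(self, p=12):
        self.p = p                    # number of hash bits used as register index
        self.m = 1 << p               # number of registers ("subsets")
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        # rank = position of the leftmost 1 bit in `rest` (leading zeros + 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant for m >= 128
        # Harmonic mean of the per-register estimates 2**rank, scaled by m.
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: linear counting on the empty registers.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"user-{i}")
print(round(hll.count()))  # an estimate near 50000, not an exact count
```

Adding the same item twice leaves the registers unchanged, which is why duplicates do not inflate the estimate.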

My favorite algorithm (and data structure): HyperLogLog (odino.org) [The text below differs slightly from the original; I may have misunderstood it]

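Ask someone for the last 5 digits of their phone number. Let's suppose you get 54701. There is no leading zero, so the longest run of leading zeros seen so far is 0.
The next person you talk to tells you it's 02561, which has one leading zero. So the longest run of leading zeros seen so far becomes 1.
Each of the 100,000 endings from 00000 to 99999 is equally likely, so on average you have to speak to about 100,000 people before you find someone whose last 5 digits are 00000. Conversely, if you have already seen 00000, a reasonable guess is that you have spoken to somewhere between 10⁴ and 10⁵ people.
It seems that in decimal, looking for a run of any other fixed digit (1, 3, 4, 5, ...) would work just as well, and in binary, looking for runs of 0s or runs of 1s is equivalent. Looking for 12345 in decimal would also work, but it is needlessly cumbersome.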
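
The phone-number game can be sketched as follows. The rule "guess 10 to the power of the longest run of leading zeros" is an assumed decimal analogue of the binary rule, and both function names are made up for this example.

```python
def leading_zeros(digits):
    """Number of '0' characters at the start of a digit string."""
    return len(digits) - len(digits.lstrip("0"))


def guess_people_asked(answers):
    """Order-of-magnitude guess: 10 ** (longest run of leading zeros seen).
    Hypothetical helper; a very rough estimate by design."""
    return 10 ** max(leading_zeros(a) for a in answers)


print(leading_zeros("54701"))                    # 0
print(leading_zeros("02561"))                    # 1
print(guess_people_asked(["54701", "02561"]))    # 10
```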

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10⁹ with a typical accuracy (standard error) of 2%, using 1.5 kB of memory. HyperLogLog is an extension of the earlier LogLog algorithm, itself deriving from the 1984 Flajolet–Martin algorithm.
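
The relationship between memory and accuracy can be made concrete with the standard-error formula from the HyperLogLog paper, roughly 1.04/√m for m registers. The sketch below just evaluates that formula; the function names are made up, and rounding up to a power of two is an assumption that matches how implementations usually size the register array.

```python
import math


def standard_error(m):
    """Approximate relative standard error of HyperLogLog with m registers."""
    return 1.04 / math.sqrt(m)


def registers_for(target_error):
    """Smallest power-of-two register count whose error is within the target."""
    needed = (1.04 / target_error) ** 2
    return 1 << math.ceil(math.log2(needed))


print(f"{standard_error(2048):.4f}")    # 0.0230, about 2%
print(f"{standard_error(16384):.4f}")   # 0.0081, the 0.81% Redis documents
print(registers_for(0.02))              # 4096
```

Halving the error requires four times as many registers, which is why the accuracy figures quoted for HyperLogLog are always tied to a memory budget.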

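"The HyperLogLog algorithm is able to estimate cardinalities of > 10⁹ ...": is an "only" missing from that sentence? :-) Is the error large when there are few elements? The Redis example does count 4 elements correctly, so is it possible that it counts exactly when there are few elements and only estimates when there are many? That would be like STL's sort, which uses insertion sort when there are few elements and switches to quicksort only for larger inputs.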

posted @ 2022-01-24 22:22 华容道专家


