Redis 6.0.5 zset reading notes 2 -- skip list (zskiplist) paper translation, part 2: algorithm analysis (this part was a bit of a struggle to read)

ANALYSIS OF SKIP LIST ALGORITHMS

The time required to execute the Search, Delete and Insert operations is dominated by the time required to search for the appropriate element.
For the Insert and Delete operations, there is an additional cost proportional to the level of the node being inserted or deleted.
The time required to find an element is proportional to the length of the search path, 
which is determined by the pattern in which elements with different levels appear as we traverse the list.
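Put differently: Search, Delete and Insert all spend most of their time locating the right element.
Insert and Delete pay an extra cost proportional to the level of the node involved (the forward pointers at each of its levels have to be updated).
Search time is proportional to the length of the search path (a longer path means more time),
and that length depends on the pattern of node levels met while traversing the list (the more elements the upper levels let us skip, the less work is left for the lower levels).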

Probabilistic Philosophy
The structure of a skip list is determined only by the number of elements in the skip list and the results of consulting the random number generator.
The sequence of operations that produced the current skip list does not matter. 
We assume an adversarial user does not have access to the levels of nodes;
otherwise, he could create situations with worst-case running times by deleting all nodes that were not level 1.
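So the shape of a skip list depends only on how many elements it holds and on what the random number generator returned;
the order of the operations that built it is irrelevant.
The adversary is assumed to have no way to read node levels. Otherwise he could force the worst case
by deleting every node whose level is not 1 (what remains is a single-level list, i.e. an ordinary linked list).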

The probabilities of poor running times for successive operations on the same data structure are NOT independent; 
two successive searches for the same element will both take exactly the same time. More will be said about this later.
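That is, one slow operation tells us something about the next one: the structure has not changed, so repeating the same search repeats exactly the same cost.
This point is picked up again under "Sequences of operations".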

Analysis of expected search cost
We analyze the search path backwards, travelling up and to the left. 
Although the levels of nodes in the list are known and fixed when the search is performed, 
we act as if the level of a node is being determined only when it is observed while backtracking the search path.
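The search path is analyzed in reverse, moving up and to the left.
The node levels are already known and fixed by the time a search runs,
but the analysis pretends that each node's level is only decided at the moment the node is observed while backtracking along the path.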

At any particular point in the climb,  we are at a situation similar to situation a in Figure 6 –
we are at the i th forward pointer of a node x and we have no knowledge about the levels of nodes to the left of x or about the level of x,
other than that the level of x must be at least i. 
Assume that x is not the header (this is equivalent to assuming the list extends infinitely to the left).
If the level of x is equal to i, then we are in situation b. If the level of x is greater than i, then we are in situation c.
The probability that we are in situation c is p. Each time we are in situation c, we climb up a level. 
Let C(k) = the expected cost (i.e, length) of a search path that climbs up k levels in an infinite list:
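At any point of the climb we are in something like situation a of Figure 6:
we stand at the i-th forward pointer of a node x and know nothing about the levels of the nodes to the left of x, nor about the level of x itself, except that it is at least i.
Assume x is not the header (equivalent to the list extending infinitely to the left).
If the level of x is exactly i, we are in situation b; if it is greater than i, we are in situation c.
Situation c occurs with probability p, and each time it occurs we climb one level.
Let C(k) be the expected cost (i.e. length) of a search path that climbs k levels in an infinite list: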
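C(0) = 0  (nothing left to climb)
C(k) = (1-p) (cost in situation b) + p (cost in situation c)
       situation b: stay on this level, probability 1-p;  situation c: climb one level, probability p
By substituting and simplifying, we get:
C(k) = (1-p) (1 + C(k)) + p (1 + C(k-1))
(1-p) is the probability of staying on the same level; in 1 + C(k), the 1 is the one hop to the left and C(k) is the cost still to pay, since k levels remain.
p is the probability of climbing (levels are assigned with probability p, so a node reaches the next level with probability p); in 1 + C(k-1), the 1 is the one hop up and C(k-1) is the cost of the remaining k-1 levels.
(Example for a single level: C(1) = (1-p) (1 + C(1)) + p (1 + C(0)), which gives C(1) = 1/p.)
Moving the C(k) terms to the left gives p C(k) = 1 + p C(k-1), so
C(k) = 1/p + C(k-1)
C(k) is an arithmetic sequence with common difference 1/p, so C(k) = C(0) + k/p, that is
C(k) = k/p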
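
The closed form C(k) = k/p can be checked numerically. The following is only a sketch and not part of the paper; the function names are made up for illustration. It solves the recurrence exactly with fractions, then simulates the backward climb under the same infinite-list assumption, where every step costs one hop and goes up with probability p.

```python
import random
from fractions import Fraction

def climb_cost_exact(k, p):
    """Solve C(k) = (1-p)(1 + C(k)) + p(1 + C(k-1)), C(0) = 0, level by level."""
    cost = Fraction(0)
    for _ in range(k):
        # Rearranged recurrence: p*C(k) = 1 + p*C(k-1)
        cost = (1 + p * cost) / p
    return cost

def climb_cost_simulated(k, p, trials, rng):
    """Average number of hops needed to climb k levels in an infinite list."""
    total_hops = 0
    for _ in range(trials):
        levels_left = k
        while levels_left > 0:
            total_hops += 1           # every step, up or left, costs one hop
            if rng.random() < p:      # situation c: the node is taller, climb
                levels_left -= 1
            # otherwise situation b: move left, levels_left is unchanged
    return total_hops / trials

p = Fraction(1, 4)
for k in range(5):
    assert climb_cost_exact(k, p) == k / p
print(climb_cost_exact(4, p))                                  # exact value, 16
print(climb_cost_simulated(4, 0.25, 20000, random.Random(1)))  # should land near 16
```

The simulation only mirrors the model used in the analysis (no header, independent levels); it does not build a real skip list.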

-----------FIGURE 6 - Possible situations in backwards traversal of the search path--------
         Need to climb k levels from here (at SP)
                    -----
                    | SP|--->
                    -----
                    | x |    
                **situation a**     ---------------
                      |                         |
                  probability = 1-p          probability = p
                      |                         |              
                      V                         V
-----                    -----                -----
| SP|------------------->|   |--->            | SP|--->
-----   -----     -----  -----                -----
|   |    |   | ... |   |  | x |                |   |--->
situation b                                   -----
Still need to climb k levels from here (at SP) | x |
                                            situation c: need to climb only k-1 levels from here (at SP)
              
------------FIGURE 6 - Possible situations in backwards traversal of the search path-------------

Our assumption that the list is infinite is a pessimistic assumption. 
When we bump into the header in our backwards climb, we simply climb up it, without performing any leftward movements.
This gives us an upper bound of (L(n)–1)/p on the expected length of the path that climbs from level 1 to level L(n) in a list of n elements.
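Assuming an infinite list is pessimistic: when the backwards climb reaches the header, we just go straight up it and make no further leftward moves.
This bounds the expected length of the path from level 1 to level L(n) in a list of n elements by (L(n)-1)/p (plug k = L(n)-1 into C(k) = k/p above).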

We use this analysis to go up to level L(n) and use a different analysis technique for the rest of the journey.
The number of leftward movements remaining is bounded by the number of elements of level L(n) or higher in the entire list,
which has an expected value of 1/p.
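This analysis covers the climb up to level L(n); the rest of the climb is handled with a different technique.
The number of leftward moves still to make is at most the number of elements of level L(n) or higher in the whole list, whose expected value is 1/p
(a node reaches level L(n) with probability p^(L(n)-1), and n p^(L(n)-1) = 1/p because L(n) = log_{1/p} n means p^L(n) = 1/n).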

We also move upwards from level L(n) to the maximum level in the list.
The probability that the maximum level of the list is greater than k is equal to 1-(1-p^k)^n,
which is at most n p^k. We can calculate that the expected maximum level is at most L(n) + 1/(1-p).
Putting our results together, we find
Total expected cost to climb out of a list of n elements
≤ L(n)/p + 1/(1–p)
which is O(log n).
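We also have to move up from level L(n) to the highest level in the list. The probability that the maximum level is greater than k is 1-(1-p^k)^n.
(Where does this come from? Our analysis:
the probability that a single node's level exceeds k is
p^k (1-p)  +  p^(k+1) (1-p)  +  ...  +  p^t (1-p),  t -> infinity
where the first term means "draws 1..k all fall below p and draw k+1 does not", the second means "draws 1..k+1 all fall below p and draw k+2 does not", and so on without end.
Summing the geometric series gives p^k (1-p) / (1-p) = p^k.
Or directly: once the first k draws are all below p, the level exceeds k whatever happens afterwards, which gives p^k at once.

The probability that one node's level does not exceed k is therefore 1-p^k; with n nodes, the probability that none exceeds k is (1-p^k)^n,
so the probability that at least one node exceeds k is 1-(1-p^k)^n, which is the result above.)

This value is at most n p^k
(by Bernoulli's inequality, (1-x)^n >= 1-nx for 0 <= x <= 1; for small x the two sides are nearly equal, so n p^k is also a good approximation).

The expected maximum level can then be shown to be at most L(n) + 1/(1-p).
(Our guess at how this is obtained: since L = log_{1/p} n, we have n p^L = 1,
so the bound n p^k on "maximum level greater than k" equals 1 at k = L; put another way, on average exactly one node is left above level L.
Treating that node as growing one more level with probability p each time, the expected maximum level is
E[max level] = (L+1) (1-p)  +  (L+2) (1-p) p  +  (L+3) (1-p) p^2  +  ...
               P(level L+1)    P(level L+2)     P(level L+3)
which is an arithmetic-geometric series and sums to L + 1/(1-p).
The same value also comes out as a genuine upper bound, taking L as an integer: E[max level] = sum over k >= 0 of P(max level > k) <= L + sum over k >= L of n p^k = L + 1/(1-p).)

Putting these together, the total expected cost to climb out of a list of n elements is at most L(n)/p + 1/(1-p).
(How? The three pieces above are (L(n)-1)/p to reach level L(n), at most 1/p leftward moves after that, and at most 1/(1-p) upward moves from L(n) to the top:
(L(n)-1)/p + 1/p + 1/(1-p) = L(n)/p + 1/(1-p).
Applying C(k) = k/p to a climb of L(n) + 1/(1-p) - 1 levels gives the same value: (L(n) + 1/(1-p) - 1)/p = L(n)/p + 1/(1-p).)
So the complexity is O(log n). (It took almost two days to work through the translation of this passage.)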
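
The maximum-level formulas can be sanity-checked with a few lines of Python. This is a rough numerical sketch, not something taken from the paper: the function names are invented here, the infinite sum is truncated after a fixed number of terms, and L(n) is taken to be log base 1/p of n.

```python
import math

def prob_max_level_exceeds(k, n, p):
    """P(at least one of n nodes has level > k) = 1 - (1 - p^k)^n."""
    return 1.0 - (1.0 - p ** k) ** n

def expected_max_level(n, p, terms=200):
    """E[M] = sum over k >= 0 of P(M > k), truncated after `terms` terms."""
    return sum(prob_max_level_exceeds(k, n, p) for k in range(terms))

def climb_out_bound(n, p):
    """(L(n)-1)/p + 1/p + 1/(1-p), with L(n) = log base 1/p of n."""
    L = math.log(n) / math.log(1.0 / p)
    return (L - 1) / p + 1 / p + 1 / (1 - p)

for n, p in [(256, 0.5), (4096, 0.5), (4096, 0.25)]:
    L = math.log(n) / math.log(1.0 / p)
    print(n, p,
          "E[max level] ~", round(expected_max_level(n, p), 3),
          "bound", round(L + 1 / (1 - p), 3),
          "climb-out bound", round(climb_out_bound(n, p), 3))
```

For each pair the computed expectation should sit between L(n) and L(n) + 1/(1-p), and the climb-out bound equals L(n)/p + 1/(1-p).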


Number of comparisons
Our result is an analysis of the “length” of the search path.
The number of comparisons required is one plus the length of the search path 
(a comparison is performed for each position in the search path,
the “length” of the search path is the number of hops between positions in the search path).
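The result above is about the length of the search path; the number of comparisons is that length plus one
(every position on the path costs one comparison, the length counts the hops between positions, and a path with h hops visits h+1 positions).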

Probabilistic Analysis
It is also possible to analyze the probability distribution of search costs. 
The probabilistic analysis is somewhat more complicated (see box). From the probabilistic analysis, 
we can calculate an upper bound on the probability that the actual cost of a search exceeds the expected cost by more than a specified ratio.
Some results of this analysis are shown in Figure 8.
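The probability distribution of search costs can be analyzed too. That analysis is somewhat more involved (the "box" is a sidebar of the original paper and is not reproduced in these notes).
It yields an upper bound on the probability that the actual cost of a search exceeds the expected cost by more than a given ratio.
Some of its results are shown in Figure 8.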
****************************FIGURE 8 Ratio of actual cost to expected cost****************************************
Curve 1  p = 1/4, n = 256
Curve 2  p = 1/4, n = 4,096
Curve 3  p = 1/4, n = 65,536

Curve 4  p = 1/2, n = 256
Curve 5  p = 1/2, n = 4,096
Curve 6  p = 1/2, n = 65,536

332211111111--------------------------------------------------------------------------------------|  1
543-222----1111111111111111111111-----------------------------------------------------------------|
54433-222222---------------------11111111111111111------------------------------------------------|  10^-1 (i.e. 0.1)
6544-333----222222222222--------------------------11111111111111111111----------------------------|
65-44---3333------------2222222222222222222222-----------------------11111111111111111------------|  10^-2
65--444-----33333333--------------------------22222222222222222222--------------------111111111111|
65----4444---------33333333333333333333--------------------------222222222222222222---------------|  10^-3
6-555----444---------------------------333333333333333333333333-------------------2222222222222222|
6--555------444444---------------------------------------------3333333333333333333----------------|  10^-4
6----555---------44444444444444---------------------------------------------------3333333333333333|
-66-----5555555----------------44444444444444444--------------------------------------------------|  10^-5
---666---------5555555555555--------------------44444444444444444---------------------------------|
------6666---------------5555555555555---------------------------44444444444444444----------------|  10^-6
----------66666-----------------------55555555555555555555------------------------4444444444444444|
---------------6666666666---------------------------------5555555555555555555---------------------|  10^-7
----------------------666666666666666---------------------------------------5555555555555---------|
-------------------------------------66666666666666666-----------------------------------55555----|  10^-8
------------------------------------------------------66666666666666--------------------------5555|
1-----------------------------------------------2-------------------66666666666-------------------3  10^-9

****************************FIGURE 8 Ratio of actual cost to expected cost***************************************************
FIGURE 8 - This graph shows a plot of an upper bound on the probability of a search taking substantially longer than expected.
(Each curve is an upper bound on the probability that a search takes far longer than expected.)
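The vertical axis shows the probability that the length of the search path for a search exceeds the average length by more than the
ratio on the horizontal axis.
(So a point on a curve reads: the search path is longer than the average length by more than the ratio on the horizontal axis with at most the probability on the vertical axis.)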
For example, for p = 1/2 and n = 4096, the probability that the search path will be more than three
times the expected length is less than one in 200 million. This graph was calculated using our probabilistic upper bound.
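(For p = 1/2 and n = 4096, a search path more than three times the expected length has probability below 1 in 200 million; the graph was computed from the probabilistic upper bound.)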
*********************************************************************************************************************************

Choosing p
Table 1 gives the relative times and space requirements for different values of p. 
Decreasing p also increases the variability of running times. 
If 1/p is a power of 2, it will be easy to generate a random level from a stream of random bits 
(it requires an average of (log2(1/p))/(1–p) random bits to generate a random level). 
Since some of the constant overheads are related to L(n) (rather than L(n)/p), 
choosing p = 1/4 (rather than 1/2) slightly improves the constant factors of the speed of the algorithms as well.
I suggest that a value of 1/4 be used for p unless the variability of running times is a primary concern,
in which case p should be 1/2.
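Table 1 lists the relative time and space requirements for different values of p.
Decreasing p increases the variability of running times (the running times spread over a wider range).
If 1/p is a power of 2, a random level is easy to generate from a stream of random bits
(on average (log2(1/p))/(1-p) random bits are needed per level).
(Where does (log2(1/p))/(1-p) come from?
1/(1-p) is the average number of draws per node: the first draw always happens, a second one with probability p, a third with probability p^2, and so on, so over n nodes the total is n + np + np^2 + ... = n/(1-p), i.e. 1/(1-p) per node.
log2(1/p) is the number of random bits per draw: 1 bit distinguishes 2 cases (0, 1), 2 bits distinguish 4 cases (00, 01, 10, 11), and so on,
so log2(1/p) bits are exactly enough to realize an event of probability p. The total is therefore (log2(1/p))/(1-p) bits.)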
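
To make the bit-count argument concrete, here is a small sketch of level generation from a bit stream. It is an illustration only: `random_level`, `expected_bits` and the `max_level` cap are names and choices made up for this note, not the paper's or Redis's code, and the cap slightly lowers the true average compared with the uncapped formula.

```python
import math
import random

def random_level(bits, p_inverse, max_level):
    """Draw a level using log2(1/p) bits per trial; returns (level, bits_used).

    `bits` is any iterator of 0/1 values; `p_inverse` is 1/p and must be a power of 2.
    """
    bits_per_trial = p_inverse.bit_length() - 1   # log2(1/p) for a power of 2
    level, used = 1, 0
    while level < max_level:
        chunk = [next(bits) for _ in range(bits_per_trial)]
        used += bits_per_trial
        if any(chunk):     # not all zero: probability 1 - p, stop here
            break
        level += 1         # all zero: probability p, go one level higher
    return level, used

def expected_bits(p):
    """Average bits per level according to the text: log2(1/p) / (1 - p)."""
    return math.log2(1 / p) / (1 - p)

print(random_level(iter([0, 0, 0, 0, 1, 0]), 4, 16))   # (3, 6): two successes, then a stop

rng = random.Random(0)
random_bits = iter(lambda: rng.getrandbits(1), None)   # endless stream of 0/1
draws = [random_level(random_bits, 4, 32) for _ in range(20000)]
print(sum(used for _, used in draws) / len(draws), "vs", expected_bits(0.25))
```

With p = 1/4 every trial reads 2 bits and succeeds only on 00, so the measured average should come out close to 2/(1 - 1/4), about 2.67 bits per level.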

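Because some of the constant overheads are tied to L(n) (rather than L(n)/p), choosing p = 1/4 (rather than 1/2) also slightly improves the constant factors of the algorithms' speed
(L(n) = log_{1/p} n, so for fixed n a smaller p gives a smaller L(n), and the overheads tied to L(n) shrink with it).
The paper's author therefore recommends p = 1/4 unless the variability of running times is the primary concern (Figure 8 shows that running times are more stable with p = 1/2), in which case p should be 1/2.
*******************TABLE 1 – Relative search speed and space requirements,depending on the value of p.**********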
--------------------------------------------------------------------------------------------------------------
p       Normalized search times (i.e., normalized L(n)/p)    Avg. # of pointers per node (i.e., 1/(1 - p))
--------------------------------------------------------------------------------------------------------------
1/2     1                                                     2
1/e     0.94...                                               1.58...
1/4     1                                                     1.33...
1/8     1.33...                                               1.14...
1/16    2                                                     1.07...
--------------------------------------------------------------------------------------------------------------
*******************TABLE 1 – Relative search speed and space requirements,depending on the value of p.***********
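
The two columns of Table 1 follow directly from L(n)/p and 1/(1-p), and can be recomputed as below. This is a sketch with made-up function names; it relies on L(n)/p = ln(n) / (p ln(1/p)), so the dependence on n cancels when normalizing against p = 1/2.

```python
import math

def search_time_factor(p):
    """L(n)/p divided by ln(n): 1 / (p * ln(1/p)); the n-dependence cancels."""
    return 1.0 / (p * math.log(1.0 / p))

def table_row(p):
    """Return (search time normalized to p = 1/2, average pointers per node)."""
    normalized_time = search_time_factor(p) / search_time_factor(0.5)
    pointers_per_node = 1.0 / (1.0 - p)
    return normalized_time, pointers_per_node

rows = [("1/2", 0.5), ("1/e", 1 / math.e), ("1/4", 0.25), ("1/8", 0.125), ("1/16", 0.0625)]
for label, p in rows:
    time, pointers = table_row(p)
    print(f"{label:5s} {time:.3f} {pointers:.3f}")
```

The printed values should agree with the table up to rounding; p = 1/e is where p ln(1/p) is largest, which is why it gives the smallest normalized search time.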

Sequences of operations
The expected total time for a sequence of operations is equal to the sum of the expected times of each of the operations in the sequence.
Thus, the expected time for any sequence of m searches in a data structure that contains n elements is O(m log n).
However, the pattern of searches affects the probability distribution of the actual time to perform the entire sequence of operations.
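The expected total time of a sequence is the sum of the expected times of its operations,
so any sequence of m searches in a structure holding n elements takes O(m log n) expected time (O(log n) each, m of them).
The pattern of the searches does, however, change the probability distribution of the actual total time.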

If we search for the same item twice in the same data structure, both searches will take exactly the same amount of time. 
Thus the variance of the total time will be four times the variance of a single search. If the search times for two elements are independent, 
the variance of the total time is equal to the sum of the variances of the individual searches.
Searching for the same element over and over again maximizes the variance.
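Searching for the same item twice in the same data structure costs exactly the same both times,
so the variance of the total time is four times the variance of a single search; if the search times of two elements are independent, the variance of the total is the sum of the individual variances.
Searching for the same element over and over maximizes the variance.
(Since D(X) = E(X^2) - [E(X)]^2, we get D(2X) = E((2X)^2) - [E(2X)]^2 = 4E(X^2) - 4[E(X)]^2 = 4D(X).
D(X+Y) = E((X+Y)^2) - [E(X+Y)]^2 = E(X^2 + Y^2 + 2XY) - (E(X) + E(Y))^2
= E(X^2) + E(Y^2) + 2E(XY) - [E(X)]^2 - [E(Y)]^2 - 2E(X)E(Y).
When X and Y are independent, E(XY) = E(X)E(Y), so this equals E(X^2) + E(Y^2) - [E(X)]^2 - [E(Y)]^2 = D(X) + D(Y).)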
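
The two variance identities can be verified exactly on a toy distribution. The sketch below is only an illustration: the cost distribution is invented, and the helper names are not from the paper. Exact fractions are used so the comparisons are equalities rather than approximations.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mean = sum(x * w for x, w in pmf.items())
    return sum(w * (x - mean) ** 2 for x, w in pmf.items())

def scaled(pmf, factor):
    """Distribution of factor * X (the same search repeated `factor` times)."""
    return {factor * x: w for x, w in pmf.items()}

def sum_of_independent(pmf_a, pmf_b):
    """Distribution of X + Y for independent X and Y."""
    total = {}
    for (a, wa), (b, wb) in product(pmf_a.items(), pmf_b.items()):
        total[a + b] = total.get(a + b, 0) + wa * wb
    return total

# Hypothetical search-cost distribution: cost -> probability
search_cost = {3: Fraction(1, 2), 5: Fraction(1, 3), 9: Fraction(1, 6)}

same_twice = scaled(search_cost, 2)                          # X + X = 2X
two_independent = sum_of_independent(search_cost, search_cost)

assert variance(same_twice) == 4 * variance(search_cost)       # D(2X) = 4 D(X)
assert variance(two_independent) == 2 * variance(search_cost)  # D(X+Y) = D(X) + D(Y)
print(variance(search_cost), variance(same_twice), variance(two_independent))
```

Repeating the same search doubles the variance compared with two independent searches of the same cost distribution, which is the sense in which searching for one element over and over is the worst case for variance.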
posted on 2021-01-29 17:21 子虚乌有