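Redis 6.0.5 zset reading notes 2: skip list (zskiplist), paper translation part 4, probabilistic analysis (another very hard page: the skip list is simple to implement, but its analysis is not)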

PROBABILISTIC ANALYSIS

In addition to analyzing the expected performance of skip lists,
we can also analyze the probabilistic performance of skip lists.
This will allow us to calculate the probability that an operation takes longer than a specified time.
This analysis is based on the same ideas as our analysis of the expected cost, so that analysis should be understood first.

A random variable has a fixed but unpredictable value and a predictable probability distribution and average. If X is a random variable,
Prob{ X = t } denotes the probability that X equals t and Prob{ X > t } denotes the probability that X is greater than t.
For example, if X is the number obtained by throwing an unbiased die, Prob{ X > 3 } = 1/2.
(Note: the faces 4, 5 and 6 are half of the six equally likely outcomes 1, 2, 3, 4, 5, 6.)

It is often preferable to find simple upper bounds on values whose exact value is difficult to calculate.
To discuss upper bounds on random variables,
we need to define a partial ordering and equality on the probability distributions of nonnegative random variables.

Definitions (=prob and ≤prob).
Let X and Y be non-negative independent random variables
(typically, X and Y would denote the time to execute algorithms A(X) and A(Y)).
We define X ≤prob Y to be true if and only if for any value t,
the probability that X exceeds t is no greater than the probability that Y exceeds t. More formally:
X =prob Y iff ∀ t, Prob{ X > t } = Prob{ Y > t } and
X ≤prob Y iff ∀ t, Prob{ X > t } ≤ Prob{ Y > t }. ■
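The two definitions can be checked mechanically for small finite distributions. The following is a minimal sketch, not part of the paper: the helper names `tail`, `leq_prob` and `eq_prob` and the example distributions are assumptions made for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def tail(pmf, t):
    """Prob{ X > t } for a finite distribution given as {value: probability}."""
    return sum(prob for value, prob in pmf.items() if value > t)

def leq_prob(pmf_x, pmf_y):
    """True if Prob{ X > t } <= Prob{ Y > t } for every t.

    Both tail functions are step functions that change only at support points,
    so it is enough to test one t below the smallest point and every support point.
    """
    points = sorted(set(pmf_x) | set(pmf_y))
    thresholds = [points[0] - 1] + points
    return all(tail(pmf_x, t) <= tail(pmf_y, t) for t in thresholds)

def eq_prob(pmf_x, pmf_y):
    """True if the two tail functions agree everywhere."""
    return leq_prob(pmf_x, pmf_y) and leq_prob(pmf_y, pmf_x)

die = {face: Fraction(1, 6) for face in range(1, 7)}           # unbiased die
die_plus_one = {face + 1: Fraction(1, 6) for face in range(1, 7)}
one_or_six = {1: Fraction(1, 2), 6: Fraction(1, 2)}            # hypothetical coin: 1 or 6

print(tail(die, 3))                 # 1/2
print(leq_prob(die, die_plus_one))  # True
print(leq_prob(die, one_or_six), leq_prob(one_or_six, die))  # False False
```

The last line shows why ≤prob is only a partial ordering: the tail curves of `die` and `one_or_six` cross, so neither bounds the other.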

For example, the graph in Figure 7 shows the probability distribution of three random variables X, Y and Z.
Since the probability distribution curve for X is completely under the curves for Y and Z, X ≤prob Y and X ≤prob Z.
Since the probability curves for Y and Z intersect, neither Y ≤prob Z nor Z ≤prob Y.
Since the expected value of a random variable X is simply the area under the curve Prob{ X > t },
if X ≤prob Y then the average of X is less than or equal to the average of Y.
*********************FIGURE 7 – Plots of three probability distributions****************************
x Prob{ X > t }
y Prob{ Y > t }
z Prob{ Z > t }
    1
    |xzyyyyyyy
    |xx  z    yyyy
    | xx     z    yyyy
    |   xx       z    yyy
Prob|     xxx        z   yy
    |       xxxx         z yy
    |           xxxxx        zyy
    |                xxxxx     yz
    |                    xxxxx   y z
    |                         xxxxx y z
   0|------------------------------xxxyxx-z----------------- t
*********************FIGURE 7 – Plots of three probability distributions****************************
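The statement that the expected value is the area under the curve Prob{ X > t } can be checked on a small example. This is an illustrative sketch only: it assumes a non-negative integer-valued X, for which the area reduces to the sum of Prob{ X > t } over t = 0, 1, 2, ..., and the function names are made up for the example.

```python
from fractions import Fraction

def mean(pmf):
    """Ordinary expected value of a distribution given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())

def area_under_tail(pmf):
    """Sum of Prob{ X > t } over t = 0, 1, 2, ... for a non-negative integer X."""
    top = max(pmf)  # Prob{ X > t } is zero for every t >= top
    return sum(
        sum(prob for value, prob in pmf.items() if value > t)
        for t in range(top)
    )

die = {face: Fraction(1, 6) for face in range(1, 7)}
print(mean(die), area_under_tail(die))  # 7/2 7/2
```

Because X ≤prob Y means the tail curve of X never rises above that of Y, the area under it (the average) cannot be larger either.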

We make use of two probability distributions:
Definition (binomial distribution, B(t, p)).
Let t be a non-negative integer and p be a probability.
The term B(t, p) denotes a random variable equal to the number of successes seen in a series of t independent random trials
where the probability of a success in a trial is p. The average and variance of B(t, p) are tp and tp(1 – p) respectively. ■

Note (derivation of the average and variance). Write X = X1 + X2 + ... + Xt, where Xi is 1 if trial i succeeds and 0 otherwise.
E(Xi) = 1 * p + 0 * (1-p) = p
The expectation of a sum is the sum of the expectations, so
E(X) = t * E(Xi) = tp
Alternatively, apply the definition of expectation directly:
E(X) = SUM( k * C(t,k) * p^k * (1-p)^(t-k), k = 0, 1, ..., t )
Since k * C(t,k) = t * C(t-1,k-1), this can be rewritten as
E(X) = tp * SUM( C(t-1,k-1) * p^(k-1) * (1-p)^(t-k), k = 1, 2, ..., t ) = tp * (p + (1-p))^(t-1) = tp

E(Xi^2) = p * 1^2 + (1-p) * 0^2 = p
D(Xi) = E(Xi^2) - E(Xi)^2 = p - p^2 = p * (1-p)
Or directly from the definition of variance:
D(Xi) = p * (1-p)^2 + (1-p) * (0-p)^2 = p * (1-p)
Because the trials are independent, the variances add, so D(X) = t * p * (1-p)

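The average tp and variance tp(1 – p) can also be confirmed by summing over the exact distribution. The sketch below is only a numeric check under assumed example values (t = 10, p = 1/4); the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(t, p):
    """Exact distribution of B(t, p): {k: Prob{ exactly k successes in t trials }}."""
    return {k: comb(t, k) * p**k * (1 - p) ** (t - k) for k in range(t + 1)}

def mean_and_variance(pmf):
    """Mean and variance computed straight from the definitions."""
    mean = sum(k * prob for k, prob in pmf.items())
    second_moment = sum(k * k * prob for k, prob in pmf.items())
    return mean, second_moment - mean**2

t, p = 10, Fraction(1, 4)  # example values chosen for illustration
mean, variance = mean_and_variance(binomial_pmf(t, p))
print(mean, t * p)                # 5/2 5/2
print(variance, t * p * (1 - p))  # 15/8 15/8
```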

Definition (negative binomial distribution, NB(s, p)).
Let s be a non-negative integer and p be a probability.
The term NB(s, p) denotes a random variable equal to the number of failures seen before the (s)th success in a series of random independent trials
where the probability of a success in a trial is p. The average and variance of NB(s, p) are s(1–p)/p and s(1–p)/p^2 respectively. ■

Note (derivation of the average and variance). Let k be the trial on which the s-th success occurs, so that X = k - s failures were seen, with
Prob{ X = k-s } = C(k-1,s-1) * p^s * (1-p)^(k-s), k = s, s+1, ...
Probabilities sum to 1, and this holds for every s:
SUM( C(k-1,s-1) * p^s * (1-p)^(k-s), k = s, s+1, ... ) = 1

The k = s term contributes nothing to the sums below, so they start at k = s+1. Using (k-s) * C(k-1,s-1) = s * C(k-1,s):
E(X) = SUM( (k-s) * C(k-1,s-1) * p^s * (1-p)^(k-s), k = s+1, s+2, ... )
     = SUM( s * C(k-1,s) * p^s * (1-p)^(k-s), k = s+1, s+2, ... )
     = s * (1-p)/p * SUM( C(k-1,s) * p^(s+1) * (1-p)^(k-s-1), k = s+1, s+2, ... )
     = s * (1-p)/p
(the last sum is the total probability for s+1 successes, which is 1)

Using (k-s)^2 * C(k-1,s-1) = s(s+1) * C(k-1,s+1) + s * C(k-1,s):
E(X^2) = SUM( (k-s)^2 * C(k-1,s-1) * p^s * (1-p)^(k-s), k = s+1, s+2, ... )
       = SUM( [ s(s+1) * C(k-1,s+1) + s * C(k-1,s) ] * p^s * (1-p)^(k-s), k = s+1, s+2, ... )
       = s(s+1) * (1-p)^2 / p^2 + s * (1-p)/p
So
D(X) = E(X^2) - E(X)^2 = s(s+1) * (1-p)^2 / p^2 + s * (1-p)/p - [ s * (1-p)/p ]^2
     = s * (1-p)^2 / p^2 + s * (1-p)/p
     = s(1-p)/p^2

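As a sanity check on the closed forms s(1–p)/p and s(1–p)/p^2, the infinite sums can be truncated and evaluated numerically. This is a rough sketch, assuming s >= 1 and example values s = 3, p = 0.5; the cut-off of 200 failures is an arbitrary choice that leaves a negligible tail for these values.

```python
from math import comb

def nb_pmf(s, p, failures):
    """Prob{ NB(s, p) = failures }: the s-th success lands on trial s + failures (s >= 1)."""
    return comb(s + failures - 1, s - 1) * p**s * (1 - p) ** failures

def truncated_moments(s, p, max_failures):
    """Total mass, mean and variance from the first max_failures + 1 terms of the series."""
    probs = [nb_pmf(s, p, f) for f in range(max_failures + 1)]
    mass = sum(probs)
    mean = sum(f * q for f, q in enumerate(probs))
    second_moment = sum(f * f * q for f, q in enumerate(probs))
    return mass, mean, second_moment - mean**2

s, p = 3, 0.5  # example values chosen for illustration
mass, mean, variance = truncated_moments(s, p, 200)
print(mass, mean, variance)                 # close to 1, 3 and 6
print(s * (1 - p) / p, s * (1 - p) / p**2)  # 3.0 6.0
```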

Probabilistic analysis of search cost
The number of leftward movements we need to make before we move up a level (in an infinite list) has a negative binomial distribution:
it is the number of failures (situation b) we see before we see the first success (situation c) in a series of independent random trials,
where the probability of success is p. Using the probabilistic notation introduced above:
Cost to climb one level in an infinite list
=prob 1+ NB(1, p).
(Note: the 1 is the move up one level; NB(1, p) is the number of failed attempts, i.e. leftward moves, made before that move up.)

We can sum the costs of climbing each level to get the total cost to climb up to level L(n):
Cost to climb to level L(n) in an infinite list
=prob (L(n) – 1) + NB(L(n) – 1, p).
Our assumption that the list is infinite is a pessimistic assumption:
Cost to climb to level L(n) in a list of n elements
≤prob (L(n) – 1) + NB(L(n) – 1, p).
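The climb through an infinite list can be simulated directly as a series of independent trials. The sketch below is an illustration rather than anything from the paper or from Redis: `climb_cost`, the seed, and the values levels = 4 (standing in for L(n) – 1), p = 0.5 and 20000 runs are all assumptions. The average of (L(n) – 1) + NB(L(n) – 1, p) is (L(n) – 1)/p, so the sample mean should land near 8.

```python
import random

def climb_cost(levels, p, rng):
    """Simulate climbing `levels` levels of an infinite list.

    Each step is one independent trial: with probability p we move up a level
    (situation c), otherwise we move left (situation b). Every step costs 1.
    """
    cost = 0
    climbed = 0
    while climbed < levels:
        cost += 1
        if rng.random() < p:
            climbed += 1
    return cost

rng = random.Random(0)  # fixed, arbitrarily chosen seed so the run is repeatable
levels, p, trials = 4, 0.5, 20000
sample_mean = sum(climb_cost(levels, p, rng) for _ in range(trials)) / trials
print(sample_mean, levels / p)  # the sample mean should be close to 8.0
```

Each run costs `levels` upward moves plus however many leftward moves occurred, which is exactly the form (L(n) – 1) + NB(L(n) – 1, p).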

Once we have climbed to level L(n),
the number of leftward movements is bounded by the number of elements of level L(n) or greater in a list of n elements.
The number of elements of level L(n) or greater in a list of n elements is a random variable of the form B(n, 1/np).
(Note: a single node reaches level L(n) or above with probability p^(L(n)-1) = p^L(n) / p = 1/np, because n * p^L(n) = 1.)
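The probability 1/np can be verified with exact arithmetic. In this small sketch the values p = 1/2 and L = 16 are example assumptions, and n is chosen so that n * p^L = 1, which makes L play the role of L(n).

```python
from fractions import Fraction

p = Fraction(1, 2)
L = 16
n = (1 / p) ** L            # n = 65536, so that n * p**L == 1 and L is L(n)
prob_reach = p ** (L - 1)   # a node has level >= L with probability p^(L-1)

print(prob_reach == 1 / (n * p))  # True
print(n * prob_reach, 1 / p)      # 2 2
```

The second line is the average of B(n, 1/np): on average only 1/p nodes reach level L(n) or higher, regardless of n.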

Let M be a random variable corresponding to the maximum level in a list of n elements.
The probability that the level of a node is greater than k is p^k, so Prob{ M > k } = 1 – (1 – p^k)^n < n * p^k.
Since n * p^k = p^(k – L(n)) and Prob{ NB(1, 1 – p) + 1 > i } = p^i,
we get a probabilistic upper bound of M ≤prob L(n) + NB(1, 1 – p) + 1.
Note that the average of L(n) + NB(1, 1 – p) + 1 is L(n) + 1/(1 – p).
(Note: the inequality 1 – (1 – p^k)^n < n * p^k was explained in part 2 of these notes.)
(Note: n * p^k = p^(k – L(n)) because n * p^L(n) = 1, so p^(k – L(n)) = p^k / p^L(n) = p^k * n.)
(Note: in NB(1, 1 – p) each trial fails with probability p, so NB(1, 1 – p) + 1 > i means that the first i trials all fail, which has probability p^i.)
(Note: taking i = k – L(n) gives Prob{ L(n) + NB(1, 1 – p) + 1 > k } = p^(k – L(n)) = n * p^k > Prob{ M > k }, which is the stated bound.)
(Note: the average of NB(1, 1 – p) is (1 – (1 – p))/(1 – p) = p/(1 – p), so the average of L(n) + NB(1, 1 – p) + 1 is L(n) + p/(1 – p) + 1 = L(n) + 1/(1 – p).)
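The bound on the maximum level can be compared with the exact probability for a concrete list size. This is a numeric sketch under assumed values (n = 2^16, p = 0.5, so L(n) = 16); the function name is hypothetical.

```python
def prob_max_level_exceeds(n, p, k):
    """Exact Prob{ M > k } when each of n nodes independently exceeds level k with probability p**k."""
    return 1 - (1 - p**k) ** n

n, p, L = 2**16, 0.5, 16  # example values with n * p**L == 1
rows = []
for k in range(L, L + 6):
    exact = prob_max_level_exceeds(n, p, k)
    bound = p ** (k - L)  # equals n * p**k, i.e. Prob{ NB(1, 1 - p) + 1 > k - L }
    rows.append((k, exact, bound))
    print(k, round(exact, 6), bound)

# Average of the bounding variable L(n) + NB(1, 1 - p) + 1
print(L + 1 / (1 - p))  # 18.0
```

For every k shown the exact probability stays below the bound, and the gap narrows as k grows because 1 – (1 – x)^n approaches nx for small x.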

This gives a probabilistic upper bound on the cost once we have reached level L(n) of B(n, 1/np) + (L(n) + NB(1, 1 – p) + 1) – L(n).
Combining our results to get a probabilistic upper bound on the total length of the search path (i.e., cost of the entire search):

total cost to climb out of a list of n elements
≤prob (L(n) – 1) + NB(L(n) – 1, p) + B(n, 1/np) + NB(1, 1 – p) + 1
(Note: the terms are the L(n) – 1 upward moves, the failures (leftward moves) made while climbing those L(n) – 1 levels, and the cost after reaching level L(n).)

The expected value of our upper bound is equal to
(L(n) – 1) + (L(n) – 1)(1 – p)/p + 1/p + p/(1–p) + 1 = L(n)/p + 1/(1–p),

which is the same as our previously calculated upper bound on the expected cost of a search. The variance of our upper bound is
(L(n) – 1)(1 – p)/p^2 + (1 – 1/np)/p + p/(1 – p)^2
< (1 – p)L(n)/p^2 + p/(1 – p)^2 + (2p – 1)/p^2.
(Note: the three terms are the variances of NB(L(n) – 1, p), B(n, 1/np) and NB(1, 1 – p), obtained by substituting into the two variance formulas given with the definitions.)
Figure 8 shows a plot of an upper bound on the probability of an actual search taking substantially longer than average,
based on our probabilistic upper bound.
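Both the expected value and the variance of the upper bound can be recomputed term by term and compared with the simplified expressions. The sketch below uses exact fractions with assumed example values p = 1/4 and L(n) = 8 (so n = 4^8); the function names are made up for the example.

```python
from fractions import Fraction

def bound_mean_by_terms(L, n, p):
    climb_steps = L - 1                      # the (L(n) - 1) upward moves
    climb_failures = (L - 1) * (1 - p) / p   # average of NB(L(n) - 1, p)
    top_left_moves = n * (1 / (n * p))       # average of B(n, 1/np), which is 1/p
    top_up_moves = p / (1 - p) + 1           # average of NB(1, 1 - p) + 1
    return climb_steps + climb_failures + top_left_moves + top_up_moves

def bound_variance_by_terms(L, n, p):
    return (
        (L - 1) * (1 - p) / p**2   # variance of NB(L(n) - 1, p)
        + (1 - 1 / (n * p)) / p    # variance of B(n, 1/np)
        + p / (1 - p) ** 2         # variance of NB(1, 1 - p)
    )

def bound_variance_simplified(L, p):
    return (1 - p) * L / p**2 + p / (1 - p) ** 2 + (2 * p - 1) / p**2

p, L = Fraction(1, 4), 8
n = (1 / p) ** L  # n = 65536, so that n * p**L == 1

print(bound_mean_by_terms(L, n, p), L / p + 1 / (1 - p))  # 100/3 100/3
gap = bound_variance_simplified(L, p) - bound_variance_by_terms(L, n, p)
print(gap, 1 / (n * p**2))  # 1/4096 1/4096
```

In this example the simplified variance exceeds the exact sum by 1/(n p^2), which is positive but tiny for large n, so the strict inequality holds with very little slack.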
posted on 2021-02-02 20:50 by 子虚乌有