Redis 6.0.5 zset reading notes 2: skip list (zskiplist) paper translation, part 1: overview, definitions, and algorithms
Skip Lists: A Probabilistic Alternative to Balanced Trees
Author: William Pugh

Skip lists are a data structure that can be used in place of balanced trees. Skip lists use probabilistic balancing rather than strictly enforced balancing and as a result the algorithms for insertion and deletion in skip lists are much simpler and significantly faster than equivalent algorithms for balanced trees.

Binary trees can be used for representing abstract data types such as dictionaries and ordered lists. They work well when the elements are inserted in a random order. Some sequences of operations, such as inserting the elements in order, produce degenerate data structures that give very poor performance. If it were possible to randomly permute the list of items to be inserted, trees would work well with high probability for any input sequence. In most cases queries must be answered on-line, so randomly permuting the input is impractical. Balanced tree algorithms re-arrange the tree as operations are performed to maintain certain balance conditions and assure good performance.

Skip lists are a probabilistic alternative to balanced trees. Skip lists are balanced by consulting a random number generator. Although skip lists have bad worst-case performance, no input sequence consistently produces the worst-case performance (much like quicksort when the pivot element is chosen randomly). It is very unlikely a skip list data structure will be significantly unbalanced (e.g., for a dictionary of more than 250 elements, the chance that a search will take more than 3 times the expected time is less than one in a million).
Skip lists have balance properties similar to those of search trees built by random insertions, yet do not require insertions to be random.

Note: the sentence saying a skip list is "very unlikely" to be "significantly unbalanced" is a double negative. Read positively, it says that a skip list is almost always close to balanced.

Balancing a data structure probabilistically is easier than explicitly maintaining the balance. For many applications, skip lists are a more natural representation than trees, also leading to simpler algorithms. The simplicity of skip list algorithms makes them easier to implement and provides significant constant factor speed improvements over balanced tree and self-adjusting tree algorithms. Skip lists are also very space efficient. They can easily be configured to require an average of 1 1/3 pointers per element (or even less) and do not require balance or priority information to be stored with each node.

SKIP LISTS

We might need to examine every node of the list when searching a linked list (Figure 1a). If the list is stored in sorted order and every other node of the list also has a pointer to the node two ahead of it in the list (Figure 1b), we have to examine no more than ⌈n/2⌉ + 1 nodes (where n is the length of the list). Also giving every fourth node a pointer four ahead (Figure 1c) requires that no more than ⌈n/4⌉ + 2 nodes be examined. If every (2^i)th node has a pointer 2^i nodes ahead (Figure 1d), the number of nodes that must be examined can be reduced to ⌈log2 n⌉ while only doubling the number of pointers. This data structure could be used for fast searching, but insertion and deletion would be impractical.
Note on the index i: i = 0 gives 2^0 = 1, which is the plain list of Figure 1a; i = 1 gives 2^1 = 2, which is the extra level that Figure 1b adds on top of 1a; i = 2 gives 2^2 = 4, which is the extra level that Figure 1c adds on top of 1b.

A node that has k forward pointers is called a level k node. If every (2^i)th node has a pointer 2^i nodes ahead, then levels of nodes are distributed in a simple pattern: 50% are level 1, 25% are level 2, 12.5% are level 3 and so on. What would happen if the levels of nodes were chosen randomly, but in the same proportions (e.g., as in Figure 1e)? A node's i-th forward pointer, instead of pointing 2^(i-1) nodes ahead, points to the next node of level i or higher. Insertions or deletions would require only local modifications; the level of a node, chosen randomly when the node is inserted, need never change. Some arrangements of levels would give poor execution times, but we will see that such arrangements are rare. Because these data structures are linked lists with extra pointers that skip over intermediate nodes, I named them skip lists.
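The 50% / 25% / 12.5% pattern described above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `deterministic_level` and `level_fractions` are made up for this note and do not come from the paper or from Redis. It assumes nodes are numbered from 1 and that a node gains one level for every factor of two in its position.

```python
from collections import Counter


def deterministic_level(position: int) -> int:
    """Level of the node at 1-based `position` when every 2^i-th node
    has a pointer 2^i nodes ahead: one extra level per factor of two."""
    level = 1
    while position % 2 == 0:
        level += 1
        position //= 2
    return level


def level_fractions(n: int) -> dict[int, float]:
    """Fraction of the n nodes that end up at each level."""
    counts = Counter(deterministic_level(pos) for pos in range(1, n + 1))
    return {level: counts[level] / n for level in sorted(counts)}


fractions = level_fractions(1024)
# fractions[1] is 0.5, fractions[2] is 0.25, fractions[3] is 0.125
```

A randomized skip list keeps these proportions only in expectation, which is what lets it drop the fixed positions.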
Note: once levels are chosen randomly, the distance covered by a level i pointer is no longer a fixed power of two; it varies from node to node.

-----------FIGURE 1 - Linked lists with additional pointers------------
a
h->3->6->7->9->12->17->19->21->25->26->NULL
b
h---->6---->9----->17----->21----->26->NULL
h->3->6->7->9->12->17->19->21->25->26->NULL
c
h---------->9------------->21--------->NULL
h---->6---->9----->17----->21----->26->NULL
h->3->6->7->9->12->17->19->21->25->26->NULL
d
h------------------------->21--------->NULL
h---------->9------------->21--------->NULL
h---->6---->9----->17----->21----->26->NULL
h->3->6->7->9->12->17->19->21->25->26->NULL
e
h---->6------------------------------->NULL
h---->6----------------------->25----->NULL
h---->6---->9----->17--------->25----->NULL
h->3->6->7->9->12->17->19->21->25->26->NULL
-----------FIGURE 1 - Linked lists with additional pointers------------

SKIP LIST ALGORITHMS

This section gives algorithms to search for, insert and delete elements in a dictionary or symbol table. The Search operation returns the contents of the value associated with the desired key or failure if the key is not present. The Insert operation associates a specified key with a new value (inserting the key if it had not already been present). The Delete operation deletes the specified key. It is easy to support additional operations such as "find the minimum key" or "find the next key".

Each element is represented by a node, the level of which is chosen randomly when the node is inserted without regard for the number of elements in the data structure.
A level i node has i forward pointers, indexed 1 through i. We do not need to store the level of a node in the node. Levels are capped at some appropriate constant MaxLevel. The level of a list is the maximum level currently in the list (or 1 if the list is empty). The header of a list has forward pointers at levels one through MaxLevel. The forward pointers of the header at levels higher than the current maximum level of the list point to NIL.

Note: as an example, assume MaxLevel is 5 and the current level of the list is 3. The header's level 4 and level 5 pointers then point to NIL (written NULL in the diagrams):

5->NULL
4->NULL
3--------->13--------->NULL
2----->12----->14----->NULL
1->11->12->13->14->15->NULL

Initialization

An element NIL is allocated and given a key greater than any legal key. All levels of all skip lists are terminated with NIL. A new list is initialized so that the level of the list is equal to 1 and all forward pointers of the list's header point to NIL.

Search Algorithm

We search for an element by traversing forward pointers that do not overshoot the node containing the element being searched for (Figure 2). When no more progress can be made at the current level of forward pointers, the search moves down to the next level. When we can make no more progress at level 1, we must be immediately in front of the node that contains the desired element (if it is in the list).
---------------FIGURE 2 - Skip list search algorithm--------------
Search(list, searchKey)
    x := list→header                            -- start at the header
    -- loop invariant: x→key < searchKey
    for i := list→level downto 1 do             -- walk down through the levels
        while x→forward[i]→key < searchKey do   -- scan along level i; stop once the next key is not smaller than searchKey
            x := x→forward[i]                   -- move to the next node on level i (the distance covered differs per level)
    -- x→key < searchKey ≤ x→forward[1]→key
    x := x→forward[1]                           -- one step on level 1 reaches the candidate node
    if x→key = searchKey then return x→value    -- found: return its value
    else return failure                         -- not present
---------------FIGURE 2 - Skip list search algorithm--------------

Insertion and Deletion Algorithms

To insert or delete a node, we simply search and splice, as shown in Figure 3. Figure 4 gives algorithms for insertion and deletion. A vector update is maintained so that when the search is complete (and we are ready to perform the splice), update[i] contains a pointer to the rightmost node of level i or higher that is to the left of the location of the insertion/deletion. If an insertion generates a node with a level greater than the previous maximum level of the list, we update the maximum level of the list and initialize the appropriate portions of the update vector. After each deletion, we check if we have deleted the maximum element of the list and if so, decrease the maximum level of the list.
Note: "the maximum element" is read here as the node with the highest level in the list, which is the case handled by the final loop of Delete in Figure 4.

------------------FIGURE 3 - Pictorial description of steps involved in performing an insertion----------------------------
search path (nodes marked * are on the path, followed top to bottom and left to right; i marks the position where the new node goes)
h---->6*------------------------------>NULL
h---->6*---------------------->25----->NULL
h---->6*--->9*---------------->25----->NULL
h->3->6->7->9*->12*-i-->19->21->25->26->NULL
update[i]→forward[i]
*************original list, 17 to be inserted***********
h---->6------------------------------->NULL
h---->6----------------------->25----->NULL
h---->6---->9=====>17=========>25----->NULL
h->3->6->7->9->12=>17=>19->21->25->26->NULL
*************list after insertion, updated pointers in grey (drawn here with = in place of grey)*************
------------------FIGURE 3 - Pictorial description of steps involved in performing an insertion----------------------------

-----------------FIGURE 4 - Skip List insertion and deletion algorithms-----------------------------
Insert(list, searchKey, newValue)
    local update[1..MaxLevel]                       -- vector of predecessors, one per level
    x := list→header                                -- start at the header
    for i := list→level downto 1 do                 -- walk down through the levels
        while x→forward[i]→key < searchKey do       -- scan along level i toward searchKey
            x := x→forward[i]                       -- move to the next node on level i
        -- x→key < searchKey ≤ x→forward[i]→key
        update[i] := x                              -- remember the last node visited on level i
    x := x→forward[1]                               -- one step on level 1 reaches the candidate node
    if x→key = searchKey then x→value := newValue   -- key already present: just update the value
    else                                            -- key absent: insert a new node
        lvl := randomLevel()                        -- pick the new node's level at random
        if lvl > list→level then                    -- the new node is taller than the list
            for i := list→level + 1 to lvl do       -- on the new levels the predecessor is the header
                update[i] := list→header
            list→level := lvl                       -- raise the list's level
        x := makeNode(lvl, searchKey, newValue)     -- create the node
        for i := 1 to lvl do                        -- splice it in on levels 1 through lvl
            x→forward[i] := update[i]→forward[i]    -- the new node takes over its predecessor's forward pointer
            update[i]→forward[i] := x               -- the predecessor now points to the new node

Delete(list, searchKey)
    local update[1..MaxLevel]                       -- vector of predecessors, one per level
    x := list→header                                -- start at the header
    for i := list→level downto 1 do                 -- walk down through the levels
        while x→forward[i]→key < searchKey do       -- scan along level i toward searchKey
            x := x→forward[i]                       -- move to the next node on level i
        update[i] := x                              -- remember the last node visited on level i
    x := x→forward[1]                               -- one step on level 1 reaches the candidate node
    if x→key = searchKey then                       -- this is the node to delete
        for i := 1 to list→level do                 -- from level 1 up to the list's level
            if update[i]→forward[i] ≠ x then break  -- no pointer to x on this level: x has no higher levels
            update[i]→forward[i] := x→forward[i]    -- the predecessor skips over x, unlinking it
        free(x)                                     -- release the node
        while list→level > 1 and list→header→forward[list→level] = NIL do   -- the top level is now empty
            list→level := list→level - 1            -- lower the list's level by one
-----------------FIGURE 4 - Skip List insertion and deletion algorithms-----------------------------

Note: the printed figure ends Insert with makeNode(lvl, searchKey, value) and "for i := 1 to level do". Neither value nor level is defined inside Insert, so they are written above as newValue and lvl.

Choosing a Random Level

Initially, we discussed a probability distribution where half of the nodes that have level i pointers also have level i+1 pointers. To get away from magic constants, we say that a fraction p of the nodes with level i pointers also have level i+1 pointers (for our original discussion, p = 1/2). Levels are generated randomly by an algorithm equivalent to the one in Figure 5. Levels are generated without reference to the number of elements in the list.

Note: p = 1/2 reproduces the proportions of the original 2^i layout.

-----------------------FIGURE 5 - Algorithm to calculate a random level---------------------------------------
randomLevel()
    lvl := 1                                    -- every node has at least one level
    -- random() that returns a random value in [0...1), i.e. 0 is possible and 1 is not
    while random() < p and lvl < MaxLevel do    -- keep promoting with probability p, up to MaxLevel
        lvl := lvl + 1
    return lvl
-----------------------FIGURE 5 - Algorithm to calculate a random level---------------------------------------

At what level do we start a search? Defining L(n)

In a skip list of 16 elements generated with p = 1/2, we might happen to have 9 elements of level 1, 3 elements of level 2, 3 elements of level 3 and 1 element of level 14 (this would be very unlikely, but it could happen). How should we handle this?
If we use the standard algorithm and start our search at level 14, we will do a lot of useless work.

Where should we start the search? Our analysis suggests that ideally we would start a search at the level L where we expect 1/p nodes. This happens when L = log_{1/p} n (the logarithm of n to base 1/p). Since we will be referring frequently to this formula, we will use L(n) to denote log_{1/p} n.
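The definition of L(n) can be checked numerically. The sketch below is an illustration only; `start_level` and `expected_nodes_at_or_above` are hypothetical helper names, not functions from the paper or from Redis. It relies on the fact that a node reaches level k or higher with probability p^(k-1), ignoring the MaxLevel cap, so a list of n nodes is expected to have n * p^(k-1) such nodes.

```python
import math


def start_level(n: int, p: float) -> float:
    """L(n): the logarithm of n to base 1/p."""
    return math.log(n) / math.log(1 / p)


def expected_nodes_at_or_above(level: float, n: int, p: float) -> float:
    """Expected number of nodes whose level is at least `level`,
    assuming each promotion succeeds independently with probability p."""
    return n * p ** (level - 1)


# For the 16-element example with p = 1/2, L(16) works out to 4, and
# about 1/p = 2 nodes are expected to reach level 4 or higher.
L = start_level(16, 0.5)
nodes_at_L = expected_nodes_at_or_above(L, 16, 0.5)
```

Starting the search at level 14 in that example would therefore mean descending about ten levels that are almost certainly empty.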
Note: the literal wording of the definition of L(n) is hard to follow, so here is our own reading. In the normal case a search starts at the theoretical top level and works downward. How high is that top level? The top level usually holds about one node. If the top level is k, the promotion probability is p and the list has n nodes, then n * p^k = 1, which gives k = log_{1/p} n. This is the L of the paper.

There are a number of solutions to the problem of deciding how to handle the case where there is an element with an unusually large level in the list.

• Don't worry, be happy. Simply start a search at the highest level present in the list. As we will see in our analysis, the probability that the maximum level in a list of n elements is significantly larger than L(n) is very small. Starting a search at the maximum level in the list does not add more than a small constant to the expected search time. This is the approach used in the algorithms described in this paper.

• Use less than you are given. Although an element may contain room for 14 pointers, we don't need to use all 14. We can choose to utilize only L(n) levels. There are a number of ways to implement this, but they all complicate the algorithms and do not noticeably improve performance, so this approach is not recommended.

• Fix the dice. If we generate a random level that is more than one greater than the current maximum level in the list, we simply use one plus the current maximum level in the list as the level of the new node. In practice and intuitively, this change seems to work well. However, it totally destroys our ability to analyze the resulting algorithms, since the level of a node is no longer completely random. Programmers should probably feel free to implement this, purists should avoid it.
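The "fix the dice" option in the list above amounts to clamping the random level. A minimal sketch follows, assuming a `random_level` helper that mirrors Figure 5. The function names and the explicit `rng` parameter are choices made for this note, not part of the paper.

```python
import random


def random_level(p: float, max_level: int, rng: random.Random) -> int:
    """Figure 5: promote with probability p, never beyond max_level."""
    level = 1
    while rng.random() < p and level < max_level:
        level += 1
    return level


def fixed_dice_level(p: float, max_level: int, current_max: int,
                     rng: random.Random) -> int:
    """'Fix the dice': never exceed the list's current level by more than one."""
    return min(random_level(p, max_level, rng), current_max + 1)


rng = random.Random(42)
# With current_max = 3 the result always lies between 1 and 4.
level = fixed_dice_level(0.5, 16, 3, rng)
```

As the paper warns, levels produced this way depend on the state of the list, so the probabilistic analysis no longer applies to them.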
Determining MaxLevel

Since we can safely cap levels at L(n), we should choose MaxLevel = L(N) (where N is an upper bound on the number of elements in a skip list). If p = 1/2, using MaxLevel = 16 is appropriate for data structures containing up to 2^16 elements.
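Putting Figures 2, 4 and 5 together, the algorithms can be sketched in Python as follows. This is a reading aid under stated assumptions, not the paper's code and not Redis's zskiplist. The class names `Node` and `SkipList` and the `seed` parameter are invented here. `None` stands in for the NIL sentinel, so each comparison checks for `None` first. Lists are 0-based, so `forward[0]` is the paper's level 1 pointer. `search` returns `None` where the paper returns failure.

```python
import random


class Node:
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * level  # forward[i] is the level i+1 pointer


class SkipList:
    def __init__(self, p=0.5, max_level=16, seed=None):
        self.p = p
        self.max_level = max_level
        self.level = 1                              # level of an empty list
        self.header = Node(None, None, max_level)   # header spans every level
        self._rng = random.Random(seed)

    def _random_level(self):
        # Figure 5: promote with probability p, capped at max_level.
        level = 1
        while self._rng.random() < self.p and level < self.max_level:
            level += 1
        return level

    def _predecessors(self, key):
        # update[i] is the rightmost node on level i+1 whose key is < key.
        # Levels above self.level keep the header, as Insert requires.
        update = [self.header] * self.max_level
        x = self.header
        for i in range(self.level - 1, -1, -1):
            while x.forward[i] is not None and x.forward[i].key < key:
                x = x.forward[i]
            update[i] = x
        return update

    def search(self, key):
        x = self._predecessors(key)[0].forward[0]
        return x.value if x is not None and x.key == key else None

    def insert(self, key, value):
        update = self._predecessors(key)
        x = update[0].forward[0]
        if x is not None and x.key == key:
            x.value = value                  # key present: overwrite the value
            return
        lvl = self._random_level()
        self.level = max(self.level, lvl)    # raise the list's level if needed
        node = Node(key, value, lvl)
        for i in range(lvl):                 # splice in on each of its levels
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def delete(self, key):
        update = self._predecessors(key)
        x = update[0].forward[0]
        if x is None or x.key != key:
            return False
        for i in range(self.level):
            if update[i].forward[i] is not x:
                break                        # x has no pointer on this level
            update[i].forward[i] = x.forward[i]
        while self.level > 1 and self.header.forward[self.level - 1] is None:
            self.level -= 1                  # drop top levels that became empty
        return True


# Usage with the keys of Figure 1, inserted out of order.
sl = SkipList(seed=1)
for k in [21, 3, 26, 9, 6, 25, 7, 19, 12, 17]:
    sl.insert(k, str(k))
```

Using `None` instead of a NIL node with a maximal key costs one extra check per comparison but avoids having to define a key greater than any legal key.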