Redis 6.0.5 dict reading notes 5: returning a random element and a group of random elements from a dict

These functions exist mainly for sampling, where approximate statistics are good enough.
******************************************************************
/* Return a random entry from the hash table. Useful to
 * implement randomized algorithms */
This function returns one randomly chosen entry from the hash table; it is a building block for randomized algorithms.
dictEntry *dictGetRandomKey(dict *d)
{
    dictEntry *he, *orighe;
    unsigned long h;
    int listlen, listele;

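    if (dictSize(d) == 0) return NULL;  // the dict is empty, so there is nothing to return
    if (dictIsRehashing(d)) _dictRehashStep(d); // if incremental rehashing is in progress, do one step (at most one bucket)
    if (dictIsRehashing(d)) {  // still rehashing: pick a random bucket across both tables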
        do {
            /* We are sure there are no elements in indexes from 0
             * to rehashidx-1 */
             // Because of rehashing, buckets 0 to rehashidx-1 of ht[0] have already been emptied.
            h = d->rehashidx + (random() % (d->ht[0].size +
                                            d->ht[1].size -
                                            d->rehashidx));
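random() returns a long int. Taking it modulo the number of buckets that can still hold elements (the unmigrated part of ht[0] plus all of ht[1]) and then adding d->rehashidx gives an index h in the range [rehashidx, ht[0].size + ht[1].size).
The pick is therefore spread over the unmigrated buckets of table 0 and every bucket of table 1.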
            he = (h >= d->ht[0].size) ? d->ht[1].table[h - d->ht[0].size] :
                                      d->ht[0].table[h];
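If h is greater than or equal to ht[0].size, the bucket is taken from table 1 at index h - ht[0].size; otherwise it is taken from table 0 at index h.
If the chosen bucket is empty, the loop tries again until it lands on a non-empty bucket.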
        } while(he == NULL);
    } else {  // not rehashing: only table 0 needs to be consulted
        do {
            h = random() & d->ht[0].sizemask;  // pick a random bucket index
            he = d->ht[0].table[h]; // first entry of that bucket
        } while(he == NULL); // empty bucket: draw again until a non-empty one is found
    }

    /* Now we found a non empty bucket, but it is a linked
     * list and we need to get a random element from the list.
     * The only sane way to do so is counting the elements and
     * select a random index. */
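At this point we hold a non-empty bucket, but the bucket is a linked list, so a random element still has to be chosen from it.
The only sensible way is to count the elements in the list and then pick a random position.
    listlen = 0; // number of entries in the chain
    orighe = he; // remember the head for the second pass
    while(he) {  // first pass: walk the chain and count the entries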
        he = he->next;
        listlen++;
    }
    listele = random() % listlen; // random position inside the chain
    
    he = orighe;  // back to the head of the chain
    while(listele--) he = he->next; // second pass: advance to the chosen position
    return he;
}
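
The index arithmetic used while rehashing can be looked at in isolation. The following is only a minimal sketch: `bucket_ref` and `map_combined_index` are hypothetical names that do not exist in dict.c, and the random draw is replaced by a caller-supplied value `r` so the mapping is easy to check by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type, not part of dict.c: identifies one bucket. */
typedef struct {
    int table;      /* 0 for ht[0], 1 for ht[1] */
    size_t index;   /* bucket index inside that table */
} bucket_ref;

/* Map r in [0, size0 + size1 - rehashidx) onto a bucket, skipping
 * buckets 0..rehashidx-1 of table 0, which rehashing already emptied. */
static bucket_ref map_combined_index(size_t r, size_t size0, size_t size1,
                                     size_t rehashidx)
{
    assert(rehashidx <= size0);
    assert(r < size0 + size1 - rehashidx);

    size_t h = rehashidx + r;   /* same shift as in dictGetRandomKey() */
    bucket_ref ref;
    if (h >= size0) {           /* past the end of table 0: use table 1 */
        ref.table = 1;
        ref.index = h - size0;
    } else {                    /* still inside the unmigrated part of table 0 */
        ref.table = 0;
        ref.index = h;
    }
    return ref;
}
```

For example, with size0 = 4, size1 = 8 and rehashidx = 2, the value r = 0 maps to bucket 2 of table 0, while r = 2 maps to bucket 0 of table 1.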
******************************************************************
/* This function samples the dictionary to return a few keys from random
 * locations.
This function returns a few entries taken from random positions in the dict.
 * It does not guarantee to return all the keys specified in 'count', nor
 * it does guarantee to return non-duplicated elements, however it will make
 * some effort to do both things.
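It does not guarantee to return exactly the requested number of entries (it may return fewer), nor that the returned entries are distinct (one entry may appear more than once).
It does, however, try to achieve both: returning exactly 'count' entries, and returning distinct ones.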
 * Returned pointers to hash table entries are stored into 'des' that
 * points to an array of dictEntry pointers. The array must have room for
 * at least 'count' elements, that is the argument we pass to the function
 * to tell how many random elements we need.
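The returned pointers to hash table entries are stored in 'des', which points to an array of dictEntry pointers.
The array must have room for at least 'count' entries; 'count' is the argument that tells the function how many random entries we want.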
 * The function returns the number of items stored into 'des', that may
 * be less than 'count' if the hash table has less than 'count' elements
 * inside, or if not enough elements were found in a reasonable amount of
 * steps.
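The return value is the number of entries stored in 'des'. It may be smaller than 'count', either because the hash table holds fewer than 'count' entries,
or because not enough entries were found within the step limit (the limit saves time and avoids blocking other operations).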
 * Note that this function is not suitable when you need a good distribution
 * of the returned items, but only when you need to "sample" a given number
 * of continuous elements to run some kind of algorithm or to produce
 * statistics. However the function is much faster than dictGetRandomKey()
 * at producing N elements. */
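Note that this function is not suitable when a well-distributed result is needed. It only fits cases where a given number of consecutive entries are sampled to run some algorithm or to produce statistics.
It is, however, much faster than calling dictGetRandomKey() N times.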
unsigned int dictGetSomeKeys(dict *d, dictEntry **des, unsigned int count) {
    unsigned long j; /* internal hash table id, 0 or 1. */
    unsigned long tables; /* 1 or 2 tables? */
    unsigned long stored = 0, maxsizemask;
    unsigned long maxsteps;

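    if (dictSize(d) < count) count = dictSize(d);
    // if the dict holds fewer entries than requested, lower the target to the dict size

    maxsteps = count*10;  // give up after 10 times as many steps as the number of entries wanted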

    /* Try to do a rehashing work proportional to 'count'. */
    Run up to 'count' incremental rehash steps, stopping early once rehashing has finished.
    for (j = 0; j < count; j++) {
        if (dictIsRehashing(d))
            _dictRehashStep(d);
        else
            break;
    }

    tables = dictIsRehashing(d) ? 2 : 1; // while rehashing, both tables have to be scanned
    
    maxsizemask = d->ht[0].sizemask;
    if (tables > 1 && maxsizemask < d->ht[1].sizemask)
        maxsizemask = d->ht[1].sizemask;
    // the mask used for the AND: ht[0]'s mask with one table, the larger of the two masks with two tables

    /* Pick a random point inside the larger table. */
    Choose a random starting index inside the larger table.
    unsigned long i = random() & maxsizemask;
    unsigned long emptylen = 0; /* Continuous empty entries so far. */
    emptylen counts the consecutive empty buckets seen so far during the scan.
    while(stored < count && maxsteps--) {
        for (j = 0; j < tables; j++) {
            /* Invariant of the dict.c rehashing: up to the indexes already
             * visited in ht[0] during the rehashing, there are no populated
             * buckets, so we can skip ht[0] for indexes between 0 and idx-1. */
Rehashing invariant: buckets 0 to rehashidx-1 of table 0 are already empty, so they can be skipped and the scan of table 0 can start at bucket rehashidx.
            if (tables == 2 && j == 0 && i < (unsigned long) d->rehashidx) {
tables == 2 means rehashing is in progress; j == 0 means the table being examined is the first one, table 0.
                /* Moreover, if we are currently out of range in the second
                 * table, there will be no elements in both tables up to
                 * the current rehashing index, so we jump if possible.
                 * (this happens when going from big to small table). */
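Going further: under the condition above, if i is also beyond the end of the second table, neither table can hold entries at this index.
So i is moved to d->rehashidx, where table 0 may still hold entries. This happens when rehashing from a large table to a small one.
                if (i >= d->ht[1].size) 
                // both tables are empty at this index, so reset i to the first unmigrated bucket
                    i = d->rehashidx;
                else   // table 0 has nothing at this index, so skip it and look in table 1
                    continue;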
                    continue;
            }
            if (i >= d->ht[j].size) continue; /* Out of range for this table. */
            // i is beyond the size of this table, so skip it: nothing can be found here

            dictEntry *he = d->ht[j].table[i]; // the bucket that may hold entries

            /* Count contiguous empty buckets, and jump to other
             * locations if they reach 'count' (with a minimum of 5). */
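             // a run of empty buckets that is at least 5 long and also longer than 'count' triggers a jump to a new random position
            if (he == NULL) {
                emptylen++; // this bucket is empty: extend the current run
                if (emptylen >= 5 && emptylen > count) { // the run is at least 5 long and exceeds 'count'
                    i = random() & maxsizemask; // jump to a new random index
                    emptylen = 0; // reset the consecutive-empty counter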
                }
            } else {
                emptylen = 0; // a non-empty bucket ends the run, so reset the counter
                while (he) {
                    /* Collect all the elements of the buckets found non
                     * empty while iterating. */
                     Store every entry of this bucket's chain into the output array.
                    *des = he;
                    des++;
                    he = he->next;
                    stored++;
                    if (stored == count) return stored;  // enough entries collected: return immediately
                }
            }
        }
        i = (i+1) & maxsizemask; // move on to the next consecutive bucket, wrapping around
    }
    return stored; // step budget exhausted: return whatever was collected
}
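
To make the scanning pattern easier to follow, here is a minimal sketch of the core loop only. It is deliberately simplified and is not the Redis implementation: it assumes a single table (no rehashing), takes the starting index from the caller instead of calling random(), and leaves out the random re-jump after a long run of empty buckets. The names `node` and `collect_from` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chained-bucket entry, standing in for dictEntry. */
typedef struct node {
    int key;
    struct node *next;
} node;

/* Collect up to `count` entries by walking consecutive buckets from `start`.
 * Simplifications compared with dictGetSomeKeys(): one table only, no
 * rehashing, no random re-jump after a run of empty buckets. */
static size_t collect_from(node **table, size_t size, size_t start,
                           node **out, size_t count)
{
    size_t mask = size - 1;
    assert(size != 0 && (size & mask) == 0);   /* size must be a power of two */

    size_t stored = 0;
    size_t maxsteps = count * 10;              /* same step budget as the original */
    size_t i = start & mask;
    while (stored < count && maxsteps--) {
        /* Take every entry of a non-empty bucket, in chain order. */
        for (node *n = table[i]; n != NULL; n = n->next) {
            out[stored++] = n;
            if (stored == count) return stored;
        }
        i = (i + 1) & mask;                    /* next bucket, wrapping around */
    }
    return stored;
}
```

Because whole chains from neighbouring buckets are copied in order, the result is a contiguous slice of the table rather than an independent draw per entry. The sketch also does not clamp `count`; as in the original, the caller should never ask for more entries than the table holds, otherwise the wrap-around would store duplicates.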
******************************************************************
/* This is like dictGetRandomKey() from the POV of the API, but will do more
 * work to ensure a better distribution of the returned element.
From the API point of view this function looks just like dictGetRandomKey(), but it does more work so that the returned element has a better distribution.
 * This function improves the distribution because the dictGetRandomKey()
 * problem is that it selects a random bucket, then it selects a random
 * element from the chain in the bucket. However elements being in different
 * chain lengths will have different probabilities of being reported. With
 * this function instead what we do is to consider a "linear" range of the table
 * that may be constituted of N buckets with chains of different lengths
 * appearing one after the other. Then we report a random element in the range.
 * In this way we smooth away the problem of different chain lengths. */
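This function improves the distribution because dictGetRandomKey() first picks a random bucket and then picks a random element from that bucket's chain, so elements that sit in chains of different lengths are returned with different probabilities.
Instead, this function takes a "linear" range of the table: N consecutive buckets whose chains, of varying lengths, are laid out one after another as a single sequence.
A random element is then picked from that sequence, which smooths out the effect of the different chain lengths within the sampled range.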
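
The size of the bias can be worked out exactly for a small table. The sketch below is illustrative only and assumes the non-rehashing case; `bucket_then_chain_prob` and `uniform_in_range_prob` are hypothetical helper names, and the table is described just by an array of chain lengths.

```c
#include <assert.h>
#include <stddef.h>

/* Probability that "random non-empty bucket, then random chain element"
 * (the dictGetRandomKey() strategy) returns one specific element that
 * lives in bucket `b`. */
static double bucket_then_chain_prob(const size_t *chain_len, size_t nbuckets,
                                     size_t b)
{
    assert(b < nbuckets && chain_len[b] > 0);
    size_t nonempty = 0;
    for (size_t i = 0; i < nbuckets; i++)
        if (chain_len[i] > 0) nonempty++;
    /* 1/nonempty to hit the bucket, then 1/chain_len[b] inside the chain. */
    return 1.0 / ((double)nonempty * (double)chain_len[b]);
}

/* Probability of one specific element when picking uniformly among all
 * elements of the range, as done over the sampled entries. */
static double uniform_in_range_prob(const size_t *chain_len, size_t nbuckets)
{
    size_t total = 0;
    for (size_t i = 0; i < nbuckets; i++)
        total += chain_len[i];
    assert(total > 0);
    return 1.0 / (double)total;
}
```

With chain lengths {1, 3, 0, 0}, the lone element in bucket 0 is returned with probability 1/2 under the first strategy, while each of the three elements in bucket 1 is returned with probability 1/6. Picking uniformly over the four elements of the range gives each of them 1/4.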
#define GETFAIR_NUM_ENTRIES 15
dictEntry *dictGetFairRandomKey(dict *d) {
    dictEntry *entries[GETFAIR_NUM_ENTRIES];  // array with room for 15 entry pointers
    unsigned int count = dictGetSomeKeys(d,entries,GETFAIR_NUM_ENTRIES); // sample up to 15 entries from consecutive buckets
    /* Note that dictGetSomeKeys() may return zero elements in an unlucky
     * run() even if there are actually elements inside the hash table. So
     * when we get zero, we call the true dictGetRandomKey() that will always
     * yield the element if the hash table has at least one. */
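Note that in a very unlucky run dictGetSomeKeys() may return 0 entries even though the dict does contain elements.
So when we get 0, we fall back to dictGetRandomKey(), which always returns an element as long as the dict holds at least one.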
    if (count == 0) return dictGetRandomKey(d);
    unsigned int idx = rand() % count; // pick one of the sampled entries at random
    return entries[idx];
}
******************************************************************
posted on 2020-08-17 17:16 by 子虚乌有