The Algorithm Behind the Redis ZDIFF Command

Introduction

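Redis 6.2.0 introduced the ZDIFF and ZDIFFSTORE commands, both of which compute a set difference.

ZDIFF returns its result directly to the Redis client.
ZDIFFSTORE stores the result, encoded as a Redis ZSet, in the key given as its first argument, and returns an integer: the number of elements in the result.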

For example, the Redis command ZDIFF 3 zset1 zset2 zset3 is equivalent to (zset1 - zset2) - zset3
  where - denotes set difference
  and each zset is of type Redis Set or Redis Zset

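Purpose: this article explains how Redis implements the ZDIFF command, focusing on the data structures involved. It follows t_zset.c from the Redis 6.2 source. It can save you the time of digging through the source yourself, and it can serve as a reference for implementing a ZDIFF-like command by hand.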

Notation

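set1: the data structure behind the second argument of ZDIFF (the first key, right after numkeys)
set1 does not mean the key in the Redis database; it means *(key->ptr), the actual object
set2, set3, ...: the data structures of the other sets ZDIFF operates on, i.e. everything except set1
dstzset: the result, of type Redis Zset
N: the size of set1
M: the total number of sets taking part in the operation
K: the number of elements in dstzset
L: the size of set1 plus the size of set2 plus the size of set3 plus ...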

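The last few symbols (N, M, K, L) are the ones the source code itself uses.
For convenience, everything below uses average-case time complexity, so a skiplist lookup costs \(O(\lg N)\).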

Source overview

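The call relationships are shown below, with the parts unrelated to this topic omitted.
Tracing the source from the command's entry point shows that the zdiff function drives the algorithm.
The sections below cover zdiffAlgorithm1 and zdiffAlgorithm2 first, then zsetChooseDiffAlgorithm.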

/*
 * Copyright (c) 2009-2012, Salvatore Sanfilippo <antirez at gmail dot com>
 * Copyright (c) 2009-2012, Pieter Noordhuis <pcnoordhuis at gmail dot com>
 * All rights reserved.
*/
static void zdiff(zsetopsrc *src, long setnum, zset *dstzset, size_t *maxelelen) {
    /* Skip everything if the smallest input is empty. */
    if (zuiLength(&src[0]) > 0) {
        int diff_algo = zsetChooseDiffAlgorithm(src, setnum);
        if (diff_algo == 1) {
            zdiffAlgorithm1(src, setnum, dstzset, maxelelen);
        } else if (diff_algo == 2) {
            zdiffAlgorithm2(src, setnum, dstzset, maxelelen);
        } else if (diff_algo != 0) {
            serverPanic("Unknown algorithm");
        }
    }
}
/*...*/
void zunionInterDiffGenericCommand(client *c, robj *dstkey, int numkeysIndex, int op) {
    // ...
    else if (op == SET_OP_DIFF) {
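        // src holds every set taking part in the operation; its type is `zsetopsrc[]`
        // dstzset is a Redis zset that receives the result
        // maxelelen is only used later to decide whether the zset should be converted to a ziplist;
        // think of it as an extra return value, since C has no reference types and cannot return multiple values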
        zdiff(src, setnum, dstzset, &maxelelen);
    }
    // ...
            zsetConvertToZiplistIfNeeded(dstobj, maxelelen);
    // ...

}
/*...*/
void zdiffCommand(client *c) {
    zunionInterDiffGenericCommand(c, NULL, 1, SET_OP_DIFF);
}
/*...*/

zdiffAlgorithm1

Outline

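Enumerate every element of set1 and, for each one, check whether it occurs in set2, set3, ...; only if it occurs in none of them is it added to dstzset.
A membership test goes through zuiFind. For the HASHTABLE and SKIPLIST encodings this is a dict lookup, \(O(1)\) on average, which this article bounds loosely by \(O(\lg N)\). Inserting into dstzset costs \(O(\lg N)\) on average.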
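
The outline above can be sketched in a few lines of C. This is a minimal illustration and not the Redis code: it assumes sets are plain arrays of distinct ints without scores, and the names int_set, contains and diff_algo1 are hypothetical. The membership test here is a linear scan, whereas Redis uses zuiFind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Redis set: an array of distinct ints. */
typedef struct {
    const int *items;
    size_t len;
} int_set;

/* Linear membership test (assumption: Redis uses zuiFind instead). */
static int contains(const int_set *s, int x) {
    for (size_t i = 0; i < s->len; i++) {
        if (s->items[i] == x) return 1;
    }
    return 0;
}

/* Writes sets[0] - sets[1] - ... - sets[setnum-1] into out, which must
 * have room for sets[0].len ints. Returns the number of ints written. */
static size_t diff_algo1(const int_set *sets, size_t setnum, int *out) {
    assert(setnum >= 1);
    size_t k = 0;
    for (size_t i = 0; i < sets[0].len; i++) {
        int x = sets[0].items[i];
        int exists = 0;
        for (size_t j = 1; j < setnum; j++) {
            if (contains(&sets[j], x)) {
                exists = 1;
                break; /* found once: no need to look at later sets */
            }
        }
        if (!exists) out[k++] = x;
    }
    return k;
}
```

With sets {1,2,3,4,5}, {2,4} and {5,9}, diff_algo1 keeps 1 and 3 in that order. The early break is what later gives algorithm 1 its advantage when the sets share many elements.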

Implementation details

The underlying encoding of set1 may be INTSET, HASHTABLE, ZIPLIST, or OBJ_ENCODING_SKIPLIST (skiplist + hashtable),
so a generic interface is needed to iterate over it.

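In the code below, the functions zuiInitIterator and zuiNext are used to traverse set1.
When set1, i.e. &src[0] in the code, uses OBJ_ENCODING_SKIPLIST, zuiInitIterator roughly works by storing a pointer inside src[0] that points to src[0].subject->zsl->tail (the tail node of the zskiplist inside the skiplist-encoded zset). This pointer serves as the iterator. The drawback is that two traversals of the same set cannot run at the same time, because there is only one iterator, so the code also has to check whether set1 is the same object as set2, set3, ...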
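
The idea of one iteration interface over several encodings can be sketched as a tagged union. This is a simplified illustration, not the Redis zsetopsrc layout: the two encodings (ENC_ARRAY, ENC_LIST) and the names set_src, ui_init, ui_next and ui_sum are hypothetical. As in the description above, the cursor is stored inside the source itself, so a second traversal of the same source would overwrite the first one's position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encodings standing in for INTSET, ZIPLIST, etc. */
typedef enum { ENC_ARRAY, ENC_LIST } encoding;

typedef struct list_node {
    int value;
    struct list_node *next;
} list_node;

/* One source set plus its single embedded cursor. */
typedef struct {
    encoding enc;
    union {
        struct { const int *items; size_t len; size_t pos; } arr;
        struct { list_node *head; list_node *cur; } list;
    } u;
} set_src;

/* Reset the embedded cursor to the first element. */
static void ui_init(set_src *s) {
    if (s->enc == ENC_ARRAY) s->u.arr.pos = 0;
    else s->u.list.cur = s->u.list.head;
}

/* Store the next element in *out and return 1, or return 0 when done. */
static int ui_next(set_src *s, int *out) {
    if (s->enc == ENC_ARRAY) {
        if (s->u.arr.pos >= s->u.arr.len) return 0;
        *out = s->u.arr.items[s->u.arr.pos++];
        return 1;
    }
    if (s->u.list.cur == NULL) return 0;
    *out = s->u.list.cur->value;
    s->u.list.cur = s->u.list.cur->next;
    return 1;
}

/* Caller code is identical for every encoding. */
static long ui_sum(set_src *s) {
    long total = 0;
    int v;
    assert(s != NULL);
    ui_init(s);
    while (ui_next(s, &v)) total += v;
    return total;
}
```

The caller (ui_sum here) never branches on the encoding; only ui_init and ui_next do, which is the role zuiInitIterator and zuiNext play for the real encodings.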

Reading the source

/*
 * Copyright (c) 2009-2012, Salvatore Sanfilippo <antirez at gmail dot com>
 * Copyright (c) 2009-2012, Pieter Noordhuis <pcnoordhuis at gmail dot com>
 * All rights reserved.
*/
static void zdiffAlgorithm1(zsetopsrc *src, long setnum, zset *dstzset, size_t *maxelelen) {
    /* DIFF Algorithm 1:
     *
     * We perform the diff by iterating all the elements of the first set,
     * and only adding it to the target set if the element does not exist
     * into all the other sets.
     *
     * This way we perform at max N*M operations, where N is the size of
     * the first set, and M the number of sets.
     *
     * There is also a O(K*log(K)) cost for adding the resulting elements
     * to the target set, where K is the final size of the target set.
     *
     * The final complexity of this algorithm is O(N*M + K*log(K)). */
    int j;
    zsetopval zval;
    zskiplistNode *znode;
    sds tmp;

    /* With algorithm 1 it is better to order the sets to subtract
     * by decreasing size, so that we are more likely to find
     * duplicated elements ASAP. */
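    // Note: one detail of this algorithm is that set2, set3, ... are sorted by size in descending order, because an element is more likely to be found in a larger set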
    qsort(src+1,setnum-1,sizeof(zsetopsrc),zuiCompareByRevCardinality);

    memset(&zval, 0, sizeof(zval));
    zuiInitIterator(&src[0]);
    while (zuiNext(&src[0],&zval)) {
        double value;
        int exists = 0;

        for (j = 1; j < setnum; j++) {
            /* It is not safe to access the zset we are
             * iterating, so explicitly check for equal object.
             * This check isn't really needed anymore since we already
             * check for a duplicate set in the zsetChooseDiffAlgorithm
             * function, but we're leaving it for future-proofing. */
            if (src[j].subject == src[0].subject ||
                zuiFind(&src[j],&zval,&value)) {
                exists = 1;
                break;
            }
        }

        if (!exists) {
            tmp = zuiNewSdsFromValue(&zval);
            znode = zslInsert(dstzset->zsl,zval.score,tmp);
            dictAdd(dstzset->dict,tmp,&znode->score);
            if (sdslen(tmp) > *maxelelen) *maxelelen = sdslen(tmp);
        }
    }
    zuiClearIterator(&src[0]);
}

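Time complexity

\(O(NM\lg L)\) as a loose upper bound; the source comment gives the tighter \(O(NM + K\lg K)\), counting each membership test as one operation.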

zdiffAlgorithm2

Outline

Copy set1 into dstzset, then iterate over all elements of set2, set3, ... in turn, \(L-N\) elements in total. Whenever such an element is present in dstzset, look it up there and delete it.
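
The copy-then-remove idea can be sketched as follows. This is an illustration under simplifying assumptions, not the Redis code: sets are arrays of distinct ints without scores, removal is a linear scan with swap-with-last (Redis removes from a dict and a skiplist), and the names int_set, remove_value and diff_algo2 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Redis set: an array of distinct ints. */
typedef struct {
    const int *items;
    size_t len;
} int_set;

/* Remove x from buf[0..*n) if present by overwriting it with the last
 * element. Returns 1 if something was removed, 0 otherwise. */
static int remove_value(int *buf, size_t *n, int x) {
    for (size_t i = 0; i < *n; i++) {
        if (buf[i] == x) {
            buf[i] = buf[*n - 1];
            (*n)--;
            return 1;
        }
    }
    return 0;
}

/* Copies sets[0] into out (room for sets[0].len ints), then removes every
 * element found in the other sets. Returns the remaining element count.
 * The order of the surviving elements is not preserved. */
static size_t diff_algo2(const int_set *sets, size_t setnum, int *out) {
    assert(setnum >= 1);
    size_t cardinality = 0;
    for (size_t i = 0; i < sets[0].len; i++) {
        out[cardinality++] = sets[0].items[i];
    }
    /* Stop as soon as the result is empty: further removals do nothing. */
    for (size_t j = 1; j < setnum && cardinality > 0; j++) {
        for (size_t i = 0; i < sets[j].len && cardinality > 0; i++) {
            remove_value(out, &cardinality, sets[j].items[i]);
        }
    }
    return cardinality;
}
```

Unlike the first algorithm, every element of the later sets is visited once, so the work grows with the total size of all sets rather than with the size of set1 times the number of sets.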

Reading the source

/*
 * Copyright (c) 2009-2012, Salvatore Sanfilippo <antirez at gmail dot com>
 * Copyright (c) 2009-2012, Pieter Noordhuis <pcnoordhuis at gmail dot com>
 * All rights reserved.
*/
static void zdiffAlgorithm2(zsetopsrc *src, long setnum, zset *dstzset, size_t *maxelelen) {
    /* DIFF Algorithm 2:
     *
     * Add all the elements of the first set to the auxiliary set.
     * Then remove all the elements of all the next sets from it.
     *

     * This is O(L + (N-K)log(N)) where L is the sum of all the elements in every
     * set, N is the size of the first set, and K is the size of the result set.
     *
     * Note that from the (L-N) dict searches, (N-K) got to the zsetRemoveFromSkiplist
     * which costs log(N)
     *
     * There is also a O(K) cost at the end for finding the largest element
     * size, but this doesn't change the algorithm complexity since K < L, and
     * O(2L) is the same as O(L). */
    int j;
    int cardinality = 0;
    zsetopval zval;
    zskiplistNode *znode;
    sds tmp;

    for (j = 0; j < setnum; j++) {
        if (zuiLength(&src[j]) == 0) continue;

        memset(&zval, 0, sizeof(zval));
        zuiInitIterator(&src[j]);
        while (zuiNext(&src[j],&zval)) {
            if (j == 0) {
                tmp = zuiNewSdsFromValue(&zval);
                znode = zslInsert(dstzset->zsl,zval.score,tmp);
                dictAdd(dstzset->dict,tmp,&znode->score);
                cardinality++;
            } else {
                tmp = zuiSdsFromValue(&zval);
                if (zsetRemoveFromSkiplist(dstzset, tmp)) {
                    cardinality--;
                }
            }

            /* Exit if result set is empty as any additional removal
                * of elements will have no effect. */
            if (cardinality == 0) break;
        }
        zuiClearIterator(&src[j]);

        if (cardinality == 0) break;
    }

    /* Redize dict if needed after removing multiple elements */
    if (htNeedsResize(dstzset->dict)) dictResize(dstzset->dict);

    /* Using this algorithm, we can't calculate the max element as we go,
     * we have to iterate through all elements to find the max one after. */
    *maxelelen = zsetDictGetMaxElementLength(dstzset->dict);
}

Time complexity

\(O(L\lg N)\) as a loose upper bound; the source comment gives the tighter \(O(L + (N-K)\lg N)\).

zsetChooseDiffAlgorithm

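The dominant cost of zdiffAlgorithm1 and zdiffAlgorithm2 shows up in \(MN\) and \(L\) respectively, so comparing these two values is enough to estimate which algorithm is faster. Intuitively, if set1 has fewer elements than the average size of set2, set3, ..., choose zdiffAlgorithm1; otherwise choose zdiffAlgorithm2.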

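In addition, in zdiffAlgorithm1, as soon as an element of set1 is found in set2 we know it will not be added to dstzset, so there is no need to check set3, set4, ...
For this reason the source author treats zdiffAlgorithm1 as having the smaller constant factor, and compares \(MN/2\) against \(L\) instead.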

One more optimization: if set1 itself also appears among set2, set3, ..., we can conclude immediately that dstzset is the empty set.
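
The comparison of \(MN/2\) against \(L\) can be tried out on concrete sizes with a small sketch. It mirrors the arithmetic described above but works on an array of set sizes only; the name choose_diff_algo and the sizes parameter are hypothetical, and the check for set1 appearing among the other sets is left out.

```c
#include <assert.h>
#include <stddef.h>

/* sizes[0] is N, the size of set1; sizes[1..setnum-1] are the sizes of
 * set2, set3, ... Returns 1 or 2, the number of the algorithm to use. */
static int choose_diff_algo(const size_t *sizes, size_t setnum) {
    assert(setnum >= 1);
    unsigned long long algo_one_work = 0; /* accumulates N * M */
    unsigned long long algo_two_work = 0; /* accumulates L */
    for (size_t j = 0; j < setnum; j++) {
        algo_one_work += sizes[0];
        algo_two_work += sizes[j];
    }
    algo_one_work /= 2; /* head start for algorithm 1 */
    return (algo_one_work <= algo_two_work) ? 1 : 2;
}
```

For sizes 100, 10, 10 we get \(M = 3\), \(MN/2 = 150\) and \(L = 120\), so algorithm 2 is chosen. For sizes 10, 100, 100 we get \(MN/2 = 15\) and \(L = 210\), so algorithm 1 is chosen.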

Reading the source

/*
 * Copyright (c) 2009-2012, Salvatore Sanfilippo <antirez at gmail dot com>
 * Copyright (c) 2009-2012, Pieter Noordhuis <pcnoordhuis at gmail dot com>
 * All rights reserved.
*/
static int zsetChooseDiffAlgorithm(zsetopsrc *src, long setnum) {
    int j;

    /* Select what DIFF algorithm to use.
     *
     * Algorithm 1 is O(N*M + K*log(K)) where N is the size of the
     * first set, M the total number of sets, and K is the size of the
     * result set.
     *
     * Algorithm 2 is O(L + (N-K)log(N)) where L is the total number of elements
     * in all the sets, N is the size of the first set, and K is the size of the
     * result set.
     *
     * We compute what is the best bet with the current input here. */
    long long algo_one_work = 0;
    long long algo_two_work = 0;

    for (j = 0; j < setnum; j++) {
        /* If any other set is equal to the first set, there is nothing to be
         * done, since we would remove all elements anyway. */
        if (j > 0 && src[0].subject == src[j].subject) {
            return 0;
        }

        algo_one_work += zuiLength(&src[0]);
        algo_two_work += zuiLength(&src[j]);
    }

    /* Algorithm 1 has better constant times and performs less operations
     * if there are elements in common. Give it some advantage. */
    algo_one_work /= 2;
    return (algo_one_work <= algo_two_work) ? 1 : 2;
}

posted @ 2021-05-29 22:46  migeater