Thoughts on Diffing Two Collections with Millions of Elements

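Recently, a project required me to compare two datasets with millions of records each. There are roughly two approaches. First: sort both collections, extract the needed attributes from each object and concatenate them into a string, compute a digest of the concatenated string, and then compare the two digest values. Second: override hashCode and equals on the objects and compare the two collections directly to find the elements that differ. I ran some comparisons of the two approaches, focusing on elapsed time, memory usage, and the number of GC collections.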

I. Code for the two comparison approaches

Digest comparison. SHA-1 is used as the digest algorithm here.

    // Imports required by the enclosing class (CollectionUtil):
    // import java.nio.charset.StandardCharsets;
    // import java.security.MessageDigest;
    // import java.security.NoSuchAlgorithmException;

    /**
     * Computes the SHA-1 digest of the given text.
     *
     * @param content the text to digest
     * @return the digest as an uppercase hexadecimal string
     */
    public static String getMessageDigest(String content) {
        try {
            MessageDigest messageDigest = MessageDigest.getInstance("SHA-1");
            byte[] hash = messageDigest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                int v = b & 0xFF;
                // Pad single-digit values so every byte yields two hex characters
                if (v < 16) {
                    sb.append('0');
                }
                sb.append(Integer.toString(v, 16).toUpperCase());
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Fail loudly: returning an error message as the "digest" would let
            // two failed calls compare as equal
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }
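The introduction describes sorting both collections before digesting them, so that the digest does not depend on element order. The following is a minimal sketch of that idea, not the code used in the measurements below. The class name `SortedDigest` and its methods are hypothetical, and it assumes the elements are plain strings that never contain the NUL character (used here as a separator). It feeds the elements into the `MessageDigest` one at a time instead of building one very large concatenated string first.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: order-independent SHA-1 digest of a collection of strings.
final class SortedDigest {

    /** Returns the SHA-1 digest of the elements, independent of their input order. */
    static byte[] digestOf(Collection<String> items) {
        // Sort a copy so the caller's collection is left untouched
        List<String> sorted = new ArrayList<>(items);
        Collections.sort(sorted);
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            for (String item : sorted) {
                md.update(item.getBytes(StandardCharsets.UTF_8));
                // Separator byte so that ["ab", "c"] and ["a", "bc"] digest differently
                // (assumes no element contains the NUL character)
                md.update((byte) 0);
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 is not available", e);
        }
    }

    /** True if both collections digest to the same value. */
    static boolean sameContent(Collection<String> a, Collection<String> b) {
        return MessageDigest.isEqual(digestOf(a), digestOf(b));
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("b", "a", "c");
        List<String> second = Arrays.asList("c", "b", "a");
        System.out.println(sameContent(first, second));
    }
}
```

Sorting adds an O(N log N) step, and a matching digest only says the two collections are the same; it cannot say which elements differ.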

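Full comparison. The JDK's bulk collection methods such as removeAll and containsAll look convenient, but tracing their source shows that, when both collections are Lists, they amount to nested loops, so their time complexity is O(N²). We therefore use a HashMap as an intermediary, which gives an O(N) comparison. The code is as follows: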

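    /**
     * Returns the elements that appear in only one of the two collections.
     *
     * @param collmax the first collection
     * @param collmin the second collection
     * @return the elements that differ between the two collections
     */
    @SuppressWarnings({ "rawtypes", "unchecked" })
    public static Collection getDiffent(Collection collmax, Collection collmin) {
        // LinkedList avoids array copying on growth when the difference is large
        Collection csReturn = new LinkedList();
        Collection max = collmax;
        Collection min = collmin;
        // Compare sizes first: the larger collection fills the map,
        // so the lookup loop below runs over the smaller one
        if (collmax.size() < collmin.size()) {
            max = collmin;
            min = collmax;
        }
        // Size the map for the default load factor (0.75) so it is not rehashed while being filled
        Map<Object, Integer> map = new HashMap<Object, Integer>((int) (max.size() / 0.75f) + 1);
        for (Object object : max) {
            map.put(object, 1); // 1 = seen only in the larger collection so far
        }
        for (Object object : min) {
            if (map.get(object) == null) {
                csReturn.add(object); // present only in the smaller collection
            } else {
                map.put(object, 2); // 2 = present in both collections
            }
        }
        for (Map.Entry<Object, Integer> entry : map.entrySet()) {
            if (entry.getValue() == 1) {
                csReturn.add(entry.getKey()); // present only in the larger collection
            }
        }
        return csReturn;
    }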
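The map-based lookup above depends on the elements implementing hashCode and equals consistently, which is what the second approach in the introduction calls for. A minimal sketch of what that might look like for a domain object follows. The `Account` class and its fields are hypothetical, chosen only for illustration, and `onlyIn` is an assumed helper that reports the differences of one side at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class TypedDiff {

    /** Hypothetical value class: equality is defined by the fields that matter for the comparison. */
    static final class Account {
        final long id;
        final String name;

        Account(long id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Account)) return false;
            Account other = (Account) o;
            return id == other.id && Objects.equals(name, other.name);
        }

        @Override
        public int hashCode() {
            // Must use the same fields as equals so equal objects land in the same bucket
            return Objects.hash(id, name);
        }

        @Override
        public String toString() {
            return id + ":" + name;
        }
    }

    /** Returns the elements of left that have no equal element in right. */
    static <T> List<T> onlyIn(Collection<T> left, Collection<T> right) {
        Set<T> lookup = new HashSet<>(right); // expected O(1) membership checks
        List<T> result = new ArrayList<>();
        for (T item : left) {
            if (!lookup.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Account> a = Arrays.asList(new Account(1, "x"), new Account(2, "y"));
        List<Account> b = Arrays.asList(new Account(2, "y"), new Account(3, "z"));
        System.out.println("only in a: " + onlyIn(a, b));
        System.out.println("only in b: " + onlyIn(b, a));
    }
}
```

Calling `onlyIn` in both directions costs two passes, but it keeps track of which side each difference came from, which the single merged result of getDiffent does not.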

II. Verification

The two ways of comparing lists were introduced above. Now let's test them.

package com.example.demo;

import com.example.demo.util.CollectionUtil;
import org.springframework.util.StringUtils;

import java.util.*;

/**
 * Description: compares two large collections
 *
 * @author liuyao
 * @create 2018-12-18 14:14
 */
public class CompareTest {
    public static void main(String[] args) {
        List<String> list1 = createLiet(1000000);
        List<String> list2 = createLiet2(1000000);
        System.out.println("Starting comparison---------");

        // Full comparison
        long nowTime2 = System.currentTimeMillis();
        Collection aa = CollectionUtil.getDiffent(list1,list2);
        System.out.println(aa);
        System.out.println("Full comparison took (ms): "+(System.currentTimeMillis()-nowTime2));
        // Digest comparison
        long nowTime1 = System.currentTimeMillis();
        String list1Str = CollectionUtil.getMessageDigest(StringUtils.collectionToDelimitedString(list1, ","));
        String list2Str = CollectionUtil.getMessageDigest(StringUtils.collectionToDelimitedString(list2,","));
        if (list1Str.equals(list2Str)){
            System.out.println("The two lists are identical");
        }
        System.out.println("Digest comparison took (ms): "+(System.currentTimeMillis()-nowTime1));
    }

    private static List<String> createLiet2(int count) {
        Set<String> set = new HashSet<>(count);
        for (int i=10;i<count+10;i++) {
            // "测试数据" means "test data"; the literal is kept because its UTF-8 length affects digest timing
            set.add(i+"测试数据abc");
        }

        return new ArrayList<>(set);
    }

    private static List<String> createLiet(int count) {
        Set<String> set = new HashSet<>(count);
        for (int i=0;i<count;i++) {
            // "测试数据" means "test data"; the literal is kept because its UTF-8 length affects digest timing
            set.add(i+"测试数据abc");
        }

        return new ArrayList<>(set);
    }

}

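Because the data volume is large, garbage collection also has a significant effect on the timings, so the JVM option -XX:+PrintGCDetails is added in order to compare GC behavior.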
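As a complement to reading the GC log, GC counts and accumulated GC time can also be sampled from inside the program through the standard `java.lang.management` API. The sketch below is an assumption about how one might attribute collections to each comparison phase; the `GcSnapshot` class is hypothetical and was not part of the original test.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: totals across all collectors reported by the JVM.
final class GcSnapshot {
    final long count;      // number of collections so far
    final long timeMillis; // approximate accumulated collection time

    GcSnapshot(long count, long timeMillis) {
        this.count = count;
        this.timeMillis = timeMillis;
    }

    /** Sums collection counts and times over every garbage collector in this JVM. */
    static GcSnapshot take() {
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount();
            long t = gc.getCollectionTime();
            // Both methods return -1 when the value is undefined for a collector
            if (c > 0) count += c;
            if (t > 0) time += t;
        }
        return new GcSnapshot(count, time);
    }

    /** Difference between this snapshot and an earlier one. */
    GcSnapshot minus(GcSnapshot earlier) {
        return new GcSnapshot(count - earlier.count, timeMillis - earlier.timeMillis);
    }

    public static void main(String[] args) {
        GcSnapshot before = take();
        // Placeholder workload; in the test above this would be one of the two comparisons
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            sb.append(i);
        }
        GcSnapshot delta = take().minus(before);
        System.out.println("GC count: " + delta.count + ", GC time (ms): " + delta.timeMillis);
    }
}
```

Taking one snapshot before and one after each phase gives a per-phase GC count that can be printed next to the elapsed time.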

The results are as follows.

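In this run the full comparison takes longer than the digest comparison. Looking at the GC log, however, one full GC occurred during the full comparison and took 683 ms. A full GC pauses the application threads, so this result is heavily skewed. Next we change the list size to 500,000 and run the test again.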

To compare several sample sizes, below is the run with 2,000,000 elements.

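From these runs, the full comparison performs better than the digest comparison, and it also yields the specific differences between the two collections. It triggers fewer GC collections as well.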

Posted 2018-12-21 23:43 by yao_1
