Removing Duplicate Data with PyMongo

Reposted from: 李冬琳's blog, URL: http://ldllidonglin.github.io/blog/2015/12/14/2015-12-14-mongodb%E5%8E%BB%E9%99%A4%E9%87%8D%E5%A4%8D%E6%95%B0%E6%8D%AE/

1. Unique index

db.things.ensureIndex({'key' : 1}, {unique : true, dropDups : true})

However, dropDups is not supported by MongoDB 2.7.5 or newer, so this method only works on versions older than 2.7.5.
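On newer servers, one possible approach is to remove the duplicates first (for example with the aggregation method in section 2) and only then create a plain unique index, so that later duplicate inserts are rejected. The following is a minimal sketch, assuming `collection` is a PyMongo Collection object; the helper name `create_unique_index` is hypothetical and not part of the original post.

```python
def create_unique_index(collection, field):
    """Create an ascending unique index on `field` and return its name.

    `collection` is assumed to be a pymongo Collection. If duplicate
    values still exist, the server rejects the index build and PyMongo
    raises an error, so call this only after deduplicating.
    """
    # 1 means ascending order, the same value as pymongo.ASCENDING
    return collection.create_index([(field, 1)], unique=True)
```

Usage would look like `create_unique_index(collection, 'key')`, which mirrors the shell command above without the dropDups option.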

2. Use aggregate to find the duplicated data, then delete the duplicates one by one (relatively inefficient). Python code:

# collection is assumed to be a pymongo Collection object
# First, find the duplicated data
deleteData = collection.aggregate([
    {'$group': {
        '_id': {'firstField': "$area", 'secondField': "$time_point"},
        'uniqueIds': {'$addToSet': "$_id"},
        'count': {'$sum': 1}
    }},
    {'$match': {
        'count': {'$gt': 1}
    }}
])
for d in deleteData:
    first = True
    for did in d['uniqueIds']:
        if not first:    # keep the first document, delete the rest
            collection.delete_one({'_id': did})
        first = False
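Because the loop above sends one delete_one call per duplicate, a possible way to reduce round trips is to collect the extra _id values and remove them in batches with delete_many and an $in filter. This is only a sketch under assumptions: the helper names `ids_to_delete` and `delete_duplicates` and the `batch_size` parameter are hypothetical, and `groups` is assumed to be the cursor (or list) returned by the aggregate call above.

```python
def ids_to_delete(groups):
    """Return every _id except the first one from each duplicate group."""
    ids = []
    for group in groups:
        # uniqueIds[0] is kept; the remaining ids are treated as duplicates
        ids.extend(group['uniqueIds'][1:])
    return ids


def delete_duplicates(collection, groups, batch_size=1000):
    """Delete duplicates in batches; return how many documents were deleted."""
    ids = ids_to_delete(groups)
    deleted = 0
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        # One delete_many call removes the whole batch
        result = collection.delete_many({'_id': {'$in': batch}})
        deleted += result.deleted_count
    return deleted
```

As in the original loop, which document of each group survives is arbitrary, since $addToSet does not guarantee the order of the collected ids.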


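3. When the data set is very large, the second method needs to write the aggregation results into a collection. Add an $out stage to the aggregate pipeline. Also, because aggregate only accepts two positional arguments (self, which is implicit, and the pipeline), extra options have to be passed as keyword arguments, in the form allowDiskUse=True.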

# Find the duplicates and write them into the result collection
def findDuplicate():
    collection.aggregate([
        {'$group': {
            '_id': {'firstField': "$mid", 'secondField': "$created_at"},
            'uniqueIds': {'$addToSet': "$_id"},
            'count': {'$sum': 1}
            }
        },
        {'$match': {
            'count': {'$gt': 1}
            }
        },
        {'$out': 'result'}
    ], allowDiskUse=True)

# Read the result collection and delete every duplicate except the first
def deleteDup():
    deleteData = db.result.find()
    for d in deleteData:
        first = True
        for did in d['uniqueIds']:
            if not first:
                collection.delete_one({'_id': did})
            first = False
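The two steps above can also be sketched as one parameterized routine, which may be convenient when the grouping fields differ between collections. This is a hedged sketch, not the original author's code: the names `build_duplicate_pipeline`, `find_and_delete_duplicates`, `source_name` and `out_name` are hypothetical, and `db` is assumed to be a PyMongo Database object that supports `db[name]` lookups.

```python
def build_duplicate_pipeline(fields, out_collection):
    """Build a pipeline that groups by `fields`, keeps groups with more
    than one document, and writes them to `out_collection` via $out."""
    group_key = {name: '$' + name for name in fields}
    return [
        {'$group': {
            '_id': group_key,
            'uniqueIds': {'$addToSet': '$_id'},
            'count': {'$sum': 1},
        }},
        {'$match': {'count': {'$gt': 1}}},
        {'$out': out_collection},
    ]


def find_and_delete_duplicates(db, source_name, fields, out_name='result'):
    """Run the pipeline, then delete all but one document per group."""
    source = db[source_name]
    # Options such as allowDiskUse are passed as keyword arguments
    source.aggregate(build_duplicate_pipeline(fields, out_name),
                     allowDiskUse=True)
    deleted = 0
    for group in db[out_name].find():
        extra_ids = group['uniqueIds'][1:]  # keep the first id
        if extra_ids:
            result = source.delete_many({'_id': {'$in': extra_ids}})
            deleted += result.deleted_count
    return deleted
```

A call for the fields used above might look like `find_and_delete_duplicates(db, 'weibo', ['mid', 'created_at'])`, where the collection name 'weibo' is only a placeholder.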

posted on 2020-07-16 20:08 天马行宇