A small trick for speeding up Python dictionary lookups

Consider this problem: a Python dictionary holds 10 million key-value pairs, and you need to insert 1,000 more. What is the fastest way to do it?

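In my own test, the slow version took 300 seconds and the fast version only 0.3 seconds. The slow version spends nearly all of its time on the membership check in each iteration.

The likely culprit is the line if k in C14.keys(). On Python 2, dict.keys() builds a new list of all keys, so each check copies and scans up to 10 million entries; on Python 3 it returns a view, and the check is an ordinary hash lookup.
The improved version creates the dictionary with defaultdict(int) instead of dict(), so the check is not needed at all.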
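
To see why the membership check matters, here is a minimal timing sketch. It is not the author's benchmark: the dictionary size, the repeat count, and the function names are assumptions chosen so the demo finishes quickly, and list(d.keys()) is used to emulate what Python 2's dict.keys() does.

```python
import timeit

# Small stand-in for the 10-million-entry dictionary (size is an assumption).
d = {i: i for i in range(200_000)}
missing = -1  # a key that is not in d, the worst case for a linear scan


def check_with_key_list():
    # Emulates Python 2's dict.keys(): build a list of keys, then scan it.
    return missing in list(d.keys())


def check_with_dict():
    # Hash lookup on the dict itself; no list is built.
    return missing in d


t_list = timeit.timeit(check_with_key_list, number=20)
t_dict = timeit.timeit(check_with_dict, number=20)
print("list of keys: {:.4f} s, dict: {:.6f} s".format(t_list, t_dict))
```

The exact numbers depend on the machine, but the list-based check grows with the size of the dictionary while the direct check does not.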

Original code: extremely slow (especially when the original dictionary is large)

# Test: slow version
import time
from collections import Counter

import pandas as pd
from tqdm import tqdm

C14 = dict()  # Note: a plain dict here, not a defaultdict
for i in tqdm(range(10000000)):
    C14[i] = i

print("start processing test data:")
s_time = time.time()


data = pd.read_csv('../../test.gz')
print("read test.gz over")

print("start to process C14:")
s_tt = time.time()

# data is a DataFrame; data['C14'].values is a NumPy array that behaves like a list, e.g. [42, 523, 23, 24, 3, 4, 1, 5, 3]
C14_list = data['C14'].values
for k, v in tqdm(Counter(C14_list).items()):
    if k in C14.keys():  # this membership test is where the time goes
        C14[k] += v
    else:
        C14[k] = v

e_tt = time.time()
print("C14 over,cost time:{} seconds".format(e_tt-s_tt))

e_time = time.time()
print("test data processing over, cost {} minutes".format((e_time-s_time)/60))
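
If you would rather keep a plain dict, the same merge can be written without calling keys() at all. The following is a minimal sketch, not the author's code; the small C14 and C14_list values are hypothetical stand-ins for the 10-million-entry dictionary and for data['C14'].values.

```python
from collections import Counter

# Hypothetical small inputs standing in for the real data in the post.
C14 = {i: i for i in range(10)}
C14_list = [3, 3, 42, 7, 42, 42]

for k, v in Counter(C14_list).items():
    # dict.get returns 0 for a missing key, so no membership test is needed.
    C14[k] = C14.get(k, 0) + v

print(C14[3], C14[7], C14[42])  # 5 8 3
```

Writing if k in C14: instead of if k in C14.keys(): would work as well, since membership on the dict itself is a hash lookup on both Python 2 and Python 3.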

 

 

Improved code: extremely fast

# Test: fast version
import time
from collections import Counter, defaultdict

import pandas as pd
from tqdm import tqdm

C14 = defaultdict(int)  # defaultdict(int): a missing key starts at int(), i.e. 0, when first accessed
for i in tqdm(range(10000000)):
    C14[i] = i

print("start processing test data:")
s_time = time.time()

data = pd.read_csv('../../test.gz')
print("read test.gz over")

print("start to process C14:")
s_tt = time.time()

C14_list = data['C14'].values
for k, v in tqdm(Counter(C14_list).items()):
    C14[k] += v  # a missing key starts at 0, so no membership test is needed
    # The four lines below are no longer needed:
    # if k in C14.keys():
    #     C14[k] += v
    # else:
    #     C14[k] = v

e_tt = time.time()
print("C14 over,cost time:{} seconds".format(e_tt-s_tt))

e_time = time.time()
print("test data processing over, cost {} minutes".format((e_time-s_time)/60))
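
For readers new to defaultdict, here is a small sketch of the behavior the fast version relies on, plus one caveat worth knowing. The counts dictionary and its keys are made up for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)
counts["a"] += 2    # missing key starts at int() == 0, then 2 is added
print(counts["a"])  # 2

# Caveat: merely reading a missing key with [] inserts it.
print(len(counts))  # 1
_ = counts["b"]
print(len(counts))  # 2

# Membership tests and .get() do not insert anything.
print("c" in counts, counts.get("c"), len(counts))  # False None 2
```

So if the dictionary is later used only for lookups, prefer in or .get() over [] to avoid growing it by accident.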

 

posted @ 2020-12-18 20:46  qiezi_online