工作笔记
问题:
有两个CSV文件,文件A含两列id, 暂叫 id1, id2,文件B中含有列id2和一些有用信息。
目标:
合并文件A和B,以id2进行关联,为文件B增加一列id1,实际上就是SQL中的join。要求不通过数据库来完成。
原文件分析:
文件格式
查看文件格式,及换行 file B.csv
格式为 B.csv: Non-ISO extended-ASCII text, with CRLF, LF line terminators
用 cat B.csv | head 查看行尾符为
zp0008162,M-2M-\M-CM-OM-@M-v,EM-MM-7M->M-1M-2M-?M-VM-WM-AM-vM-?M-F,EM-;M-$M-JM-?M-UM-PM-FM-8M-WM-(M-<M-<, , ,132,0,2565,0,0,1735.59,1800,20,308,0,2128,0,0,0,0,0,181.6,72,6.81,0,0,392.41,1715.59,1295.59,001101021193661,2017-11-18^M$
zp0110069,M-2M-qM-QM-`M-WM-S,EM-IM-vM-2M-!M-DM-ZM-?M-F,EM-;M-$M-JM-?M-UM-PM-FM-8M-WM-(M-<M-<, , ,132,0,2493,0,0,1535.59,1600,20,308,0,1928,0,0,0,0,0,181.6,72,6.81,0,0,392.41,1515.59,1095.59,001101021755216,2017-11-18^M$
zp0150086,M-3M-#M-;M-",EM-7M-EM-IM-dM-SM-0M-OM-qM-?M-F,EM-<M-<M-JM-&M-UM-PM-FM-8, , ,465,0,3697,0,0,2588.73,2854.5,0,536,0,3390.5,0,0,0,0,0,255.2,72,9.57,0,0,801.77,2588.73,1587.73,231804001386838,2017-11-18^M$
zp0160180,M-3M-BM-3M-?,EM-VM-'M-VM-zM-VM-PM-PM-D,EM-UM-PM-FM-8M-7M-GM-WM-(M-<M-<, , ,0,0,1770,0,0,1379.59,1320,20,300,0,1640,0,0,0,0,0,181.6,72,6.81,0,0,260.41,1359.59,1079.59,231804002354736,2017-11-18^M$
文件是以 ^M$ 结尾的,在ubuntu下用 fromdos B.csv 处理一下。处理完后用 cat -A B.csv 查看文件,文件以 $ 结尾。
字符编码
文件是GBK编码的,处理一下换成UTF-8: iconv -f 'gbk' -t 'utf-8' B.csv
如果要处理的文件比较多,可以用awk生成执行列表。awk 处理单引号如下,单引号要转义并且要用单引号再包一层:
ls *.csv | awk '{print "iconv -f '\'gbk\'' -t '\''utf-8'\'' " $1 " > utf8_"$1}' > proc.sh
处理完后使用 file B.csv 查看文件,文件已完为 B.csv: UTF-8 Unicode text
在查看原数据文件时,发现csv文件内容部分行有问题。估计是一些行太长做了换行处理导致列对不上了。

以文本方式查看文件内容,发现一些行以逗号开头与WPS中看到的现像一致。

下面要处理这些问题行,将这些行找出来,与上一行合并。先通过sed找出这些行 sed -n '/^,/p' B.csv

确认问题后,还是使用sed进行转换:
sed -i -e :a -e '$!N;s/\n,/,/;ta' -e 'P;D' B.csv
转换之前可以先备份一下,然后通过 diff -up a b 查看转换的效果。

从diff的效果来看,有问题的行,已经两行合一行了。
文件都合法后,开始写程序处理。程序逻辑:对文件A的两列,id1,id2,组建字典,以id2为key。将文件B中的列id2作key来引用字典,将key=>value插入每一行。
1 #!/usr/local/bin/python3 2 # -*- coding: utf-8 -*- 3 4 import csv 5 import sys 6 7 class User: 8 def __init__(self, file): 9 with open(file=file, mode='r', encoding='utf8') as f: 10 L = [i for i in csv.reader(f)] 11 del L[0] 12 self.d = [[i[0], i[-2]] for i in L if i[-2]] 13 self.dic = {i[-2]:i[0] for i in L if i[-2]} 14 15 def __repr__(self): 16 s = '' 17 for i in self.d: 18 s += '%s' % i 19 s += '\n' 20 return s 21 22 def __str__(self): 23 print(self.dic) 24 25 def cvs_output(L, output=sys.stdout): 26 for i in L: 27 for j in i: 28 output.write(j) 29 if j != i[-1]: 30 output.write(',') 31 output.write('\n') 32 33 def cvs_output(L, fn): 34 with open(file=fn, mode='wt') as f: 35 w = csv.writer(f) 36 for i in L: 37 w.writerow(i) 38 39 40 def cvs_output1(L, fn): 41 with open(file=fn, mode='wt') as f: 42 for i in L: 43 c = 0 44 for j in i: 45 f.write(j) 46 if c != len(i) - 1: 47 f.write(',') 48 c += 1 49 f.write('\n') 50 51 52 def proc_file(file_name, d={}, fmode='rt'): 53 with open(file=file_name, mode=fmode, encoding='utf8') as f: 54 L = [i for i in csv.reader(f)] 55 title = L[0] 56 title.insert(0, 'id') 57 del L[0] 58 #print(title) 59 #print((L[0][0])) 60 #k = L[0][0] 61 #print(d.get(k)) 62 tgt = [] 63 tgt.append(title) 64 for i in L: 65 if d.get(i[0]): 66 #print(i[0],d.get(i[0])) 67 i.insert(0, d.get(i[0])) 68 #print(i) 69 #sys.exit() 70 tgt.append(i) 71 #print(tgt) 72 return tgt 73 74 75 76 if __name__ == '__main__': 77 78 path_pre = 'data_salary/2017_10-11/source/' 79 tgt_pre = 'data_salary/2017_10-11/target/' 80 user_list = 'data_salary/2017_10-11/source/utf8_userlist.csv' 81 82 inputlist = [ 83 'utf8_2017.10东_zp.csv', 84 'utf8_2017.10南_zp.csv', 85 'utf8_2017.10北_zp.csv', 86 'utf8_2017.11东_zb.csv', 87 'utf8_2017.11南_zb.csv', 88 'utf8_2017.11北_zb.csv', 89 ] 90 91 outlist = [] 92 for i in range(6): 93 s = tgt_pre + inputlist[i] 94 outlist.append(s) 95 96 u = User(file=user_list) 97 print("%s\t%s" % ('钉钉用户:\t', len(u.dic))) 98 c = 0 99 c1 = 0 100 for x in inputlist: 101 s = path_pre + x 102 t = proc_file(file_name=s, d=u.dic) 103 cvs_output1(t, fn=outlist[c1]) 104 c1 += 1 105 count = len(t) 106 print("%s\t%s" % (x, count)) 107 c += count 108 print("共计:\t\t\t%s" % c)
posted on 2017-12-13 16:57 Digital_life 阅读(284) 评论(0) 收藏 举报
浙公网安备 33010602011771号