在我的工作中,有时候要处理大批量的数据的excel表格,通常是几万行的,如果手工做的话,大概需要半天时间,而且还不一定能保证不出错,如果是筛选数据的话,还不能对筛选好的数据直接进行求取平均值的需求,必须拷贝到另一张表里进行平均值操作。下面我针对具体的需求,来说一下python是如何处理的。

     需求:对OPER、STATENAME两列进行筛选,得到CPU、ELAPSE两列的数据,然后求出它们的平均值,最好有几个OPER生成几张表

     代码:

 

代码
1 import xlrd,sys
2  from pyExcelerator import*
3 OPER =5
4 CPU =6
5 ELAPSE =8
6 STATENAME =15
7 FILENAME ="WUPHG1HTC.xls"
8  class CpuTestTime:
9 StateTable = {}
10 OperList = []
11 def__init__(self,filename):
12 self.fname = filename
13
14 def openExcelFile(self,index):
15 self.bk = xlrd.open_workbook(self.fname)
16 self.sh = self.bk.sheets()[index]
17 self.nrows = self.sh.nrows
18 self.dic3 = {}
19
20 def cpuTestTime(self):
21 for i in range(1,self.nrows):
22 ifnot str(self.sh.row(i)[OPER])[7:-1] in CpuTestTime.OperList: #oper
23   CpuTestTime.OperList.append(str(self.sh.row(i)[OPER])[7:-1])
24 CpuTestTime.StateTable[str(self.sh.row(i)[OPER])[7:-1]]=[]
25 for i in range(1,self.nrows):
26 ifnot str(self.sh.row(i)[STATENAME])[7:-1] in CpuTestTime.StateTable[str(self.sh.row(i)[OPER])[7:-1]]:#statename
27   CpuTestTime.StateTable[str(self.sh.row(i)[OPER])[7:-1]].append(str(self.sh.row(i)[STATENAME])[7:-1])
28
29 dic1 ={}
30 dic2 ={}
31 for key,values in CpuTestTime.StateTable.items():
32 for value in values:
33 for tt in ['ELAPSED_TIME','CPU_ELAPSED_TIME','COUNT']:
34 dic1.update({tt:0})
35 dic2.update({value:dic1.copy()})
36 self.dic3.update({key:dic2.copy()})
37
38 for i in xrange(1,self.nrows):
39 self.dic3[str(self.sh.row(i)[OPER])[7:-1]][str(self.sh.row(i)[STATENAME])[7:-1]]['ELAPSED_TIME'] += self.sh.row(i)[ELAPSE].value
40 self.dic3[str(self.sh.row(i)[OPER])[7:-1]][str(self.sh.row(i)[STATENAME])[7:-1]]['CPU_ELAPSED_TIME'] += self.sh.row(i)[CPU].value
41 self.dic3[str(self.sh.row(i)[OPER])[7:-1]][str(self.sh.row(i)[STATENAME])[7:-1]]['COUNT'] +=1
42
43 for key,values in CpuTestTime.StateTable.items():
44 for value in values:
45 for tt in ['ELAPSED_TIME','CPU_ELAPSED_TIME']:
46 self.dic3[key][value][tt]=self.dic3[key][value][tt]/self.dic3[key][value]['COUNT']
47
48 def saveResult(self):
49 wb = Workbook()
50 for key,values in CpuTestTime.StateTable.items():
51 loop =1
52 ws = wb.add_sheet(key)
53 ws.write(0,0,'STATE')
54 ws.write(0,1,'ELAPSED_TIME')
55 ws.write(0,2,'CPU_ELAPSED_TIME')
56 ws.write(0,3,'COUNT')
57 for value in values:
58 ws.write(loop,0,value)
59 ws.write(loop,1,round(self.dic3[key][value]['ELAPSED_TIME'],2))
60 ws.write(loop,2,round(self.dic3[key][value]['CPU_ELAPSED_TIME'],2))
61 ws.write(loop,3,self.dic3[key][value]['COUNT'])
62 loop = loop+1
63 wb.save('result'+FILENAME)
64
65  if"__main__"==__name__:
66 excelFile = CpuTestTime(FILENAME)
67 excelFile.openExcelFile(0)
68 excelFile.cpuTestTime()
69 excelFile.saveResult()

 

      首先讲一下两个库:xlrd 和 pyExcelerator,xlrd从名字上来看就是读excel比较方便的,而pyExcelerator对于生成excel并写excel很方便。让我们分析一下上述代码的两个函数:

 

1 def openExcelFile(self,index):
2 self.bk = xlrd.open_workbook(self.fname)
3 self.sh = self.bk.sheets()[index]
4 self.nrows = self.sh.nrows
5 self.dic3 = {}

      行2是打开一个文件名为self.fname的excel工作薄,行3是获取工作薄中的第(index+1)张表,因为一个工作薄里可以新建好几张表,它们的索引从0开始,行4记录了这张表里数据的行数。下面读取每个表格的值就很容易了,sh.row(i)[3]这个就代表第i行的第4列.特别注意sh.row(i)[3][7:-1]才能取到正确的字符串值,和excel的编码有关。

 

代码
1 def saveResult(self):
2 wb = Workbook()
3 for key,values in CpuTestTime.StateTable.items():
4 loop =1
5 ws = wb.add_sheet(key)
6 ws.write(0,0,'STATE')
7 ws.write(0,1,'ELAPSED_TIME')
8 ws.write(0,2,'CPU_ELAPSED_TIME')
9 ws.write(0,3,'COUNT')
10 for value in values:
11 ws.write(loop,0,value)
12 ws.write(loop,1,round(self.dic3[key][value]['ELAPSED_TIME'],2))
13 ws.write(loop,2,round(self.dic3[key][value]['CPU_ELAPSED_TIME'],2))
14 ws.write(loop,3,self.dic3[key][value]['COUNT'])
15 loop = loop+1
16 wb.save('result'+FILENAME)

          上面这段代码,行2创建一个excel工作薄,行5增加一个表,行6写cell(0,0)单元格,值为最后一个参数'STATE'。行16保存到文件。

          这种方式处理excel表比较好懂,也比较简单,当然,也可以用windows的COM来处理,不过就没有这么简单了~

posted on 2010-06-02 14:00  笨毛毛  阅读(6973)  评论(1编辑  收藏  举报