Data Transformation¶

Data Replacement¶

Replacement operations can be applied to both Series and DataFrame objects.

Replacing index labels: rename(index={}, columns={})

In [17]:

import numpy as np  # NumPy is designed mainly for numeric data
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
                  index = list('ABCDEFHIJK'),
                  columns=['Python','Tensorflow','Keras'])
df

Out[17]:

	Python	Tensorflow	Keras
A	0	4	8
B	2	1	2
C	3	5	3
D	9	8	8
E	6	0	3
F	2	3	0
H	3	1	9
I	2	0	2
J	8	5	5
K	5	0	9

In [19]:

df.rename(index={'A':'a','C':'c'},columns={'Python':'Java'})

Out[19]:

	Java	Tensorflow	Keras
a	0	4	8
B	2	1	2
c	3	5	3
D	9	8	8
E	6	0	3
F	2	3	0
H	3	1	9
I	2	0	2
J	8	5	5
K	5	0	9
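rename returns a new DataFrame and leaves the original untouched unless the result is assigned back. Besides dictionaries, it also accepts a function as the mapper. A minimal sketch, using a small hand-made frame (the name scores and its values are illustrative, not the random table above):

```python
import pandas as pd

# Small illustrative frame (values are made up for this sketch)
scores = pd.DataFrame({'Python': [0, 2], 'Keras': [8, 2]}, index=['A', 'B'])

# A dict renames only the listed labels; a function is applied to every label
renamed = scores.rename(index=str.lower, columns={'Python': 'Java'})

print(list(renamed.index))    # ['a', 'b']
print(list(renamed.columns))  # ['Java', 'Keras']
print(list(scores.columns))   # ['Python', 'Keras'] (the original is unchanged)
```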

Replacing values: replace

In [21]:

# Replace a single value across the whole table
df.replace(to_replace=0,value='zero')

Out[21]:

	Python	Tensorflow	Keras
A	zero	4	8
B	2	1	2
C	3	5	3
D	9	8	8
E	6	zero	3
F	2	3	zero
H	3	1	9
I	2	zero	2
J	8	5	5
K	5	zero	9

In [25]:

# Replace several values across the whole table
df.replace(to_replace={0:'zero',1:'one'})

Out[25]:

	Python	Tensorflow	Keras
A	zero	4	8
B	2	one	2
C	3	5	3
D	9	8	8
E	6	zero	3
F	2	3	zero
H	3	one	9
I	2	zero	2
J	8	5	5
K	5	zero	9

In [27]:

# Replace values in a specific column
df.replace(to_replace={'Python':0},value='zero') # only the 0s in the Python column are replaced

Out[27]:

	Python	Tensorflow	Keras
A	zero	4	8
B	2	1	2
C	3	5	3
D	9	8	8
E	6	0	3
F	2	3	0
H	3	1	9
I	2	0	2
J	8	5	5
K	5	0	9
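Two further forms of replace may be useful: a nested dictionary that gives each column its own mapping, and a pair of lists that match old and new values by position. A small sketch on a made-up frame (toy and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'Python': [0, 1, 2], 'Keras': [0, 5, 1]})

# Nested dict: {column: {old: new}} limits each mapping to one column
per_column = toy.replace({'Python': {0: -1}, 'Keras': {1: 100}})

# Two lists of equal length: old values are paired with new values by position
paired = toy.replace(to_replace=[0, 1], value=[10, 11])

print(per_column['Python'].tolist())  # [-1, 1, 2]
print(per_column['Keras'].tolist())   # [0, 5, 100]
print(paired['Keras'].tolist())       # [10, 5, 11]
```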

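Mathematical and Statistical Methods¶

Common Operations¶

df.count(axis): number of non-NaN values
df.max(axis = 0): maximum along axis 0, i.e. the maximum of each column
df.min(axis): minimum, computed along axis 0 by default
df.median(): median
df.sum(): sum
df.mean(axis = 1): mean along axis 1, i.e. the mean of each row
df.cumsum(): cumulative sum
df.cumprod(): cumulative product
df.std(): standard deviation
df.var(): variance
df.quantile(q = [0.2,0.4,0.8]): quantiles
df.pct_change(): compares each element with the previous one and returns the fractional change between them (multiply by 100 for a percentage)
df['Python'].rank(): ranks the values of a Series and returns a Series holding the rank of each element. Tied values receive the average of the ranks they span.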
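The less familiar methods in this list are easier to follow on a tiny Series where the results can be checked by hand. A minimal sketch (the Series s is made up for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 20, 40])

# Fractional change relative to the previous element; the first has no predecessor
print(s.pct_change().tolist())   # [nan, 1.0, 0.0, 1.0]

# Ties share the average of the ranks they would occupy (ranks 2 and 3 give 2.5)
print(s.rank().tolist())         # [1.0, 2.5, 2.5, 4.0]

# Median via quantile, using linear interpolation between data points
print(s.quantile(0.5))           # 20.0
```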

In [28]:

import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,100,size = (20,3)),
                  index = list('ABCDEFHIJKLMNOPQRSTU'),
                  columns=['Python','Tensorflow','Keras'])
df.loc['B','Python'] = None
df.loc['E','Python'] = None
df

Out[28]:

	Python	Tensorflow	Keras
A	7.0	98	19
B	NaN	41	44
C	51.0	99	39
D	1.0	64	12
E	NaN	78	44
F	58.0	61	15
H	13.0	76	91
I	34.0	3	96
J	90.0	35	27
K	28.0	68	52
L	67.0	38	60
M	72.0	4	63
N	91.0	26	55
O	56.0	85	38
P	22.0	63	8
Q	9.0	51	97
R	55.0	90	40
S	4.0	27	14
T	77.0	55	90
U	8.0	97	1

In [29]:

# Count the non-null elements in each column of df
df.count(axis=0) 

Out[29]:

Python        18
Tensorflow    20
Keras         20
dtype: int64

In [30]:

# Median of each column of df
df.median(axis=0)

Out[30]:

Python        42.5
Tensorflow    62.0
Keras         42.0
dtype: float64

In [32]:

df.head(3)

Out[32]:

	Python	Tensorflow	Keras
A	7.0	98	19
B	NaN	41	44
C	51.0	99	39

In [31]:

# Cumulative sum down each column of df
df.cumsum(axis=0)

Out[31]:

	Python	Tensorflow	Keras
A	7.0	98	19
B	NaN	139	63
C	58.0	238	102
D	59.0	302	114
E	NaN	380	158
F	117.0	441	173
H	130.0	517	264
I	164.0	520	360
J	254.0	555	387
K	282.0	623	439
L	349.0	661	499
M	421.0	665	562
N	512.0	691	617
O	568.0	776	655
P	590.0	839	663
Q	599.0	890	760
R	654.0	980	800
S	658.0	1007	814
T	735.0	1062	904
U	743.0	1159	905

In [33]:

# Rank from largest to smallest
df['Keras'].rank(ascending=False)

Out[33]:

A    15.0
B     9.5
C    12.0
D    18.0
E     9.5
F    16.0
H     3.0
I     2.0
J    14.0
K     8.0
L     6.0
M     5.0
N     7.0
O    13.0
P    19.0
Q     1.0
R    11.0
S    17.0
T     4.0
U    20.0
Name: Keras, dtype: float64

Data Integration¶

Concatenation¶

pandas concatenates tables with the pd.concat function, which works much like np.concatenate.

Matched concatenation

In [36]:

import numpy as np
df1 = pd.DataFrame(data=np.random.randint(0,100,size=(4,3)),columns=['A','B','C'])
df2 = pd.DataFrame(data=np.random.randint(0,100,size=(4,4)),columns=['A','B','C','D'])

In [37]:

df1

Out[37]:

	A	B	C
0	16	53	21
1	19	33	74
2	29	36	61
3	40	47	16

In [38]:

df2

Out[38]:

	A	B	C	D
0	37	14	14	15
1	10	37	85	55
2	58	68	64	54
3	34	54	54	61

In [41]:

pd.concat((df1,df1),axis=0)

Out[41]:

	A	B	C
0	16	53	21
1	19	33	74
2	29	36	61
3	40	47	16
0	16	53	21
1	19	33	74
2	29	36	61
3	40	47	16

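Mismatched concatenation

"Mismatched" means that the labels on the axis not being concatenated differ between the tables: the column labels differ when stacking vertically, or the row labels differ when joining horizontally.
There are two join modes:
- Outer join: keeps every label and fills the gaps with NaN (the default)
- Inner join: keeps only the labels that match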

In [48]:

# reset_index() gives the table a fresh row index; the old labels are kept in an "index" column
pd.concat((df1,df2),axis=0).reset_index()

Out[48]:

	index	A	B	C	D
0	0	16	53	21	NaN
1	1	19	33	74	NaN
2	2	29	36	61	NaN
3	3	40	47	16	NaN
4	0	37	14	14	15.0
5	1	10	37	85	55.0
6	2	58	68	64	54.0
7	3	34	54	54	61.0
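The inner-join mode listed above can be sketched as follows. The frames left and right are made-up stand-ins for df1 and df2, and ignore_index=True is shown as one way to renumber the rows without keeping the old labels:

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
right = pd.DataFrame({'A': [7, 8], 'B': [9, 10], 'C': [11, 12], 'D': [13, 14]})

# join='inner' keeps only the columns present in both frames;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1
inner = pd.concat((left, right), axis=0, join='inner', ignore_index=True)

print(list(inner.columns))  # ['A', 'B', 'C']
print(list(inner.index))    # [0, 1, 2, 3]
```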

Merging Data¶

Unlike concat, merge combines tables by matching rows on one or more shared columns.

One-to-one merge

In [49]:

from pandas import DataFrame
df1 = DataFrame({'employee':['Bob','Jake','Lisa'],
                'group':['Accounting','Engineering','Engineering'],
                })
df2 = DataFrame({'employee':['Lisa','Bob','Jake'],
                'hire_date':[2004,2008,2012],
                })

In [50]:

df1

Out[50]:

	employee	group
0	Bob	Accounting
1	Jake	Engineering
2	Lisa	Engineering

In [51]:

df2

Out[51]:

	employee	hire_date
0	Lisa	2004
1	Bob	2008
2	Jake	2012

In [52]:

pd.merge(left=df1,right=df2,on='employee')

Out[52]:

	employee	group	hire_date
0	Bob	Accounting	2008
1	Jake	Engineering	2012
2	Lisa	Engineering	2004

One-to-many merge

In [53]:

df3 = DataFrame({
    'employee':['Lisa','Jake'],
    'group':['Accounting','Engineering'],
    'hire_date':[2004,2016]})
df4 = DataFrame({'group':['Accounting','Engineering','Engineering'],
                       'supervisor':['Carly','Guido','Steve']
                })

In [54]:

df3

Out[54]:

	employee	group	hire_date
0	Lisa	Accounting	2004
1	Jake	Engineering	2016

In [55]:

df4

Out[55]:

	group	supervisor
0	Accounting	Carly
1	Engineering	Guido
2	Engineering	Steve

In [56]:

pd.merge(left=df3,right=df4,on='group')

Out[56]:

	employee	group	hire_date	supervisor
0	Lisa	Accounting	2004	Carly
1	Jake	Engineering	2016	Guido
2	Jake	Engineering	2016	Steve

Many-to-many merge

In [57]:

df5 = DataFrame({'employee':['Bob','Jake','Lisa'],
                 'group':['Accounting','Engineering','Engineering']})
df6 = DataFrame({'group':['Engineering','Engineering','HR'],
                'supervisor':['Carly','Guido','Steve']
                })

In [58]:

df5

Out[58]:

	employee	group
0	Bob	Accounting
1	Jake	Engineering
2	Lisa	Engineering

In [59]:

df6

Out[59]:

	group	supervisor
0	Engineering	Carly
1	Engineering	Guido
2	HR	Steve

In [60]:

pd.merge(left=df5,right=df6,on='group') # an inner join (intersection of the keys) is the default

Out[60]:

	employee	group	supervisor
0	Jake	Engineering	Carly
1	Jake	Engineering	Guido
2	Lisa	Engineering	Carly
3	Lisa	Engineering	Guido

In [61]:

pd.merge(left=df5,right=df6,on='group',how='outer')

Out[61]:

	employee	group	supervisor
0	Bob	Accounting	NaN
1	Jake	Engineering	Carly
2	Jake	Engineering	Guido
3	Lisa	Engineering	Carly
4	Lisa	Engineering	Guido
5	NaN	HR	Steve

In [62]:

pd.merge(left=df5,right=df6,on='group',how='left')

Out[62]:

	employee	group	supervisor
0	Bob	Accounting	NaN
1	Jake	Engineering	Carly
2	Jake	Engineering	Guido
3	Lisa	Engineering	Carly
4	Lisa	Engineering	Guido

In [63]:

pd.merge(left=df5,right=df6,on='group',how='right')

Out[63]:

	employee	group	supervisor
0	Jake	Engineering	Carly
1	Lisa	Engineering	Carly
2	Jake	Engineering	Guido
3	Lisa	Engineering	Guido
4	NaN	HR	Steve

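Normalizing keys
- When the two tables share no column name to join on, use left_on and right_on to specify which column on each side of the merge serves as the join key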

In [64]:

df7 = DataFrame({'employee':['Bobs','Linda','Bill'],
                'group':['Accounting','Product','Marketing'],
               'hire_date':[1998,2017,2018]})
df8 = DataFrame({'name':['Lisa','Bobs','Bill'],
                'hire_dates':[1998,2016,2007]})

In [65]:

df7

Out[65]:

	employee	group	hire_date
0	Bobs	Accounting	1998
1	Linda	Product	2017
2	Bill	Marketing	2018

In [66]:

df8

Out[66]:

	name	hire_dates
0	Lisa	1998
1	Bobs	2016
2	Bill	2007

In [67]:

pd.merge(left=df7,right=df8,left_on='employee',right_on='name') # press Shift+Tab in Jupyter to see the docstring hint

Out[67]:

	employee	group	hire_date	name	hire_dates
0	Bobs	Accounting	1998	Bobs	2016
1	Bill	Marketing	2018	Bill	2007
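When comparing the how options, it can help to see which side each merged row came from. merge accepts indicator=True for this. A minimal sketch on made-up frames (staff and bosses are illustrative names, not tables from this notebook):

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Bob', 'Jake'],
                      'group': ['Accounting', 'Engineering']})
bosses = pd.DataFrame({'group': ['Engineering', 'HR'],
                       'supervisor': ['Guido', 'Steve']})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
merged = pd.merge(left=staff, right=bosses, on='group', how='outer', indicator=True)

# Map each group to the side(s) its row came from
origin = dict(zip(merged['group'], merged['_merge'].astype(str)))
print(origin)
```

Rows marked left_only or right_only are exactly the ones an inner join would drop.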

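Project: Updating a Spreadsheet¶

This project is a program that updates cells in a spreadsheet of produce sales. It scans the spreadsheet, finds specific kinds of produce, and updates their prices.

About the data:
- Each row represents one sale. The columns are the type of produce sold (PRODUCE), the price per pound (COST PER POUND), the number of pounds sold (POUNDS SOLD), and the total revenue from the sale (TOTAL).
- Suppose the prices of Garlic, Celery, and Lemon were entered incorrectly. That leaves you with a tedious task: going through tens of thousands of rows and updating the price per pound in every Garlic, Celery, and Lemon row. A plain find-and-replace on the price will not work, because other produce may share the same price and you do not want to "correct" those rows by mistake. Doing this by hand could take hours, whereas a program finishes in seconds.
- The corrected prices are:
  - Garlic: 3.07
  - Celery: 1.19
  - Lemon: 1.27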

In [68]:

# Load the produce sales spreadsheet
data = pd.read_excel('../data/produceSales.xlsx')
data.head()

Out[68]:

	PRODUCE	COST PER POUND	POUNDS SOLD	TOTAL
0	Potatoes	0.86	21.6	18.58
1	Okra	2.26	38.6	87.24
2	Fava beans	2.69	32.8	88.23
3	Watermelon	0.66	27.3	18.02
4	Garlic	1.19	4.9	5.83

In [74]:

dic = {
    'Garlic':3.07,
    'Celery':1.19,
    'Lemon':1.27
}
for key,value in dic.items():
    ex = data['PRODUCE'] == key  # boolean mask of the rows for this product
    indexs = data.loc[ex].index  # row labels of the rows whose price must change

    # Set the unit price of those rows to the new price
    data.loc[indexs,'COST PER POUND'] = value
    # Recompute the total from the new price
    data.loc[indexs,'TOTAL'] = value * data.loc[indexs,'POUNDS SOLD']

# Write the updated data to a new Excel file
data.to_excel('update_produceSales.xlsx')
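The same update can also be written without an explicit loop by mapping the product names to their new prices. This is an alternative sketch, not the approach used in the cell above; sales is a tiny made-up stand-in for the spreadsheet and new_prices is an illustrative name:

```python
import pandas as pd

# Tiny stand-in for the spreadsheet (values are made up for this sketch)
sales = pd.DataFrame({
    'PRODUCE': ['Potatoes', 'Garlic', 'Lemon'],
    'COST PER POUND': [0.86, 1.19, 1.00],
    'POUNDS SOLD': [10.0, 2.0, 4.0],
})

new_prices = {'Garlic': 3.07, 'Celery': 1.19, 'Lemon': 1.27}

# map() looks up each product; products missing from the dict become NaN,
# so fillna() falls back to the existing price for them
sales['COST PER POUND'] = sales['PRODUCE'].map(new_prices).fillna(sales['COST PER POUND'])

# Recompute every total from the (possibly updated) price
sales['TOTAL'] = sales['COST PER POUND'] * sales['POUNDS SOLD']

print(sales['COST PER POUND'].tolist())  # [0.86, 3.07, 1.27]
```

Because the lookup is by product name, rows of other products that happen to share an old price are left alone.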

Case Study: Phone Sales Analysis¶

Practice for grouping and aggregation

In [81]:

# Load the data
import pandas as pd
data = pd.read_excel('../data/Phone.xlsx')
data.head()

Out[81]:

	订单号	订单日期	年	月	地区名字	省份名字	城市名字	品牌	型号	运行内存	机身内存	数量	用户名	用户姓名	年龄	年龄段	性别	手机号	价格	销售额
0	20180301004758	2020-01-14	NaN	NaN	中南地区	广西壮族自治区	梧州市	荣耀	荣耀9X	6G	64G	2	RVwhqiwMFc	刘捷	33	NaN	男	13794074871	1299	2598
1	20180301004759	2018-01-20	NaN	NaN	华东地区	浙江省	舟山市	三星	Galaxy A50s	6G	128G	5	hICxjenVeM	陈盼妙	31	NaN	女	13820844520	1869	9345
2	20180301004760	2019-06-15	NaN	NaN	西北地区	甘肃省	白银市	小米	红米K30 Pro	8G	256G	3	RSXOFBOwki	张浩	18	NaN	男	15931162888	3999	11997
3	20180301004761	2019-01-07	NaN	NaN	中南地区	河南省	许昌市	小米	红米Note8	8G	128G	6	OtUMUlCBuK	辛倩	31	NaN	女	13084447501	1518	9108
4	20180301004762	2019-05-21	NaN	NaN	直辖市	北京市	北京市	vivo	New 3S	6G	128G	4	eikoQvIyUR	徐旭	33	NaN	女	13226875372	5298	21192

In [82]:

data.shape

Out[82]:

(41800, 20)

In [83]:

# Missing-value handling
data.isnull().any(axis=0) # the 年, 月 and 年龄段 columns contain missing data

Out[83]:

订单号     False
订单日期    False
年        True
月        True
地区名字    False
省份名字    False
城市名字    False
品牌      False
型号      False
运行内存    False
机身内存    False
数量      False
用户名     False
用户姓名    False
年龄      False
年龄段      True
性别      False
手机号     False
价格      False
销售额     False
dtype: bool

In [85]:

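# Check the fraction of missing values in each column
for col in data.columns:
    if data[col].isnull().sum() > 0:  # true when column col contains nulls
        # Compute the fraction of nulls in col
        null_count = data[col].isnull().sum()  # number of nulls
        null_rate = null_count / data[col].size
        print(col,null_rate)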

年 1.0
月 1.0
年龄段 1.0
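The loop above can be condensed, since the mean of a boolean column equals the fraction of True values. A minimal sketch on a made-up frame (toy and its values are assumptions for illustration):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({'年': [np.nan, np.nan],
                    '数量': [2, 5],
                    '年龄段': [np.nan, '16-25']})

# The mean of isnull() per column is that column's missing ratio
null_rate = toy.isnull().mean()

# Keep only the columns that actually contain missing values
print(null_rate[null_rate > 0].to_dict())  # {'年': 1.0, '年龄段': 0.5}
```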

In [91]:

# Extract the year and month from the order date and assign them to the 年 and 月 columns
data['年'] = data['订单日期'].dt.year

In [109]:

def get_month(d):
    d = str(d)
    year = d.split('-')[0]
    month = d.split('-')[1]
    return year + '-' + month
       
data['月'] = data['订单日期'].map(get_month)
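Since the order-date column is already a datetime column (the .dt.year call relies on that), the year-month string can also be produced with the .dt accessor instead of a custom function. A minimal sketch on a made-up frame (orders is an illustrative name):

```python
import pandas as pd

orders = pd.DataFrame(
    {'订单日期': pd.to_datetime(['2020-01-14', '2018-01-20', '2019-06-15'])})

# The .dt accessor formats every timestamp in one vectorized call
orders['月'] = orders['订单日期'].dt.strftime('%Y-%m')

print(orders['月'].tolist())  # ['2020-01', '2018-01', '2019-06']
```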

In [112]:

# Fill the nulls in the age-group column (data binning)
data['年龄'].describe() # ages range from 16 to 49
# Define the age groups by hand: 16-25, 26-35, 36-49

Out[112]:

count    41800.000000
mean        25.508565
std          6.315559
min         16.000000
25%         20.000000
50%         21.000000
75%         31.000000
max         49.000000
Name: 年龄, dtype: float64

In [115]:

data['年龄段'] = pd.cut(data['年龄'],bins=[16,25,35,49])
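# bins lists the bin edges; the resulting intervals are (16, 25], (25, 35], (35, 49].
# They are open on the left, so age 16 itself falls in no bin and becomes NaN.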
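By default pd.cut builds intervals that are closed on the right only, so a value equal to the lowest edge is left unbinned. If the minimum age should be included, include_lowest=True is one option, and labels can give the bins readable names. A sketch on a made-up Series (ages and the label strings are assumptions for illustration):

```python
import pandas as pd

ages = pd.Series([16, 20, 25, 26, 49])

# Default: intervals (16, 25], (25, 35], (35, 49]; the value 16 falls outside
default_bins = pd.cut(ages, bins=[16, 25, 35, 49])
print(default_bins.isna().tolist())  # [True, False, False, False, False]

# include_lowest=True closes the first interval on the left;
# labels replaces the interval notation with readable names
labeled = pd.cut(ages, bins=[16, 25, 35, 49], include_lowest=True,
                 labels=['16-25', '26-35', '36-49'])
print(labeled.tolist())  # ['16-25', '16-25', '16-25', '26-35', '36-49']
```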

In [124]:

# Total units sold and total revenue per brand, sorted by units sold in descending order
ret = data.groupby(by='品牌')[['数量','销售额']].sum().rename(
    columns={'数量':'累计销量','销售额':'累计销售额'})
# Sort by total units sold, descending
ret.sort_values(by='累计销量',ascending=False)

Out[124]:

	累计销量	累计销售额
品牌
vivo	20601	60274031
小米	17889	41897903
iphone	14954	80227880
华为	14623	48727562
三星	13551	64473019
中兴	12981	20781321
魅族	12532	21812491
oppo	12454	51575446
荣耀	12270	22397210
联想	8592	10439004
一加	6106	25747594

In [130]:

# Units sold per month: which months sold the most?
ret = data.groupby(by='月')['数量'].sum().sort_values(ascending=False)
ret.reset_index() # reset_index() turns the Series into a DataFrame for display

Out[130]:

	月	数量
0	2020-03	5647
1	2018-12	5643
2	2018-01	5613
3	2019-07	5604
4	2018-08	5574
5	2019-05	5559
6	2019-01	5542
7	2018-11	5526
8	2019-10	5491
9	2019-03	5488
10	2018-06	5478
11	2019-04	5471
12	2018-05	5467
13	2018-03	5447
14	2019-11	5434
15	2018-09	5425
16	2019-12	5417
17	2018-04	5392
18	2018-07	5383
19	2020-02	5379
20	2018-10	5342
21	2019-08	5310
22	2020-01	5265
23	2019-06	5255
24	2019-09	5219
25	2018-02	5193
26	2019-02	4989

In [132]:

# Number of orders in each age group, as a rough measure of purchasing power
data.groupby(by='年龄段')['订单号'].count()

Out[132]:

年龄段
(16, 25]    21229
(25, 35]    19595
(35, 49]      462
Name: 订单号, dtype: int64

In [138]:

# Number of orders in each city, as a rough measure of purchasing power
data.groupby(by='城市名字')['订单号'].count().sort_values(ascending=False).reset_index()

Out[138]:

	城市名字	订单号
0	上海市	6326
1	北京市	6301
2	张家口市	259
3	秦皇岛市	249
4	石家庄市	248
...	...	...
287	潍坊市	69
288	湛江市	69
289	兰州市	68
290	漳州市	66
291	许昌市	65

292 rows × 2 columns

In [141]:

# Highest and lowest price for each model of each brand
data.groupby(by=['品牌','型号'])['价格'].agg(['max','min'])

Out[141]:

		max	min
品牌	型号
iphone	iPhone 11	5999	5999
	iPhone 11 Pro	9999	9999
	iPhone 11 Pro Max	10899	10899
	iPhone 6s Plus	1198	1198
	iPhone 7	2899	2899
...	...	...	...
魅族	魅族16Xs	1499	1499
	魅族16sPro	2959	2959
	魅族16spro	2959	2959
	魅族16th	1988	1988
	魅族Note9 s	1026	1026

114 rows × 2 columns

In [175]:

data.loc[[0,3],'品牌'] # brand of the rows labeled 0 and 3

Out[175]:

0    荣耀
3    小米
Name: 品牌, dtype: object

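Analysis of US Election Political Contributions:

Load the data
Inspect the basic information about the data
Select a subset of the data: keep the following fields and discard the rest
- cand_nm: candidate name
- contbr_nm: contributor name
- contbr_st: contributor's state
- contbr_employer: contributor's employer
- contbr_occupation: contributor's occupation
- contb_receipt_amt: contribution amount (USD)
- contb_receipt_dt: contribution date
Get an overview of the new data and check for missing values
Summarize the numeric columns with descriptive statistics.
Handle nulls. Some fields are empty, perhaps because the contributor forgot to fill them in or chose to keep them private; fill them with NOT PROVIDE
Handle outliers. Delete the rows with a contribution amount <= 0
Add a new column, party, holding each candidate's party
List the distinct values in the party column
Count how often each value occurs in the party column
Total contributions (contb_receipt_amt) received by each party
Total contributions (contb_receipt_amt) received by each party on each day
Convert the dates in the table to the 'yyyy-mm-dd' format.
Find out whom DISABLED VETERAN contributors (by occupation) mainly support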

In [142]:

# Load the data: usa_election.txt
df = pd.read_csv('../data/usa_election.txt').drop(columns='Unnamed: 0')
df.head()

Out[142]:

	cand_nm	contbr_nm	contbr_st	contbr_employer	contbr_occupation	contb_receipt_amt	contb_receipt_dt
0	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	250.0	20-JUN-11
1	Bachmann, Michelle	HARVEY, WILLIAM	AL	RETIRED	RETIRED	50.0	23-JUN-11
2	Bachmann, Michelle	SMITH, LANIER	AL	INFORMATION REQUESTED	INFORMATION REQUESTED	250.0	05-JUL-11
3	Bachmann, Michelle	BLEVINS, DARONDA	AR	NONE	RETIRED	250.0	01-AUG-11
4	Bachmann, Michelle	WARDENBURG, HAROLD	AR	NONE	RETIRED	300.0	20-JUN-11

In [143]:

# Get an overview of the new data and check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 7 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   cand_nm            536041 non-null  object 
 1   contbr_nm          536041 non-null  object 
 2   contbr_st          536040 non-null  object 
 3   contbr_employer    525088 non-null  object 
 4   contbr_occupation  530520 non-null  object 
 5   contb_receipt_amt  536041 non-null  float64
 6   contb_receipt_dt   536041 non-null  object 
dtypes: float64(1), object(6)
memory usage: 28.6+ MB
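The info() output shows missing values in contbr_st, contbr_employer and contbr_occupation. The task list asks for them to be filled with NOT PROVIDE, a step for which no cell appears here. A minimal sketch of how that could be done, on a made-up frame (donations and its rows are assumptions for illustration):

```python
import pandas as pd
import numpy as np

donations = pd.DataFrame({
    'contbr_employer': ['RETIRED', np.nan],
    'contbr_occupation': [np.nan, 'RETIRED'],
    'contb_receipt_amt': [250.0, 50.0],
})

# Passing a scalar to fillna() replaces every missing value in the frame
donations = donations.fillna('NOT PROVIDE')

print(donations.isnull().sum().sum())       # 0
print(donations.loc[1, 'contbr_employer'])  # NOT PROVIDE
```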

In [144]:

# Summarize the numeric columns with descriptive statistics
df.describe()

Out[144]:

	contb_receipt_amt
count	5.360410e+05
mean	3.750373e+02
std	3.564436e+03
min	-3.080000e+04
25%	5.000000e+01
50%	1.000000e+02
75%	2.500000e+02
max	1.944042e+06

In [149]:

# Delete the rows whose contribution amount is less than 0
ex = df['contb_receipt_amt'] < 0
indexs = df.loc[ex].index  # row labels of the abnormal rows
# Drop the abnormal rows
df.drop(index=indexs,inplace=True)

In [154]:

# Add a new column, party, holding each candidate's party
parties = {
  'Bachmann, Michelle': 'Republican',
  'Romney, Mitt': 'Republican',
  'Obama, Barack': 'Democrat',
  "Roemer, Charles E. 'Buddy' III": 'Reform',
  'Pawlenty, Timothy': 'Republican',
  'Johnson, Gary Earl': 'Libertarian',
  'Paul, Ron': 'Republican',
  'Santorum, Rick': 'Republican',
  'Cain, Herman': 'Republican',
  'Gingrich, Newt': 'Republican',
  'McCotter, Thaddeus G': 'Republican',
  'Huntsman, Jon': 'Republican',
  'Perry, Rick': 'Republican'           
 }
df['party'] = df['cand_nm'].map(parties)

In [156]:

# List the distinct values in the party column
df['party'].unique()

Out[156]:

array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)

In [159]:

# Count how often each value occurs in the party column
df['party'].value_counts()

Out[159]:

Democrat       290003
Republican     234300
Reform           5313
Libertarian       702
Name: party, dtype: int64

In [161]:

# Total contributions (contb_receipt_amt) received by each party
df.groupby(by='party')['contb_receipt_amt'].sum().sort_values(ascending=False)

Out[161]:

party
Republican     1.251181e+08
Democrat       8.259441e+07
Libertarian    4.132769e+05
Reform         3.429658e+05
Name: contb_receipt_amt, dtype: float64

In [163]:

# Total contributions (contb_receipt_amt) received by each party on each day
df.groupby(by=['contb_receipt_dt','party'])['contb_receipt_amt'].sum()

Out[163]:

contb_receipt_dt  party      
01-APR-11         Reform             50.00
                  Republican      12635.00
01-AUG-11         Democrat       182198.00
                  Libertarian      1000.00
                  Reform           1847.00
                                   ...    
31-MAY-11         Republican     313839.80
31-OCT-11         Democrat       216971.87
                  Libertarian      4250.00
                  Reform           3205.00
                  Republican     751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64
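A two-level result like this is often easier to read as a wide table, with one row per day and one column per party. unstack is one way to get there. A minimal sketch on a made-up frame (toy, daily and wide are illustrative names, and the three rows reuse figures from the output above purely as sample data):

```python
import pandas as pd

toy = pd.DataFrame({
    'contb_receipt_dt': ['01-APR-11', '01-APR-11', '01-AUG-11'],
    'party': ['Reform', 'Republican', 'Democrat'],
    'contb_receipt_amt': [50.0, 12635.0, 182198.0],
})

daily = toy.groupby(by=['contb_receipt_dt', 'party'])['contb_receipt_amt'].sum()

# unstack() moves the inner index level (party) into columns;
# date/party pairs with no rows would be NaN, filled here with 0
wide = daily.unstack(level='party', fill_value=0)

print(wide.loc['01-APR-11', 'Republican'])  # 12635.0
print(wide.loc['01-AUG-11', 'Reform'])      # 0.0
```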

In [167]:

months = {'JAN' : 1, 'FEB' : 2, 'MAR' : 3, 'APR' : 4, 'MAY' : 5, 'JUN' : 6,
          'JUL' : 7, 'AUG' : 8, 'SEP' : 9, 'OCT': 10, 'NOV': 11, 'DEC' : 12}
# Convert the dates in the table to the 'yyyy-mm-dd' format.
def transform_date(d):
    day,month,year = d.split('-')
    month = months[month]
    # Zero-pad the month so that e.g. June becomes '06' rather than '6'
    return '20'+year + '-' + str(month).zfill(2) + '-' + day
df['contb_receipt_dt'] = df['contb_receipt_dt'].map(transform_date)
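As an alternative to a hand-written month table, pd.to_datetime can parse the original 'dd-MON-yy' strings directly when given a format. This is a sketch under the assumption that the process runs in an English locale, since %b matches locale-specific month abbreviations; raw and parsed are illustrative names:

```python
import pandas as pd

raw = pd.Series(['20-JUN-11', '05-JUL-11', '01-AUG-11'])

# %d = day, %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(raw, format='%d-%b-%y')

print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2011-06-20', '2011-07-05', '2011-08-01']
```

Keeping the column as real datetimes (parsed) rather than strings also makes later sorting and resampling by date straightforward.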

In [171]:

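# Find out whom DISABLED VETERAN contributors (by occupation) mainly support
# First select the rows whose occupation is DISABLED VETERAN
ex = df['contbr_occupation'] == 'DISABLED VETERAN'
old_bing_df = df.loc[ex] # rows contributed by disabled veterans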

In [172]:

old_bing_df.groupby(by='cand_nm')['contb_receipt_amt'].sum()

Out[172]:

cand_nm
Cain, Herman       300.00
Obama, Barack     4205.00
Paul, Ron         2425.49
Santorum, Rick     250.00
Name: contb_receipt_amt, dtype: float64

In [173]:

old_bing_df.groupby(by='cand_nm').size() # size() returns the number of rows in each group

Out[173]:

cand_nm
Cain, Herman       3
Obama, Barack     32
Paul, Ron         22
Santorum, Rick     3
dtype: int64
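The total amount and the number of contributions can also be computed in one call by passing a list of aggregations to agg. A minimal sketch on a made-up frame (veterans and its rows are assumptions for illustration, not the real data):

```python
import pandas as pd

veterans = pd.DataFrame({
    'cand_nm': ['Paul, Ron', 'Obama, Barack', 'Paul, Ron'],
    'contb_receipt_amt': [100.0, 50.0, 25.0],
})

# agg() with a list computes several statistics per group in one pass
summary = veterans.groupby(by='cand_nm')['contb_receipt_amt'].agg(['sum', 'count'])

print(summary.loc['Paul, Ron', 'sum'])    # 125.0
print(summary.loc['Paul, Ron', 'count'])  # 2
```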

fuminer

day02: Advanced Data Processing with pandas