pandas notes: ch09 Data Aggregation and Group Operations

Data Aggregation and Group Operations

In [24]:
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
plt.rc('figure', figsize=(10, 6))
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)
In [25]:
pd.options.display.notebook_repr_html = False
In [26]:
%matplotlib inline
 

GroupBy mechanics

In [27]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.random.randn(5),
                'data2' : np.random.randn(5)})
df
Out[27]:
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
4  1.965781  1.246435    a  one
In [5]:
'group the data1 column by the key1 column'
grouped = df['data1'].groupby(df['key1'])
grouped
'grouped is a GroupBy object: it has not computed anything yet, but holds everything needed to apply an operation to each group'
Out[5]:
'grouped is a GroupBy object: it has not computed anything yet, but holds everything needed to apply an operation to each group'
In [6]:
'compute the mean of each group'
grouped.mean()
Out[6]:
key1
a    0.746672
b   -0.537585
Name: data1, dtype: float64
In [7]:
'passing multiple arrays, [df["key1"], df["key2"]], yields a result with a hierarchical index formed from the unique key pairs'
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
Out[7]:
key1  key2
a     one     0.880536
      two     0.478943
b     one    -0.519439
      two    -0.555730
Name: data1, dtype: float64
In [8]:
'unstack the hierarchical index'
means.unstack()
Out[8]:
key2       one       two
key1                    
a     0.880536  0.478943
b    -0.519439 -0.555730
In [9]:
'group keys need not be columns of the DataFrame; any arrays of the right length will do'
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
Out[9]:
California  2005    0.478943
            2006   -0.519439
Ohio        2005   -0.380219
            2006    1.965781
Name: data1, dtype: float64
In [10]:
'df.groupby() with the column name "key1" (not the column itself) aggregates all numeric columns (non-numeric columns are dropped by default, which is why key2 is absent here)'
df.groupby('key1').mean()
Out[10]:
         data1     data2
key1                    
a     0.746672  0.910916
b    -0.537585  0.525384
In [11]:
'group by two column names'
df.groupby(['key1', 'key2']).mean()
Out[11]:
              data1     data2
key1 key2                    
a    one   0.880536  1.319920
     two   0.478943  0.092908
b    one  -0.519439  0.281746
     two  -0.555730  0.769023
In [12]:
'.size returns a Series of group sizes; note that only (a, one) has two rows, every other group has one'
df.groupby(['key1', 'key2']).size()
Out[12]:
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
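Worth noting alongside size: count excludes missing values, while size does not. A minimal sketch of the difference (the small frame below is hypothetical, not the chapter's df):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value to contrast size and count.
demo = pd.DataFrame({'key': ['a', 'a', 'b'],
                     'val': [1.0, np.nan, 2.0]})

sizes = demo.groupby('key').size()           # rows per group, NaN included
counts = demo.groupby('key')['val'].count()  # non-NA values per group
```

Here group 'a' has size 2 but count 1, because its second val is NaN.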
 

Iterating over groups

In [13]:
"a GroupBy object supports iteration, producing a sequence of 2-tuples (group name, chunk of data)"
for name, group in df.groupby('key1'):
    print(name)
    print(group)
 
a
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
4  1.965781  1.246435    a  one
b
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
In [14]:
"with multiple keys, the first element of each tuple is itself a tuple of key values"
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
 
('a', 'one')
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
4  1.965781  1.246435    a  one
('a', 'two')
      data1     data2 key1 key2
1  0.478943  0.092908    a  two
('b', 'one')
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
('b', 'two')
     data1     data2 key1 key2
3 -0.55573  0.769023    b  two
In [15]:
'you can do whatever you like with these pieces of data; for example, build a dict from them'
pieces = dict(list(df.groupby('key1')))
pieces['b']
Out[15]:
      data1     data2 key1 key2
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
In [23]:
list(df.groupby('key1'))
Out[23]:
[('a',       data1     data2 key1 key2
  0 -0.204708  1.393406    a  one
  1  0.478943  0.092908    a  two
  4  1.965781  1.246435    a  one), ('b',       data1     data2 key1 key2
  2 -0.519439  0.281746    b  one
  3 -0.555730  0.769023    b  two)]
In [18]:
pieces
Out[18]:
{'a':       data1     data2 key1 key2
 0 -0.204708  1.393406    a  one
 1  0.478943  0.092908    a  two
 4  1.965781  1.246435    a  one, 'b':       data1     data2 key1 key2
 2 -0.519439  0.281746    b  one
 3 -0.555730  0.769023    b  two}
In [16]:
df.dtypes
Out[16]:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
In [17]:
'GroupBy groups on axis=0 by default, but it can group on axis=1 too; here the columns are grouped according to df.dtypes'
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
Out[17]:
{dtype('float64'):       data1     data2
 0 -0.204708  1.393406
 1  0.478943  0.092908
 2 -0.519439  0.281746
 3 -0.555730  0.769023
 4  1.965781  1.246435, dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
 

Selecting a column or subset of columns

 
df.groupby('key1')['data1']  
df.groupby('key1')[['data2']]  
are syntactic sugar for
df['data1'].groupby(df['key1'])  
df[['data2']].groupby(df['key1'])
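The difference between the two indexing styles is the return type: a bare column name gives a SeriesGroupBy (Series results), while a one-element list gives a DataFrameGroupBy (DataFrame results). A quick sketch on a frame shaped like df (the data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': np.random.randn(5)})

s = df.groupby('key1')['data1'].mean()    # Series indexed by key1
d = df.groupby('key1')[['data1']].mean()  # one-column DataFrame
```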
In [25]:
df.groupby('key1')['data1'].mean()  
Out[25]:
key1
a    0.746672
b   -0.537585
Name: data1, dtype: float64
In [27]:
'note: [["data1"]] selects with a list, so the result is a DataFrame with one column; "data1" alone, as above, returns a Series'
df.groupby('key1')[['data1']].mean()  
Out[27]:
         data1
key1          
a     0.746672
b    -0.537585
In [28]:
df.groupby(['key1', 'key2'])[['data2']].mean()
Out[28]:
              data2
key1 key2          
a    one   1.319920
     two   0.092908
b    one   0.281746
     two   0.769023
In [29]:
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped
Out[29]:
<pandas.core.groupby.SeriesGroupBy object at 0x05FE6270>
In [30]:
s_grouped.mean()
Out[30]:
key1  key2
a     one     1.319920
      two     0.092908
b     one     0.281746
      two     0.769023
Name: data2, dtype: float64
 

Grouping with dicts and Series

In [17]:
people = DataFrame(np.random.randn(5, 5),
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.ix[2:3, ['b', 'c']] = np.nan # Add a few NA values
people
Out[17]:
               a         b         c         d         e
Joe    -0.622144  0.958238  0.794095 -0.755709  0.652060
Steve   0.343580 -0.097670  0.146899 -1.194148 -0.280604
Wes     0.513673       NaN       NaN  0.573181 -1.392916
Jim    -0.454878  0.476057  0.576800  0.647174  0.428221
Travis -0.073660  0.386551  1.509435  0.614138  0.887819
In [32]:
'suppose you already know a grouping for the columns'
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f' : 'orange'}
In [33]:
'just pass the dict to groupby; keys with no matching column, like "f", are simply ignored'
by_column = people.groupby(mapping, axis=1)
by_column.sum()
Out[33]:
            blue       red
Joe     0.503905  1.063885
Steve   1.297183 -1.553778
Wes    -1.021228 -1.116829
Jim     0.524712  1.770545
Travis -4.230992 -2.405455
In [34]:
'a Series works the same way; think of it as a fixed-size mapping. When a Series is used as a group key, pandas checks that its index is aligned with the axis being grouped'
map_series = Series(mapping)
map_series
Out[34]:
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
In [35]:
people.groupby(map_series, axis=1).count()
Out[35]:
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
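The index-alignment behavior matters because the key Series does not have to be in the same order as the axis being grouped. A minimal axis=0 sketch with made-up labels:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3, 4]},
                     index=['w', 'x', 'y', 'z'])

# Key Series deliberately given in a different order than frame.index;
# pandas aligns it by label before grouping.
keys = pd.Series({'z': 'g1', 'w': 'g1', 'y': 'g2', 'x': 'g2'})

result = frame.groupby(keys)['x'].sum()
```

Rows w and z land in group g1 (sum 5) and rows x and y in g2 (sum 5), despite the scrambled key order.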
 

Grouping with functions

In [36]:
'any function passed as a group key is called once per index value, and its return values become the group names'
people.groupby(len).sum()
Out[36]:
          a         b         c         d         e
3  0.591569 -0.993608  0.798764 -0.791374  2.119639
5  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
In [37]:
'mixing functions with arrays, lists, dicts, or Series is no problem: everything is converted to arrays internally'
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
Out[37]:
              a         b         c         d         e
3 one -0.539741 -1.296221  0.274992 -1.021228 -0.577087
  two  0.124121  0.302614  0.523772  0.000940  1.343810
5 one  0.886429 -2.001637 -0.371843  1.669025 -0.438570
6 two -0.713544 -0.831154 -2.370232 -1.860761 -0.860757
 

Grouping by index levels

In [38]:
'One handy feature of hierarchically indexed datasets is the ability to aggregate by an index level. To do so, pass the level number or name via the level keyword:'
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]], names=['cty', 'tenor'])
hier_df = DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
Out[38]:
cty          US                            JP          
tenor         1         3         5         1         3
0      0.560145 -1.265934  0.119827 -1.063512  0.332883
1     -2.359419 -0.199543 -1.541996 -0.970736 -1.307030
2      0.286350  0.377984 -0.753887  0.331286  1.349742
3      0.069877  0.246674 -0.011862  1.004812  1.327195
In [39]:
hier_df.groupby(level='cty', axis=1).count()
Out[39]:
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
 

Data aggregation

Aggregation refers to any data transformation that produces scalar values from arrays. The preceding examples used several of them, such as mean, count, min, and sum. You are not limited to these methods, however: you can devise your own aggregations and additionally call any method that is defined on the grouped object, such as quantile.

In [40]:
df
Out[40]:
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
4  1.965781  1.246435    a  one
In [41]:
'quantile computes sample quantiles of a Series or of the columns of a DataFrame'
'''Although quantile is not explicitly implemented for GroupBy, it is a Series method and thus available here. Internally, GroupBy efficiently slices the
Series, calls piece.quantile(0.9) on each piece, and assembles the results into the final output.'''
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
Out[41]:
key1
a    1.668413
b   -0.523068
Name: data1, dtype: float64
In [42]:
'to use your own aggregation function, pass it to the aggregate or agg method'
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
Out[42]:
         data1     data2
key1                    
a     2.170488  1.300498
b     0.036292  0.487276
In [43]:
'some methods, such as describe, also work here, even though strictly speaking they are not aggregations'
grouped.describe()
Out[43]:
               data1     data2
key1                          
a    count  3.000000  3.000000
     mean   0.746672  0.910916
     std    1.109736  0.712217
     min   -0.204708  0.092908
     25%    0.137118  0.669671
     50%    0.478943  1.246435
     75%    1.222362  1.319920
     max    1.965781  1.393406
b    count  2.000000  2.000000
     mean  -0.537585  0.525384
     std    0.025662  0.344556
     min   -0.555730  0.281746
     25%   -0.546657  0.403565
     50%   -0.537585  0.525384
     75%   -0.528512  0.647203
     max   -0.519439  0.769023
 

In the section on group-wise operations and transformations below, I will explain in more detail what is going on here.
Note: custom aggregation functions are much slower than the optimized functions in Table 9-1, because of the substantial overhead (function calls, data rearrangement) of constructing the intermediate group data chunks.

 

Table 9-1: Optimized groupby methods

Function     Description
count        Number of non-NA values in the group
sum          Sum of non-NA values
mean         Mean of non-NA values
median       Arithmetic median of non-NA values
std, var     Unbiased (n-1 denominator) standard deviation and variance
min, max     Minimum and maximum of non-NA values
prod         Product of non-NA values
first, last  First and last non-NA values
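The performance note above can be sketched with timeit: the string 'mean' dispatches to the optimized built-in path, while an equivalent lambda runs Python-level code per group. The data below is synthetic and the timings are only illustrative:

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 100 groups, 100k rows.
rng = np.random.default_rng(0)
big = pd.DataFrame({'key': rng.integers(0, 100, 100_000),
                    'val': rng.standard_normal(100_000)})
g = big.groupby('key')['val']

fast = g.mean()                    # optimized built-in
slow = g.agg(lambda x: x.mean())   # generic user-function path

t_fast = timeit.timeit(lambda: g.mean(), number=5)
t_slow = timeit.timeit(lambda: g.agg(lambda x: x.mean()), number=5)
print(f'built-in: {t_fast:.4f}s  lambda: {t_slow:.4f}s')
```

Both produce identical numbers; only the speed differs.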
In [4]:
'this code loads the data used by the more advanced aggregation features in the next section'
tips = pd.read_csv('ch08/tips.csv')
# Add tip percentage of total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]
Out[4]:
   total_bill   tip     sex smoker  day    time  size   tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2  0.059447
1       10.34  1.66    Male     No  Sun  Dinner     3  0.160542
2       21.01  3.50    Male     No  Sun  Dinner     3  0.166587
3       23.68  3.31    Male     No  Sun  Dinner     2  0.139780
4       24.59  3.61  Female     No  Sun  Dinner     4  0.146808
5       25.29  4.71    Male     No  Sun  Dinner     4  0.186240
 

Column-wise and multiple function application

In [35]:
grouped = tips.groupby(['sex', 'smoker'])
In [36]:
grouped_pct = grouped['tip_pct']
'to use a function from Table 9-1 with .agg, pass its name as a string; anything else is treated as a user-defined function:'
grouped_pct.agg('mean')
Out[36]:
sex     smoker
Female  No        0.156921
        Yes       0.182150
Male    No        0.160669
        Yes       0.152771
Name: tip_pct, dtype: float64
In [37]:
'hence the following code fails!'
grouped_pct.agg(mean)
 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-37-abfd7fcf5de9> in <module>()
----> 1 grouped_pct.agg(mean)

NameError: name 'mean' is not defined
In [48]:
'if you pass a list of functions or function names, the columns of the resulting DataFrame are named after the functions. Beware of passing several lambdas: they all share the name <lambda>, so the column names collide'
grouped_pct.agg(['mean', 'std', peak_to_peak])
Out[48]:
                   mean       std  peak_to_peak
sex    smoker                                  
Female No      0.156921  0.036421      0.195876
       Yes     0.182150  0.071595      0.360233
Male   No      0.160669  0.041849      0.220186
       Yes     0.152771  0.090588      0.674707
In [49]:
'if you pass a list of (name, function) tuples, each name is used as the DataFrame column name (think of the list of 2-tuples as an ordered mapping)'
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
Out[49]:
                    foo       bar
sex    smoker                    
Female No      0.156921  0.036421
       Yes     0.182150  0.071595
Male   No      0.160669  0.041849
       Yes     0.152771  0.090588
In [50]:
"when multiple functions are applied to multiple columns (above they were applied to a single column), each function is applied to each column, and the result is a DataFrame with hierarchical columns"
functions = ['count', 'mean', 'max']
result = grouped['tip_pct', 'total_bill'].agg(functions)
result
Out[50]:
              tip_pct                     total_bill                  
                count      mean       max      count       mean    max
sex    smoker                                                         
Female No          54  0.156921  0.252672         54  18.105185  35.83
       Yes         33  0.182150  0.416667         33  17.977879  44.30
Male   No          97  0.160669  0.291990         97  19.791237  48.33
       Yes         60  0.152771  0.710345         60  22.284500  50.81
In [51]:
result['tip_pct']
Out[51]:
               count      mean       max
sex    smoker                           
Female No         54  0.156921  0.252672
       Yes        33  0.182150  0.416667
Male   No         97  0.160669  0.291990
       Yes        60  0.152771  0.710345
In [52]:
'you can likewise pass a list of tuples with custom names'
ftuples = [('Durchschnitt', 'mean'), ('Abweichung', np.var)]
grouped['tip_pct', 'total_bill'].agg(ftuples)
Out[52]:
                   tip_pct              total_bill           
              Durchschnitt Abweichung Durchschnitt Abweichung
sex    smoker                                                
Female No         0.156921   0.001327    18.105185  53.092422
       Yes        0.182150   0.005126    17.977879  84.451517
Male   No         0.160669   0.001751    19.791237  76.152961
       Yes        0.152771   0.008206    22.284500  98.244673
In [53]:
'passing a dict that maps column names to functions applies different functions to different columns'
grouped.agg({'tip' : np.max, 'size' : 'sum'})
Out[53]:
                tip  size
sex    smoker            
Female No       5.2   140
       Yes      6.5    74
Male   No       9.0   263
       Yes     10.0   150
In [54]:
grouped.agg({'tip_pct' : ['min', 'max', 'mean', 'std'],
             'size' : 'sum'})
Out[54]:
                tip_pct                               size
                    min       max      mean       std  sum
sex    smoker                                             
Female No      0.056797  0.252672  0.156921  0.036421  140
       Yes     0.056433  0.416667  0.182150  0.071595   74
Male   No      0.071804  0.291990  0.160669  0.041849  263
       Yes     0.035638  0.710345  0.152771  0.090588  150
 

As these examples show, agg() can aggregate each group with any function you supply, whereas sum(), count(), and the like each perform a single fixed aggregation.

 

Returning aggregated data in "unindexed" form

In [32]:
'as_index=False: the group keys are not used as the result index'
tips.groupby(['sex', 'smoker'], as_index=False).mean()
Out[32]:
      sex smoker  total_bill       tip      size   tip_pct
0  Female     No   18.105185  2.773519  2.592593  0.156921
1  Female    Yes   17.977879  2.931515  2.242424  0.182150
2    Male     No   19.791237  3.113402  2.711340  0.160669
3    Male    Yes   22.284500  3.051167  2.500000  0.152771
 

Note: as_index applies to agg() and to the built-in functions of Table 9-1, while group_keys relates to apply(). The following cells illustrate this:
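For agg and the built-in methods, as_index=False is equivalent to calling reset_index on the indexed result; the following sketch checks this on a small made-up frame:

```python
import pandas as pd

# Made-up miniature of the tips data.
demo = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                     'smoker': ['No', 'Yes', 'No', 'Yes'],
                     'tip': [1.0, 2.0, 3.0, 4.0]})

flat = demo.groupby(['sex', 'smoker'], as_index=False)['tip'].mean()
via_reset = demo.groupby(['sex', 'smoker'])['tip'].mean().reset_index()
```

flat and via_reset are identical frames with sex and smoker as ordinary columns.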

In [7]:
'as_index=False, shown above for the built-in methods, works with agg() as well'
tips.groupby(['sex', 'smoker'], as_index=False).agg('mean')
Out[7]:
      sex smoker  total_bill       tip      size   tip_pct
0  Female     No   18.105185  2.773519  2.592593  0.156921
1  Female    Yes   17.977879  2.931515  2.242424  0.182150
2    Male     No   19.791237  3.113402  2.711340  0.160669
3    Male    Yes   22.284500  3.051167  2.500000  0.152771
In [6]:
'as_index=False also works when a function such as np.mean is passed to agg()'
'with apply, however, group_keys behaves differently, as the next few cells show'
tips.groupby(['sex', 'smoker'], as_index=False).agg(np.mean)
Out[6]:
      sex smoker  total_bill       tip      size   tip_pct
0  Female     No   18.105185  2.773519  2.592593  0.156921
1  Female    Yes   17.977879  2.931515  2.242424  0.182150
2    Male     No   19.791237  3.113402  2.711340  0.160669
3    Male    Yes   22.284500  3.051167  2.500000  0.152771
In [40]:
'as_index=False has no effect with apply()'
tips.groupby(['sex', 'smoker'], as_index=False).apply(np.mean)
Out[40]:
               total_bill       tip      size   tip_pct
sex    smoker                                          
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424  0.182150
Male   No       19.791237  3.113402  2.711340  0.160669
       Yes      22.284500  3.051167  2.500000  0.152771
In [8]:
#### group_keys=False appears to do nothing here: group_keys only controls whether the keys are prepended when apply's function returns objects indexed like the original rows; np.mean reduces each group to one row, so the result is indexed by the group keys regardless
'reset_index(), used a few cells below, flattens the keys into columns'
tips.groupby(['sex', 'smoker'], group_keys=False).apply(np.mean)
Out[8]:
               total_bill       tip      size   tip_pct
sex    smoker                                          
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424  0.182150
Male   No       19.791237  3.113402  2.711340  0.160669
       Yes      22.284500  3.051167  2.500000  0.152771
In [14]:
#### same situation as above: the lambda reduces each group to a single row, so the result is indexed by the group keys whether or not group_keys=False is passed
'reset_index() gives the flat form'
tips.groupby(['sex', 'smoker'], group_keys=False).apply(lambda x: x.mean())
Out[14]:
               total_bill       tip      size   tip_pct
sex    smoker                                          
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424  0.182150
Male   No       19.791237  3.113402  2.711340  0.160669
       Yes      22.284500  3.051167  2.500000  0.152771
In [63]:
'with reset_index the keys become columns'
tips.groupby(['sex', 'smoker'], group_keys=False).apply(np.mean).reset_index()
Out[63]:
      sex smoker  total_bill       tip      size   tip_pct
0  Female     No   18.105185  2.773519  2.592593  0.156921
1  Female    Yes   17.977879  2.931515  2.242424  0.182150
2    Male     No   19.791237  3.113402  2.711340  0.160669
3    Male    Yes   22.284500  3.051167  2.500000  0.152771
In [56]:
'group_keys=False does take effect with apply when the function returns multiple rows per group, as top does (top, defined later in the chapter, returns the n rows with the largest values in a column):'
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
tips.groupby(['sex', 'smoker'], group_keys=False).apply(top)
Out[56]:
     total_bill   tip     sex smoker   day    time  size   tip_pct
115       17.31  3.50  Female     No   Sun  Dinner     2  0.202195
14        14.83  3.02  Female     No   Sun  Dinner     2  0.203641
18        16.97  3.50  Female     No   Sun  Dinner     3  0.206246
139       13.16  2.75  Female     No  Thur   Lunch     2  0.208967
51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
221       13.42  3.48  Female    Yes   Fri   Lunch     2  0.259314
93        16.32  4.30  Female    Yes   Fri  Dinner     2  0.263480
109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
6          8.77  2.00    Male     No   Sun  Dinner     2  0.228050
88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
174       16.82  4.00    Male    Yes   Sun  Dinner     2  0.237812
194       16.58  4.00    Male    Yes  Thur   Lunch     2  0.241255
181       23.33  5.65    Male    Yes   Sun  Dinner     2  0.242177
183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
In [17]:
tips.groupby(['sex', 'smoker'], group_keys=True).apply(top)
Out[17]:
                   total_bill   tip     sex smoker   day    time  size  \
sex    smoker                                                            
Female No     115       17.31  3.50  Female     No   Sun  Dinner     2   
              14        14.83  3.02  Female     No   Sun  Dinner     2   
              18        16.97  3.50  Female     No   Sun  Dinner     3   
              139       13.16  2.75  Female     No  Thur   Lunch     2   
              51        10.29  2.60  Female     No   Sun  Dinner     2   
       Yes    221       13.42  3.48  Female    Yes   Fri   Lunch     2   
              93        16.32  4.30  Female    Yes   Fri  Dinner     2   
              109       14.31  4.00  Female    Yes   Sat  Dinner     2   
              67         3.07  1.00  Female    Yes   Sat  Dinner     1   
              178        9.60  4.00  Female    Yes   Sun  Dinner     2   
Male   No     6          8.77  2.00    Male     No   Sun  Dinner     2   
              88        24.71  5.85    Male     No  Thur   Lunch     2   
              185       20.69  5.00    Male     No   Sun  Dinner     5   
              149        7.51  2.00    Male     No  Thur   Lunch     2   
              232       11.61  3.39    Male     No   Sat  Dinner     2   
       Yes    174       16.82  4.00    Male    Yes   Sun  Dinner     2   
              194       16.58  4.00    Male    Yes  Thur   Lunch     2   
              181       23.33  5.65    Male    Yes   Sun  Dinner     2   
              183       23.17  6.50    Male    Yes   Sun  Dinner     4   
              172        7.25  5.15    Male    Yes   Sun  Dinner     2   

                    tip_pct  
sex    smoker                
Female No     115  0.202195  
              14   0.203641  
              18   0.206246  
              139  0.208967  
              51   0.252672  
       Yes    221  0.259314  
              93   0.263480  
              109  0.279525  
              67   0.325733  
              178  0.416667  
Male   No     6    0.228050  
              88   0.236746  
              185  0.241663  
              149  0.266312  
              232  0.291990  
       Yes    174  0.237812  
              194  0.241255  
              181  0.242177  
              183  0.280535  
              172  0.710345  
In [59]:
'group_keys=False has no effect on agg()'
tips.groupby(['sex','smoker'], group_keys=False).agg('mean')
Out[59]:
               total_bill       tip      size   tip_pct
sex    smoker                                          
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424  0.182150
Male   No       19.791237  3.113402  2.711340  0.160669
       Yes      22.284500  3.051167  2.500000  0.152771
In [60]:
'group_keys=False has no effect on the built-in functions of Table 9-1 either'
tips.groupby(['sex','smoker'], group_keys=False).mean()
Out[60]:
               total_bill       tip      size   tip_pct
sex    smoker                                          
Female No       18.105185  2.773519  2.592593  0.156921
       Yes      17.977879  2.931515  2.242424  0.182150
Male   No       19.791237  3.113402  2.711340  0.160669
       Yes      22.284500  3.051167  2.500000  0.152771
 

Of course, calling reset_index on the result (which turns index levels into columns) produces the same form.
Caution: this usage of groupby is relatively inflexible.

 

Group-wise operations and transformations

Aggregation is only one kind of group operation. It is a special case of data transformation: it accepts functions that reduce a one-dimensional array to a scalar value. In this section I introduce the transform and apply methods, which can perform many other kinds of group operations.

In [56]:
df
Out[56]:
      data1     data2 key1 key2
0 -0.204708  1.393406    a  one
1  0.478943  0.092908    a  two
2 -0.519439  0.281746    b  one
3 -0.555730  0.769023    b  two
4  1.965781  1.246435    a  one
In [57]:
k1_means = df.groupby('key1').mean().add_prefix('mean_')
k1_means
Out[57]:
      mean_data1  mean_data2
key1                        
a       0.746672    0.910916
b      -0.537585    0.525384
In [58]:
'merge: df joins on its key1 column, k1_means on its index'
pd.merge(df, k1_means, left_on='key1', right_index=True)
Out[58]:
      data1     data2 key1 key2  mean_data1  mean_data2
0 -0.204708  1.393406    a  one    0.746672    0.910916
1  0.478943  0.092908    a  two    0.746672    0.910916
4  1.965781  1.246435    a  one    0.746672    0.910916
2 -0.519439  0.281746    b  one   -0.537585    0.525384
3 -0.555730  0.769023    b  two   -0.537585    0.525384
In [18]:
key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()
Out[18]:
            a         b         c         d         e
one -0.060710  0.672395  1.151765  0.143870  0.048987
two -0.055649  0.189193  0.361850 -0.273487  0.073809
In [19]:
'transform applies a function to each group, then places the results in the appropriate locations. If each group produces a scalar value, the value is broadcast across the group.'
people.groupby(key).transform(np.mean)
Out[19]:
               a         b         c         d         e
Joe    -0.060710  0.672395  1.151765  0.143870  0.048987
Steve  -0.055649  0.189193  0.361850 -0.273487  0.073809
Wes    -0.060710  0.672395  1.151765  0.143870  0.048987
Jim    -0.055649  0.189193  0.361850 -0.273487  0.073809
Travis -0.060710  0.672395  1.151765  0.143870  0.048987
In [70]:
'the values above repeat within groups, so agg cannot reproduce this result, even with as_index=False'
'think of it this way: transform groups and then broadcasts back to the original index, while agg only groups; whenever len(group index) < len(original index), the agg result cannot equal the transform result'
people.groupby(key, as_index=False).agg('mean')
Out[70]:
          a         b         c         d         e
0 -0.060710  0.672395  1.151765  0.143870  0.048987
1 -0.055649  0.189193  0.361850 -0.273487  0.073809
In [72]:
'subtract the mean from each group'
def demean(arr):
    return arr - arr.mean()
demeaned = people.groupby(key).transform(demean)
demeaned
Out[72]:
               a         b        c         d         e
Joe    -0.561434  0.285843 -0.35767 -0.899579  0.603072
Steve   0.399229 -0.286864 -0.21495 -0.920661 -0.354413
Wes     0.574384       NaN      NaN  0.429311 -1.441904
Jim    -0.399229  0.286864  0.21495  0.920661  0.354413
Travis -0.012950 -0.285843  0.35767  0.470268  0.838831
In [78]:
# why does apply give the same result here? demean returns a DataFrame per group rather than a single value, so no reduction happens.
'each group is passed to demean, the returned DataFrames are concatenated, and the combined result looks as if no grouping took place'
'functions suitable for apply may either reduce a group (like np.mean) or, like demean here, operate group-wise without reducing'
demeaned = people.groupby(key).apply(demean)
demeaned
Out[78]:
               a         b        c         d         e
Joe    -0.561434  0.285843 -0.35767 -0.899579  0.603072
Steve   0.399229 -0.286864 -0.21495 -0.920661 -0.354413
Wes     0.574384       NaN      NaN  0.429311 -1.441904
Jim    -0.399229  0.286864  0.21495  0.920661  0.354413
Travis -0.012950 -0.285843  0.35767  0.470268  0.838831
In [79]:
# why does this function fail with agg?
'functions passed to agg must reduce each group to a scalar; demean returns an array, so agg raises "function does not reduce"'
demeaned = people.groupby(key).agg(demean)
demeaned
 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in agg_series(self, obj, func)
   1730         try:
-> 1731return self._aggregate_series_fast(obj, func)
   1732         except Exception:

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_series_fast(self, obj, func)
   1749                                     dummy)
-> 1750result, counts = grouper.get_result()
   1751         return result, counts

pandas\src\reduce.pyx in pandas.lib.SeriesGrouper.get_result (pandas\lib.c:39025)()

pandas\src\reduce.pyx in pandas.lib.SeriesGrouper.get_result (pandas\lib.c:38894)()

pandas\src\reduce.pyx in pandas.lib._get_result_array (pandas\lib.c:32414)()

ValueError: function does not reduce

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2437             try:
-> 2438return self._python_agg_general(func_or_funcs, *args, **kwargs)
   2439             except Exception:

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _python_agg_general(self, func, *args, **kwargs)
   1226             try:
-> 1227result, counts = self.grouper.agg_series(obj, f)
   1228                 output[name] = self._try_cast(result, obj)

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in agg_series(self, obj, func)
   1732         except Exception:
-> 1733return self._aggregate_series_pure_python(obj, func)
   1734 

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_series_pure_python(self, obj, func)
   1766                         isinstance(res, list)):
-> 1767raise ValueError('Function does not reduce')
   1768                 result = np.empty(ngroups, dtype='O')

ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in aggregate(self, arg, *args, **kwargs)
   2983                     assert not args and not kwargs
-> 2984result = self._aggregate_multiple_funcs([arg])
   2985                     result.columns = Index(result.columns.levels[0],

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_multiple_funcs(self, arg)
   3009                                      grouper=self.grouper)
-> 3010results.append(colg.aggregate(arg))
   3011                 keys.append(col)

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2427         if hasattr(func_or_funcs, '__iter__'):
-> 2428ret = self._aggregate_multiple_funcs(func_or_funcs)
   2429         else:

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_multiple_funcs(self, arg)
   2476 
-> 2477results[name] = self.aggregate(func)
   2478 

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2439             except Exception:
-> 2440result = self._aggregate_named(func_or_funcs, *args, **kwargs)
   2441 

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_named(self, func, *args, **kwargs)
   2537             if isinstance(output, (Series, Index, np.ndarray)):
-> 2538raise Exception('Must produce aggregated value')
   2539             result[name] = self._try_cast(output, group)

Exception: Must produce aggregated value

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
   2007                 result = np.empty(len(values), dtype=object)
-> 2008result[:] = values
   2009             except ValueError:

ValueError: could not broadcast input array from shape (3,5) into shape (3)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-79-bf643fd97fec> in <module>()
      1 # 这里的函数为什么对 agg 和 apply 不能用?
----> 2demeaned = people.groupby(key).agg(demean)
      3 demeaned

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in agg(self, func, *args, **kwargs)
    727     @Appender(_agg_doc)
    728     def agg(self, func, *args, **kwargs):
--> 729return self.aggregate(func, *args, **kwargs)
    730 
    731     def _iterate_slices(self):

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in aggregate(self, arg, *args, **kwargs)
   2986                                            name=self._selected_obj.columns.name)
   2987                 except:
-> 2988result = self._aggregate_generic(arg, *args, **kwargs)
   2989 
   2990         if not self.as_index:

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _aggregate_generic(self, func, *args, **kwargs)
   3043                     result[name] = data.apply(wrapper, axis=axis)
   3044 
-> 3045return self._wrap_generic_output(result, obj)
   3046 
   3047     def _wrap_aggregated_output(self, output, names=None):

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\groupby.py in _wrap_generic_output(self, result, obj)
   3500             if self.axis == 0:
   3501                 result = DataFrame(result, index=obj.columns,
-> 3502                                    columns=result_index).T
   3503             else:
   3504                 result = DataFrame(result, index=obj.index,

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    224                                  dtype=dtype, copy=copy)
    225         elif isinstance(data, dict):
--> 226mgr = self._init_dict(data, index, columns, dtype=dtype)
    227         elif isinstance(data, ma.MaskedArray):
    228             import numpy.ma.mrecords as mrecords

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\frame.py in _init_dict(self, data, index, columns, dtype)
    361 
    362         return _arrays_to_mgr(arrays, data_names, index, columns,
--> 363                               dtype=dtype)
    364 
    365     def _init_ndarray(self, values, index, columns, dtype=None,

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5161 
   5162     # don't force copy because getting jammed in an ndarray anyway
-> 5163     arrays = _homogenize(arrays, index, dtype)
   5164 
   5165     # from BlockManager perspective

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\frame.py in _homogenize(data, index, dtype)
   5475                 v = lib.fast_multiget(v, oindex.values, default=NA)
   5476             v = _sanitize_array(v, index, dtype=dtype, copy=False,
-> 5477                                 raise_cast_failure=False)
   5478 
   5479         homogenized.append(v)

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   2885             raise Exception('Data must be 1-dimensional')
   2886         else:
-> 2887             subarr = _asarray_tuplesafe(data, dtype=dtype)
   2888 
   2889     # This is to prevent mixed-type Series getting all casted to

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\common.py in _asarray_tuplesafe(values, dtype)
   2009             except ValueError:
   2010                 # we have a list-of-list
-> 2011                 result[:] = [tuple(x) for x in values]
   2012 
   2013     return result

ValueError: cannot copy sequence with size 5 to array axis with dimension 3
In [62]:
demeaned.groupby(key).mean()
Out[62]:
                a             b  c             d  e
one  0.000000e+00 -1.110223e-16  0  7.401487e-17  0
two -2.775558e-17  0.000000e+00  0  0.000000e+00  0
 

apply: General split-apply-combine

Like aggregate, transform is a special function with strict requirements: the passed-in function must produce either a scalar value that can be broadcast (such as np.mean) or a result array of the same size as the group. The most general GroupBy method is apply, which is the subject of the rest of this section. apply splits the object being manipulated into pieces, invokes the passed-in function on each piece, and then attempts to concatenate the pieces back together.
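
The split-apply-combine mechanics can be sketched with a tiny self-contained example (the DataFrame here is invented for illustration):

```python
import pandas as pd

# Invented toy data: apply splits the Series by key, runs the
# function on each group, and glues the per-group results together.
df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 30]})

result = df.groupby('key')['val'].apply(lambda g: g.max())
print(result)  # one value per group key: a -> 2, b -> 30
```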

In [16]:
'Select the rows with the largest values (default top 5) in the tip_pct column.'
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips, n=6)
Out[16]:
     total_bill   tip     sex smoker  day    time  size   tip_pct
109       14.31  4.00  Female    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Male    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39    Male     No  Sat  Dinner     2  0.291990
67         3.07  1.00  Female    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00  Female    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Male    Yes  Sun  Dinner     2  0.710345
In [70]:
'Group by smoker and call apply with this function. Note: each group sub-DataFrame is passed to top as its first positional argument (the df parameter here)'
tips.groupby('smoker').apply(top)
'How it works: top is called on each piece of the DataFrame, then the results are glued together with pandas.concat and labeled with the group names, so the result has a hierarchical index.'
Out[70]:
            total_bill   tip     sex smoker   day    time  size   tip_pct
smoker                                                                   
No     88        24.71  5.85    Male     No  Thur   Lunch     2  0.236746
       185       20.69  5.00    Male     No   Sun  Dinner     5  0.241663
       51        10.29  2.60  Female     No   Sun  Dinner     2  0.252672
       149        7.51  2.00    Male     No  Thur   Lunch     2  0.266312
       232       11.61  3.39    Male     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00  Female    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Male    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00  Female    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00  Female    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Male    Yes   Sun  Dinner     2  0.710345
In [71]:
'Other arguments and keywords for the top function can also be passed through apply'
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
'Beyond these basic usage mechanics, how much mileage you get out of apply is up to your creativity: the passed-in function is entirely up to you; it only needs to return a pandas object or a scalar value.'
Out[71]:
                 total_bill    tip     sex smoker   day    time  size  \
smoker day                                                              
No     Fri  94        22.75   3.25  Female     No   Fri  Dinner     2   
       Sat  212       48.33   9.00    Male     No   Sat  Dinner     4   
       Sun  156       48.17   5.00    Male     No   Sun  Dinner     6   
       Thur 142       41.19   5.00    Male     No  Thur   Lunch     5   
Yes    Fri  95        40.17   4.73    Male    Yes   Fri  Dinner     4   
       Sat  170       50.81  10.00    Male    Yes   Sat  Dinner     3   
       Sun  182       45.35   3.50    Male    Yes   Sun  Dinner     3   
       Thur 197       43.11   5.00  Female    Yes  Thur   Lunch     4   

                  tip_pct  
smoker day                 
No     Fri  94   0.142857  
       Sat  212  0.186220  
       Sun  156  0.103799  
       Thur 142  0.121389  
Yes    Fri  95   0.117750  
       Sat  170  0.196812  
       Sun  182  0.077178  
       Thur 197  0.115982  
In [72]:
result = tips.groupby('smoker')['tip_pct'].describe()
result
Out[72]:
smoker       
No      count    151.000000
        mean       0.159328
        std        0.039910
        min        0.056797
        25%        0.136906
        50%        0.155625
        75%        0.185014
        max        0.291990
Yes     count     93.000000
        mean       0.163196
        std        0.085119
        min        0.035638
        25%        0.106771
        50%        0.153846
        75%        0.195059
        max        0.710345
dtype: float64
In [73]:
result.unstack('smoker')
Out[73]:
smoker          No        Yes
count   151.000000  93.000000
mean      0.159328   0.163196
std       0.039910   0.085119
min       0.056797   0.035638
25%       0.136906   0.106771
50%       0.155625   0.153846
75%       0.185014   0.195059
max       0.291990   0.710345
 

Inside GroupBy, calling a method such as describe is really just a shortcut for the following two lines:

f = lambda x: x.describe()
grouped.apply(f)
 

Suppressing the group keys

In [51]:
'group_keys=False prevents the passed group keys (smoker, sex) from forming a hierarchical index together with the index of the original object'
tips.groupby(['smoker','sex'], group_keys=False).apply(top)
Out[51]:
     total_bill  tip  sex  smoker  day  time  size  tip_pct
115 17.31 3.50 Female No Sun Dinner 2 0.202195
14 14.83 3.02 Female No Sun Dinner 2 0.203641
18 16.97 3.50 Female No Sun Dinner 3 0.206246
139 13.16 2.75 Female No Thur Lunch 2 0.208967
51 10.29 2.60 Female No Sun Dinner 2 0.252672
6 8.77 2.00 Male No Sun Dinner 2 0.228050
88 24.71 5.85 Male No Thur Lunch 2 0.236746
185 20.69 5.00 Male No Sun Dinner 5 0.241663
149 7.51 2.00 Male No Thur Lunch 2 0.266312
232 11.61 3.39 Male No Sat Dinner 2 0.291990
221 13.42 3.48 Female Yes Fri Lunch 2 0.259314
93 16.32 4.30 Female Yes Fri Dinner 2 0.263480
109 14.31 4.00 Female Yes Sat Dinner 2 0.279525
67 3.07 1.00 Female Yes Sat Dinner 1 0.325733
178 9.60 4.00 Female Yes Sun Dinner 2 0.416667
174 16.82 4.00 Male Yes Sun Dinner 2 0.237812
194 16.58 4.00 Male Yes Thur Lunch 2 0.241255
181 23.33 5.65 Male Yes Sun Dinner 2 0.242177
183 23.17 6.50 Male Yes Sun Dinner 4 0.280535
172 7.25 5.15 Male Yes Sun Dinner 2 0.710345
 

How aggregate (agg), transform, and apply relate to one another

  1. aggregate (or agg):
    • Pass a single function: grouped[[col1, col2]].agg(func)
    • Pass a list of functions: grouped[[col1, col2]].agg([func1, func2])
    • Pass a list of (name, func) tuples: grouped[[col1, col2]].agg([(name1, func1), (name2, func2)]): name1 labels the output instead of func1
    • Pass a dict: grouped.agg({col1: [func1, func2], col2: func3}): applies different functions to different columns; note that the column names are the dict keys
    • The built-in functions of Table 9-1 must be passed by name as strings; anything else is treated as a user-defined function
    • Functions of transform's second kind (same-size output with no aggregation effect) are not supported, since agg must reduce each group.
  2. transform:
    • The passed-in function may aggregate, e.g. np.mean; the result is then agg(np.mean) broadcast back to the original shape.
    • It may also operate group-wise without aggregating, e.g. lambda x: x - x.mean(), returning a DataFrame of the original size computed group by group.
  3. apply:
    • apply(func) runs func on each group and then concatenates the results, so it supports both kinds of functions.
    • apply(func, *args, **kwargs): e.g. apply(top, n=1, column='total_bill'), where the trailing arguments are forwarded to top.
    • Passing lambda x: {'min': x.min(), 'max': x.max(), 'count': x.count(), 'mean': x.mean()} returns four rows of statistics per group.

Summary: transform always broadcasts back to the original shape, agg always reduces, and apply can do either (it accepts transform's second kind of function).
apply is the fully general method; agg additionally supports GroupBy's optimized built-in functions.
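
The contrast among the three methods can be sketched on a toy DataFrame (the data here is invented for illustration):

```python
import pandas as pd

# Invented toy data to contrast agg, transform, and apply-style usage.
df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 3.0, 10.0, 30.0]})
g = df.groupby('key')['val']

agg_out = g.agg('mean')            # reduces: one value per group
trans_out = g.transform('mean')    # broadcasts each group mean back to 4 rows
demeaned = g.transform(lambda x: x - x.mean())  # group-wise, original size
```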

 

Quantile and bucket analysis

pandas has tools, such as cut and qcut, for slicing data into buckets with bins of your choosing or by sample quantiles. Combining these functions with groupby makes it easy to perform bucket or quantile analysis on a dataset.

In [9]:
frame = DataFrame({'data1': np.random.randn(1000),
                   'data2': np.random.randn(1000)})
factor = pd.cut(frame.data1, 4)
factor[:10]
Out[9]:
0    (-1.23, 0.489]
1    (-1.23, 0.489]
2    (-1.23, 0.489]
3    (-1.23, 0.489]
4    (0.489, 2.208]
5    (0.489, 2.208]
6    (-1.23, 0.489]
7    (-1.23, 0.489]
8    (0.489, 2.208]
9    (0.489, 2.208]
Name: data1, dtype: category
Categories (4, object): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
In [15]:
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()

# ADAPT: unlike in the book, the output here is sorted by bucket (the first two lines are swapped relative to the book)
Out[15]:
                 count       max      mean       min
data1                                               
(-2.956, -1.23]     95  1.670835 -0.060603 -3.399312
(-1.23, 0.489]     595  3.260383 -0.010543 -2.989741
(0.489, 2.208]     299  2.954439  0.092054 -3.745356
(2.208, 3.928]      11  1.765640  0.030607 -1.929776
In [18]:
grouped.apply(get_stats)
Out[18]:
data1                 
(-2.956, -1.23]  count     95.000000
                 max        1.670835
                 mean      -0.060603
                 min       -3.399312
(-1.23, 0.489]   count    595.000000
                 max        3.260383
                 mean      -0.010543
                 min       -2.989741
(0.489, 2.208]   count    299.000000
                 max        2.954439
                 mean       0.092054
                 min       -3.745356
(2.208, 3.928]   count     11.000000
                 max        1.765640
                 mean       0.030607
                 min       -1.929776
dtype: float64
In [11]:
# labels=False returns quantile bucket numbers instead of interval labels
grouping = pd.qcut(frame.data1, 10, labels=False)

grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
Out[11]:
       count       max      mean       min
data1                                     
0        100  1.670835 -0.069930 -3.399312
1        100  2.628441  0.043954 -1.801179
2        100  2.527939 -0.090777 -2.925113
3        100  3.260383  0.080316 -2.315555
4        100  2.074345 -0.126442 -2.041696
5        100  2.184810  0.009929 -2.989741
6        100  2.458842 -0.040484 -2.084231
7        100  2.954439  0.049913 -3.056990
8        100  2.735527  0.108133 -3.745356
9        100  2.377020  0.193693 -2.064111
 

Example: Filling missing values with group-specific values

Here we fill NAs with group means.
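
As an aside, the same group-wise fill can also be expressed with transform, which broadcasts each group's mean back to the original shape (a minimal sketch with invented data):

```python
import pandas as pd
import numpy as np

# Invented data: each NA is filled with the mean of its own group.
s = pd.Series([1.0, np.nan, 3.0, np.nan],
              index=['e1', 'e2', 'w1', 'w2'])
key = ['East', 'East', 'West', 'West']

filled = s.fillna(s.groupby(key).transform('mean'))
# East NAs become 1.0, West NAs become 3.0
```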

In [19]:
'For the extended slice notation list[i:j:k], see my Python notes'
s = Series(np.random.randn(6))
s[::2] = np.nan
s
Out[19]:
0         NaN
1    1.847446
2         NaN
3   -0.302450
4         NaN
5    1.212354
dtype: float64
In [20]:
'The simple case: fill with the overall mean'
s.fillna(s.mean())
Out[20]:
0    0.919117
1    1.847446
2    0.919117
3   -0.302450
4    0.919117
5    1.212354
dtype: float64
In [21]:
'To fill with a value that varies by group, group the data and use apply with a function that calls fillna on each data chunk.'
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Out[21]:
Ohio         -1.613474
New York     -0.573966
Vermont            NaN
Florida       1.257544
Oregon       -1.065343
Nevada             NaN
California    1.680135
Idaho              NaN
dtype: float64
In [22]:
data.groupby(group_key).mean()
Out[22]:
East   -0.309965
West    0.307396
dtype: float64
In [23]:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Out[23]:
Ohio         -1.613474
New York     -0.573966
Vermont      -0.309965
Florida       1.257544
Oregon       -1.065343
Nevada        0.307396
California    1.680135
Idaho         0.307396
dtype: float64
In [24]:
'Passing a dict keyed by group name lets you fill each group with a predefined value'
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])

data.groupby(group_key).apply(fill_func)
Out[24]:
Ohio         -1.613474
New York     -0.573966
Vermont       0.500000
Florida       1.257544
Oregon       -1.065343
Nevada       -1.000000
California    1.680135
Idaho        -1.000000
dtype: float64
 

Example: Random sampling and permutation

In [3]:
# Hearts (H), Spades (S), Clubs (C), Diamonds (D)
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)

deck = Series(card_val, index=cards)
In [4]:
deck[:13]
Out[4]:
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
In [6]:
def draw(deck, n=5):
    return deck.take(np.random.permutation(len(deck))[:n])
draw(deck)
Out[6]:
3S    3
6C    6
4S    4
7S    7
8S    8
dtype: int64
In [7]:
'Draw two random cards from each suit.'
get_suit = lambda card: card[-1] # note: card here is each index label (e.g. 'KH'), not the cards list above; see the usage below
deck.groupby(get_suit).apply(draw, n=2)
Out[7]:
C  QC    10
   4C     4
D  QD    10
   9D     9
H  JH    10
   9H     9
S  3S     3
   JS    10
dtype: int64
In [8]:
get_suit = lambda x: x[-1] # by default (axis=0), the grouping function receives each index label; see grouping via functions earlier
deck.groupby(get_suit).apply(draw, n=2)
Out[8]:
C  JC     10
   4C      4
D  JD     10
   4D      4
H  10H    10
   KH     10
S  7S      7
   9S      9
dtype: int64
In [31]:
# alternatively
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
Out[31]:
QC    10
8C     8
KD    10
JD    10
9H     9
4H     4
2S     2
QS    10
dtype: int64
 

Example: Group weighted average and correlation

In [20]:
df = DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                'data': np.random.randn(8),
                'weights': np.random.rand(8)})
df
Out[20]:
  category      data   weights
0        a  0.274992  0.994015
1        a  0.228913  0.676874
2        a  1.352917  0.790823
3        a  0.886429  0.170914
4        b -2.001637  0.026849
5        b -0.371843  0.800370
6        b  1.669025  0.903723
7        b -0.438570  0.024676
In [21]:
'First group by category, then compute the weighted average of data within each group using the weights column'
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg) # a = sum(df.data[:4] * df.weights[:4])/ sum(df.weights[:4])
Out[21]:
category
a    0.626641
b    0.652852
dtype: float64
 

The example above is fairly trivial, so here is a slightly more realistic one.

In [25]:
close_px = pd.read_csv('ch09/stock_px.csv', parse_dates=True, index_col=0)
close_px.info()
 
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
AAPL    2214 non-null float64
MSFT    2214 non-null float64
XOM     2214 non-null float64
SPX     2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
In [26]:
close_px[-4:]
Out[26]:
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
In [27]:
rets = close_px.pct_change().dropna() # percent changes, with NAs dropped
spx_corr = lambda x: x.corrwith(x['SPX']) # corrwith returns a Series
by_year = rets.groupby(lambda x: x.year) # group by year
by_year.apply(spx_corr)
Out[27]:
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265    1
2004  0.374283  0.588531  0.557742    1
2005  0.467540  0.562374  0.631010    1
2006  0.428267  0.406126  0.518514    1
2007  0.508118  0.658770  0.786264    1
2008  0.681434  0.804626  0.828303    1
2009  0.707103  0.654902  0.797921    1
2010  0.710105  0.730118  0.839057    1
2011  0.691931  0.800996  0.859975    1
In [28]:
# Annual correlation of Apple with Microsoft
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
Out[28]:
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
 

Example: Group-wise linear regression

Continuing with the previous example, you can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or a scalar value.

In [29]:
'Define a regress function that uses the statsmodels library to run an ordinary least squares (OLS) regression on each chunk of data'
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars].copy()  # copy so the intercept column is not written into a view of data
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
In [30]:
by_year.apply(regress, 'AAPL', ['SPX'])
Out[30]:
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
 

Pivot tables and cross-tabulation

A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns.
Pivot tables in Python with pandas are made possible through the groupby facility described in this chapter combined with reshape operations utilizing hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
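
A minimal self-contained sketch of pivot_table (the toy data here is invented; the tips dataset used below comes from the book's examples):

```python
import pandas as pd

# Invented toy data: mean of 'tip' with sex on the rows and smoker
# status on the columns ('mean' is pivot_table's default aggfunc).
df = pd.DataFrame({'sex': ['F', 'F', 'M', 'M'],
                   'smoker': ['No', 'Yes', 'No', 'Yes'],
                   'tip': [1.0, 2.0, 3.0, 4.0]})

table = df.pivot_table('tip', index='sex', columns='smoker')
```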

In [33]:
'Group means (the default pivot_table aggregation) by sex and smoker; the non-numeric day column is skipped'
tips.pivot_table(index=['sex', 'smoker'])
Out[33]:
                   size       tip   tip_pct  total_bill
sex    smoker                                          
Female No      2.592593  2.773519  0.156921   18.105185
       Yes     2.242424  2.931515  0.182150   17.977879
Male   No      2.711340  3.113402  0.160669   19.791237
       Yes     2.500000  3.051167  0.152771   22.284500
In [34]:
'Aggregate only tip_pct and size, with rows grouped by sex and day (note how the arguments differ from the call above)'
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker')
Out[34]:
              tip_pct                size          
smoker             No       Yes        No       Yes
sex    day                                         
Female Fri   0.165296  0.209129  2.500000  2.000000
       Sat   0.147993  0.163817  2.307692  2.200000
       Sun   0.165710  0.237075  3.071429  2.500000
       Thur  0.155971  0.163073  2.480000  2.428571
Male   Fri   0.138005  0.144730  2.000000  2.125000
       Sat   0.162132  0.139067  2.656250  2.629630
       Sun   0.158291  0.173964  2.883721  2.600000
       Thur  0.165706  0.164417  2.500000  2.300000
In [35]:
'margins=True adds partial totals: rows and columns labeled All, whose values are the group statistics for all the data within a single tier'
tips.pivot_table(['tip_pct', 'size'], index=['sex', 'day'],
                 columns='smoker', margins=True)
Out[35]:
              tip_pct                          size                    
smoker             No       Yes       All        No       Yes       All
sex    day                                                             
Female Fri   0.165296  0.209129  0.199388  2.500000  2.000000  2.111111
       Sat   0.147993  0.163817  0.156470  2.307692  2.200000  2.250000
       Sun   0.165710  0.237075  0.181569  3.071429  2.500000  2.944444
       Thur  0.155971  0.163073  0.157525  2.480000  2.428571  2.468750
Male   Fri   0.138005  0.144730  0.143385  2.000000  2.125000  2.100000
       Sat   0.162132  0.139067  0.151577  2.656250  2.629630  2.644068
       Sun   0.158291  0.173964  0.162344  2.883721  2.600000  2.810345
       Thur  0.165706  0.164417  0.165276  2.500000  2.300000  2.433333
All          0.159328  0.163196  0.160803  2.668874  2.408602  2.569672
In [36]:
'aggfunc= selects a different aggregation function; here we use len'
tips.pivot_table('tip_pct', index=['sex', 'smoker'], columns='day',
                 aggfunc=len, margins=True)
Out[36]:
day            Fri  Sat  Sun  Thur  All
sex    smoker                          
Female No        2   13   14    25   54
       Yes       7   15    4     7   33
Male   No        2   32   43    20   97
       Yes       8   27   15    10   60
All             19   87   76    62  244
In [37]:
'fill_value= supplies a value for combinations that are otherwise empty'
tips.pivot_table('size', index=['time', 'sex', 'smoker'],
                 columns='day', aggfunc='sum', fill_value=0)
Out[37]:
day                   Fri  Sat  Sun  Thur
time   sex    smoker                     
Dinner Female No        2   30   43     2
              Yes       8   33   10     0
       Male   No        4   85  124     0
              Yes      12   71   39     0
Lunch  Female No        3    0    0    60
              Yes       6    0    0    17
       Male   No        0    0    0    50
              Yes       5    0    0    23
 

Table 9-2: pivot_table options:

values        Column name or names to aggregate; by default aggregates all numeric columns
index         Column names or other group keys to group on the rows of the resulting pivot table
columns       Column names or other group keys to group on the columns of the resulting pivot table
aggfunc       Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context
fill_value    Replace missing values in the result table
dropna=True   Do not include columns whose entries are all NA
margins       Add row/column subtotals and grand totals (False by default)
margins_name='All'  Name for the margin row/column labels
 

Cross-tabulations: crosstab

A cross-tabulation (crosstab for short) is a special case of a pivot table that computes group frequencies.

In [19]:
data = """\
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed"""

f = open('data_save.txt','w')
f.write(data)
f.close()

data = pd.read_table('data_save.txt', sep='\s+', engine='python')
In [20]:
!type data_save.txt
 
Sample    Gender    Handedness
1    Female    Right-handed
2    Male    Left-handed
3    Female    Right-handed
4    Male    Right-handed
5    Male    Left-handed
6    Male    Right-handed
7    Female    Right-handed
8    Female    Left-handed
9    Male    Right-handed
10    Female    Right-handed
In [21]:
data
Out[21]:
   Sample  Gender    Handedness
0       1  Female  Right-handed
1       2    Male   Left-handed
2       3  Female  Right-handed
3       4    Male  Right-handed
4       5    Male   Left-handed
5       6    Male  Right-handed
6       7  Female  Right-handed
7       8  Female   Left-handed
8       9    Male  Right-handed
9      10  Female  Right-handed
In [22]:
'To summarize this data by gender and handedness, you could use pivot_table, but the pandas.crosstab function is more convenient:'
pd.crosstab(data.Gender, data.Handedness, margins=True)
Out[22]:
Handedness  Left-handed  Right-handed  All
Gender                                    
Female                1             4    5
Male                  2             3    5
All                   3             7   10
In [31]:
'The equivalent pivot_table call:'
data.pivot_table('Sample', index='Gender', columns='Handedness', aggfunc='count', margins=True)
Out[31]:
Handedness  Left-handed  Right-handed  All
Gender                                    
Female                1             4    5
Male                  2             3    5
All                   3             7   10
In [23]:
'The first two arguments to crosstab can each be an array, a Series, or a list of arrays.'
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
Out[23]:
smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
 

Example: 2012 Federal Election Commission Database

This is a dataset published by the US Federal Election Commission on contributions to political campaigns. It includes contributor names, occupation and employer, address, and contribution amount.

In [65]:
fec = pd.read_csv('ch09/P00000001-ALL.csv')
 
D:\zwPython\py35\python-3.5.1.amd64\lib\site-packages\IPython\core\interactiveshell.py:2723: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [66]:
'Summary info'
fec.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
cmte_id              1001731 non-null object
cand_id              1001731 non-null object
cand_nm              1001731 non-null object
contbr_nm            1001731 non-null object
contbr_city          1001712 non-null object
contbr_st            1001727 non-null object
contbr_zip           1001620 non-null object
contbr_employer      988002 non-null object
contbr_occupation    993301 non-null object
contb_receipt_amt    1001731 non-null float64
contb_receipt_dt     1001731 non-null object
receipt_desc         14166 non-null object
memo_cd              92482 non-null object
memo_text            97770 non-null object
form_tp              1001731 non-null object
file_num             1001731 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 129.9+ MB
In [67]:
fec.ix[123456] # look at row 123456 (.ix is deprecated in later pandas; fec.iloc[123456] is the modern equivalent)
Out[67]:
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                          50
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object
In [68]:
'.unique(): get the full list of distinct candidate names'
unique_cands = fec.cand_nm.unique()
unique_cands
Out[68]:
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick', 'Cain, Herman',
       'Gingrich, Newt', 'McCotter, Thaddeus G', 'Huntsman, Jon',
       'Perry, Rick'], dtype=object)
In [69]:
unique_cands[2]
Out[69]:
'Obama, Barack'
In [70]:
'Indicate party affiliation with a dict:'
parties = {'Bachmann, Michelle': 'Republican',
           'Cain, Herman': 'Republican',
           'Gingrich, Newt': 'Republican',
           'Huntsman, Jon': 'Republican',
           'Johnson, Gary Earl': 'Republican',
           'McCotter, Thaddeus G': 'Republican',
           'Obama, Barack': 'Democrat',
           'Paul, Ron': 'Republican',
           'Pawlenty, Timothy': 'Republican',
           'Perry, Rick': 'Republican',
           "Roemer, Charles E. 'Buddy' III": 'Republican',
           'Romney, Mitt': 'Republican',
           'Santorum, Rick': 'Republican'}
In [71]:
fec.cand_nm[123456:123461]
Out[71]:
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
In [72]:
'Using this dict and the Series map method, you can compute party affiliation from the candidate names.'
fec.cand_nm[123456:123461].map(parties)
Out[72]:
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
In [73]:
# Add it as a new column
fec['party'] = fec.cand_nm.map(parties)
In [74]:
fec['party'].value_counts()
Out[74]:
Democrat      593746
Republican    407985
Name: party, dtype: int64
In [75]:
'This data includes both contributions and refunds (negative contribution amounts)'
(fec.contb_receipt_amt > 0).value_counts()
Out[75]:
True     991475
False     10256
Name: contb_receipt_amt, dtype: int64
In [76]:
'Restrict the dataset to positive contributions'
fec = fec[fec.contb_receipt_amt > 0]
In [77]:
"Keep only the data for 'Obama, Barack' and 'Romney, Mitt', the two main candidates."
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack', 'Romney, Mitt'])]
 

Donation statistics by occupation and employer

Donation statistics by occupation is another oft-studied statistic. For example, attorneys tend to contribute more to Democrats, while business owners tend to contribute more to Republicans.

In [78]:
'Count donations by occupation (value_counts counts rows, not amounts)'
fec.contbr_occupation.value_counts()[:10]
Out[78]:
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
In [79]:
'Many occupations refer to the same basic job type; the code below cleans this up by mapping one occupation to another. Note the clever use of dict.get, which lets occupations with no mapping pass through unchanged.'
occ_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'INFORMATION REQUESTED (BEST EFFORTS)' : 'NOT PROVIDED',
   'C.E.O.': 'CEO'
}

# If no mapping is provided, return x
f = lambda x: occ_mapping.get(x, x)
fec.contbr_occupation = fec.contbr_occupation.map(f)
In [80]:
emp_mapping = {
   'INFORMATION REQUESTED PER BEST EFFORTS' : 'NOT PROVIDED',
   'INFORMATION REQUESTED' : 'NOT PROVIDED',
   'SELF' : 'SELF-EMPLOYED',
   'SELF EMPLOYED' : 'SELF-EMPLOYED',
}

# If no mapping is provided, return x
f = lambda x: emp_mapping.get(x, x)
fec.contbr_employer = fec.contbr_employer.map(f)
In [81]:
'Aggregate the data by party and occupation with pivot_table'
by_occupation = fec.pivot_table('contb_receipt_amt',
                                index='contbr_occupation',
                                columns='party', aggfunc='sum')
In [82]:
'Select the occupations whose total donations across both parties exceed $2,000,000'
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
over_2mm
Out[82]:
party                 Democrat       Republican
contbr_occupation                              
ATTORNEY           11141982.97   7477194.430000
CEO                 2074974.79   4211040.520000
CONSULTANT          2459912.71   2544725.450000
ENGINEER             951525.55   1818373.700000
EXECUTIVE           1355161.05   4138850.090000
HOMEMAKER           4248875.80  13634275.780000
INVESTOR             884133.00   2431768.920000
LAWYER              3160478.87    391224.320000
MANAGER              762883.22   1444532.370000
NOT PROVIDED        4866973.96  20565473.010000
OWNER               1001567.36   2408286.920000
PHYSICIAN           3735124.94   3594320.240000
PRESIDENT           1878509.95   4720923.760000
PROFESSOR           2165071.08    296702.730000
REAL ESTATE          528902.09   1625902.250000
RETIRED            25305116.38  23561244.489999
SELF-EMPLOYED        672393.40   1640252.540000
In [83]:
'A horizontal bar chart makes this clearer'
over_2mm.plot(kind='barh')
Out[83]:
<matplotlib.axes._subplots.AxesSubplot at 0xacf5cf8>
 
In [91]:
def get_top_amounts(group, key, n=5):
    # Group `group` by key, then sum 'contb_receipt_amt' within each group
    totals = group.groupby(key)['contb_receipt_amt'].sum()

    # Return the n keys with the largest totals
    return totals.sort_values(ascending=False)[:n]
In [92]:
'Aggregate by candidate name, then by occupation'
grouped = fec_mrbo.groupby('cand_nm')
'Note: apply passes each group sub-DataFrame as the first argument of get_top_amounts (the group parameter), so only the remaining arguments are supplied here.'
grouped.apply(get_top_amounts, 'contbr_occupation', n=7) # a second, nested grouping (grouped is already grouped data)
Out[92]:
cand_nm        contbr_occupation                     
Obama, Barack  RETIRED                                   25305116.38
               ATTORNEY                                  11141982.97
               INFORMATION REQUESTED                      4866973.96
               HOMEMAKER                                  4248875.80
               PHYSICIAN                                  3735124.94
               LAWYER                                     3160478.87
               CONSULTANT                                 2459912.71
Romney, Mitt   RETIRED                                   11508473.59
               INFORMATION REQUESTED PER BEST EFFORTS    11396894.84
               HOMEMAKER                                  8147446.22
               ATTORNEY                                   5364718.82
               PRESIDENT                                  2491244.89
               EXECUTIVE                                  2300947.03
               C.E.O.                                     1968386.11
Name: contb_receipt_amt, dtype: float64
In [93]:
'Aggregate by candidate name, then by employer'
grouped.apply(get_top_amounts, 'contbr_employer', n=10)
Out[93]:
cand_nm        contbr_employer                       
Obama, Barack  RETIRED                                   22694358.85
               SELF-EMPLOYED                             17080985.96
               NOT EMPLOYED                               8586308.70
               INFORMATION REQUESTED                      5053480.37
               HOMEMAKER                                  2605408.54
               SELF                                       1076531.20
               SELF EMPLOYED                               469290.00
               STUDENT                                     318831.45
               VOLUNTEER                                   257104.00
               MICROSOFT                                   215585.36
Romney, Mitt   INFORMATION REQUESTED PER BEST EFFORTS    12059527.24
               RETIRED                                   11506225.71
               HOMEMAKER                                  8147196.22
               SELF-EMPLOYED                              7409860.98
               STUDENT                                     496490.94
               CREDIT SUISSE                               281150.00
               MORGAN STANLEY                              267266.00
               GOLDMAN SACH & CO.                          238250.00
               BARCLAYS CAPITAL                            162750.00
               H.I.G. CAPITAL                              139500.00
Name: contb_receipt_amt, dtype: float64
 

Bucketing donation amounts

In [94]:
'Another useful analysis on this data: use the cut function to discretize the contributions into buckets by amount'
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
labels
Out[94]:
411           (10, 100]
412         (100, 1000]
413         (100, 1000]
414           (10, 100]
415           (10, 100]
416           (10, 100]
417         (100, 1000]
418           (10, 100]
419         (100, 1000]
420           (10, 100]
421           (10, 100]
422         (100, 1000]
423         (100, 1000]
424         (100, 1000]
425         (100, 1000]
426         (100, 1000]
427       (1000, 10000]
428         (100, 1000]
429         (100, 1000]
430           (10, 100]
431       (1000, 10000]
432         (100, 1000]
433         (100, 1000]
434         (100, 1000]
435         (100, 1000]
436         (100, 1000]
437           (10, 100]
438         (100, 1000]
439         (100, 1000]
440           (10, 100]
              ...      
701356        (10, 100]
701357          (1, 10]
701358        (10, 100]
701359        (10, 100]
701360        (10, 100]
701361        (10, 100]
701362      (100, 1000]
701363        (10, 100]
701364        (10, 100]
701365        (10, 100]
701366        (10, 100]
701367        (10, 100]
701368      (100, 1000]
701369        (10, 100]
701370        (10, 100]
701371        (10, 100]
701372        (10, 100]
701373        (10, 100]
701374        (10, 100]
701375        (10, 100]
701376    (1000, 10000]
701377        (10, 100]
701378        (10, 100]
701379      (100, 1000]
701380    (1000, 10000]
701381        (10, 100]
701382      (100, 1000]
701383          (1, 10]
701384        (10, 100]
701385      (100, 1000]
Name: contb_receipt_amt, dtype: category
Categories (8, object): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
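`pd.cut` labels each value with the half-open interval it falls into. A self-contained sketch with invented amounts, using the same order-of-magnitude bin edges as above:

```python
import numpy as np
import pandas as pd

# Discretize amounts into order-of-magnitude buckets (toy values).
amounts = pd.Series([0.5, 3.0, 45.0, 250.0, 7500.0])
bins = np.array([0, 1, 10, 100, 1000, 10000])
labels = pd.cut(amounts, bins)

# Each element is an Interval; 45.0 falls into (10, 100].
print(labels)
print(labels.value_counts().sort_index())
```

The result is a categorical Series whose categories are the ordered intervals, so it can be used directly as a group key, as In [95] does.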
In [95]:
'Group the data by candidate name and bucket label'
grouped = fec_mrbo.groupby(['cand_nm', labels])
'Count the number of contributions in each group'
grouped.size().unstack(0)
Out[95]:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                         493            77
(1, 10]                      40070          3681
(10, 100]                   372280         31853
(100, 1000]                 153991         43357
(1000, 10000]                22284         26186
(10000, 100000]                  2             1
(100000, 1000000]                3           NaN
(1000000, 10000000]              4           NaN
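The `size().unstack(0)` idiom counts rows per (candidate, bucket) pair and pivots the first index level into columns. A minimal sketch with toy data:

```python
import pandas as pd

# Count rows per (cand, bin) pair, then move 'cand' (level 0) to columns,
# as grouped.size().unstack(0) does above (toy data).
df = pd.DataFrame({'cand': ['A', 'A', 'B', 'A'],
                   'bin':  ['low', 'high', 'low', 'low']})
counts = df.groupby(['cand', 'bin']).size().unstack(0)
print(counts)
```

Pairs that never occur, like ('B', 'high') here, become NaN after unstacking, which is why Romney shows NaN in the two largest buckets above.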
In [96]:
'Sum the contribution amounts and normalize within each bucket, to visualize each candidate\'s share of every donation size:'
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
bucket_sums
Out[96]:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                      318.24         77.00
(1, 10]                  337267.62      29819.66
(10, 100]              20288981.41    1987783.76
(100, 1000]            54798531.46   22363381.69
(1000, 10000]          51753705.67   63942145.42
(10000, 100000]           59100.00      12700.00
(100000, 1000000]       1490683.08           NaN
(1000000, 10000000]     7148839.76           NaN
In [97]:
'Normalize the table above so each row sums to 1'
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
normed_sums
Out[97]:
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt                               
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000           NaN
(1000000, 10000000]       1.000000           NaN
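The normalization step divides each row by its row sum, so the two candidates' shares in every bucket add to 1. A small sketch with invented numbers:

```python
import pandas as pd

# Mirror bucket_sums.div(bucket_sums.sum(axis=1), axis=0): divide every
# row by that row's total (toy numbers, not the FEC data).
bucket_sums = pd.DataFrame({'A': [300.0, 100.0], 'B': [100.0, 300.0]},
                           index=['small', 'large'])
normed = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
print(normed)
```

`axis=0` tells `div` to align the divisor Series with the row index, i.e. each row gets its own divisor.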
In [98]:
'Plot it, excluding the two largest buckets, since those contributions are not from individual donors.'
normed_sums[:-2].plot(kind='barh', stacked=True)
Out[98]:
<matplotlib.axes._subplots.AxesSubplot at 0xb47cfd0>
 
 

Of course, this analysis could be refined and improved in many ways. For example, the data could be aggregated by donor name and zip code to distinguish donors who made many small contributions from those who made one or more large ones. I strongly encourage you to download the data and explore it yourself.

 

Donation statistics by state

In [99]:
grouped = fec_mrbo.groupby(['cand_nm', 'contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals = totals[totals.sum(1) > 100000]
totals[:10]
Out[99]:
cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK             281840.15      86204.24
AL             543123.48     527303.51
AR             359247.28     105556.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FL            7318178.58    8338458.81
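The `totals[totals.sum(1) > 100000]` line keeps only the states whose combined contributions cross a threshold. The boolean-filtering idiom in isolation, with toy numbers:

```python
import pandas as pd

# Keep only rows whose total across columns exceeds a threshold, as with
# totals[totals.sum(1) > 100000] above (invented state names and amounts).
totals = pd.DataFrame({'A': [90.0, 500.0], 'B': [5.0, 600.0]},
                      index=['small_state', 'big_state'])
filtered = totals[totals.sum(axis=1) > 200]
print(filtered)
```

`totals.sum(axis=1)` produces one total per row; comparing it to the threshold yields a boolean Series that selects the qualifying rows.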
In [100]:
percent = totals.div(totals.sum(1), axis=0)
percent[:10]
Out[100]:
cand_nm    Obama, Barack  Romney, Mitt
contbr_st                             
AK              0.765778      0.234222
AL              0.507390      0.492610
AR              0.772902      0.227098
AZ              0.443745      0.556255
CA              0.679498      0.320502
CO              0.585970      0.414030
CT              0.371476      0.628524
DC              0.810113      0.189887
DE              0.802776      0.197224
FL              0.467417      0.532583
 

'A large block of code should follow here (plotting each state's total contributions on a map), but this document gives no further information, and the translator also noted that the code could not be run, so it is omitted.'

In [ ]:
 
posted @ 2017-04-02 18:06 she35