pandas笔记-ch05:pandas入门

 

ch05

 

 

 

pandas入门

In [1]:
from pandas import Series, DataFrame
import pandas as pd
In [2]:
'以下不知道什么意思。'
%matplotlib inline
from __future__ import division
from numpy.random import randn
import numpy as np
import os
import matplotlib.pyplot as plt
np.random.seed(12345)
In [3]:
plt.rc('figure', figsize=(10, 6))
In [4]:
from pandas import Series, DataFrame
import pandas as pd
np.set_printoptions(precision=4)
In [9]:
%pwd
Out[9]:
'D:\\WinPython-32bit-3.4.3.7\\notebooks\\Python for Data Analysis'
 

pandas数据结构的介绍

 

Series

理解1:一维数组 + 索引值
理解2:定长有序字典

In [14]:
'Series = 一维数组 + 索引值'
obj = Series([4, 7, -5, 3])
obj
Out[14]:
0    4
1    7
2   -5
3    3
dtype: int64
In [15]:
'访问 Series 的 values 和 index'
obj.values
Out[15]:
array([ 4,  7, -5,  3], dtype=int64)
In [16]:
obj.index
Out[16]:
Int64Index([0, 1, 2, 3], dtype='int64')
In [18]:
'自定义 index'
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
Out[18]:
d    4
b    7
a   -5
c    3
dtype: int64
In [19]:
obj2.index
Out[19]:
Index(['d', 'b', 'a', 'c'], dtype='object')
In [21]:
'通过索引取值'
obj2['a']
Out[21]:
-5
In [22]:
'索引赋值'
obj2['d'] = 6
obj2[['c', 'a', 'd']]
Out[22]:
c    3
a   -5
d    6
dtype: int64
In [27]:
'一般的数学运算会保留 index 与 values 的联系'
obj2[obj2 > 0]
Out[27]:
d    6
b    7
c    3
dtype: int64
In [28]:
obj2 * 2
Out[28]:
d    12
b    14
a   -10
c     6
dtype: int64
In [29]:
'注意,这里是 np.exp 方法接受 pd 参数'
np.exp(obj2)
Out[29]:
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
In [30]:
'把Series看成定长有序的 字典'
'b' in obj2
Out[30]:
True
In [31]:
'e' in obj2
Out[31]:
False
In [32]:
'通过字典创建Series'
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3
Out[32]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [34]:
'字典中的 键 不在 index中则返回 NaN'
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4
Out[34]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
In [37]:
'isnull notnull 检测 NaN 值'
pd.isnull(obj4) # 顶层方法检测
Out[37]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [38]:
pd.notnull(obj4)
Out[38]:
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
In [40]:
obj4.isnull() #实例化方法检测
Out[40]:
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
In [41]:
'Series在算术运算中自动对齐不同索引的数据'
obj3
Out[41]:
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
In [42]:
obj4
Out[42]:
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
dtype: float64
In [43]:
obj3 + obj4
Out[43]:
California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah             NaN
dtype: float64
In [44]:
'Series本身及其索引都有一个name属性,该属性与pandas的其他关键功能关系密切'
obj4.name = 'population'
obj4.index.name = 'state'
obj4
Out[44]:
state
California      NaN
Ohio          35000
Oregon        16000
Texas         71000
Name: population, dtype: float64
In [46]:
'通过赋值修改索引'
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
Out[46]:
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
 

DataFrame

In [50]:
'由等长列表或数组构成DataFrame'
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)

frame #自动添加索引,列有序排序
Out[50]:
 popstateyear
0 1.5 Ohio 2000
1 1.7 Ohio 2001
2 3.6 Ohio 2002
3 2.4 Nevada 2001
4 2.9 Nevada 2002
In [52]:
'指定顺序对列进行排序'
DataFrame(data, columns=['year', 'state', 'pop'])
Out[52]:
 yearstatepop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2002 Ohio 3.6
3 2001 Nevada 2.4
4 2002 Nevada 2.9
In [53]:
'索引值与 产生NaN的行为'
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['one', 'two', 'three', 'four', 'five'])
frame2
Out[53]:
 yearstatepopdebt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
five 2002 Nevada 2.9 NaN
In [54]:
frame2.columns
Out[54]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [55]:
'通过字典获取Series'
frame2['state']
Out[55]:
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
In [56]:
'通过属性获取Series'
frame2.year
Out[56]:
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
In [57]:
'通过 framer.ix: 获取Series'
frame2.ix['three']
Out[57]:
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
In [58]:
'赋值修改DataFrame'
frame2['debt'] = 16.5
frame2
Out[58]:
 yearstatepopdebt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
In [59]:
frame2['debt'] = np.arange(5.)
frame2
Out[59]:
 yearstatepopdebt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
In [61]:
'列表或数组赋值给DataFrame列,长度必须对应相同, 如果赋值是一个Series,则通过索引匹配赋值'
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2
Out[61]:
 yearstatepopdebt
one 2000 Ohio 1.5 NaN
two 2001 Ohio 1.7 -1.2
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 -1.5
five 2002 Nevada 2.9 -1.7
In [65]:
'创建新列:给不在column中的元素赋值'
frame2['eastern'] = frame2.state == 'Ohio'
frame2
Out[65]:
 yearstatepopdebteastern
one 2000 Ohio 1.5 NaN True
two 2001 Ohio 1.7 -1.2 True
three 2002 Ohio 3.6 NaN True
four 2001 Nevada 2.4 -1.5 False
five 2002 Nevada 2.9 -1.7 False
In [66]:
'删除列'
del frame2['eastern']
frame2.columns
Out[66]:
Index(['year', 'state', 'pop', 'debt'], dtype='object')
In [5]:
'嵌套字典赋值DataFrame,外层键为列,内层键为行索引'
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
In [6]:
frame3 = DataFrame(pop)
frame3
Out[6]:
 NevadaOhio
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [7]:
'转置'
frame3.T
Out[7]:
 200020012002
Nevada NaN 2.4 2.9
Ohio 1.5 1.7 3.6
In [8]:
'默认下,内层字典的键会被合并、排序以形成最终的索引。如果显式指定了索引,则不会这样'
DataFrame(pop, index=[2001, 2002, 2003])
Out[8]:
 NevadaOhio
2001 2.4 1.7
2002 2.9 3.6
2003 NaN NaN
In [9]:
'由Series组成的字典差不多也是一样的用法'
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)
Out[9]:
 NevadaOhio
2000 NaN 1.5
2001 2.4 1.7
 

表5-1:可以输入给DataFrame构造器的数据

二维ndarray         数据矩阵,还可以传入行标和列标
由数组、列标或元组组成的字典  每个序列会变成DataFrame的一列。所有序列的长度必须相同
NumPy的结构化/记录数组    类似于‘由数组组成的字典’
有Series组成的字典    每个Series会成为一列。如果没有显示指定索引,则各Series的索引会被合并成结果的行索引
由字典组成的字典     各内层字典会成为一列。键会被合并成结果的行索引,个‘由Series组成的字典’的情况一样
字典或Series的列表    各项将成为DataFrame的一行。字典键或Series索引的并集将成为DataFrame的列标
由列表或元组组成的列表 类似于‘二维ndarray’
另一个DataFrame      该DataFrame的索引将被沿用,除非显式指定了其他索引
Numpy的MaskedArray    类似于‘二维ndarray’的情况,只是掩码值在结果DataFrame会变成NA/缺失值
In [10]:
'指定DataFrame的index 和 columns 的 name'
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3
Out[10]:
stateNevadaOhio
year  
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [73]:
'values属性以二维ndarray的形式返回'
frame3.values
Out[73]:
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
In [74]:
'各列数据类型不同的DataFrame的values情况'
frame2.values
Out[74]:
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
 

索引对象

In [76]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Out[76]:
Index(['a', 'b', 'c'], dtype='object')
In [77]:
index[1:]
Out[77]:
Index(['b', 'c'], dtype='object')
In [80]:
'index对象不可修改,以保证index对象在多个数据结构之间安全共享'
index[1] = 'd'
 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-80-8b1cf8f64320> in <module>()
      1 'index对象不可修改,以保证index对象在多个数据结构之间安全共享'
----> 2index[1] = 'd'

D:\WinPython-32bit-3.4.3.7\python-3.4.3\lib\site-packages\pandas\core\index.py in __setitem__(self, key, value)
   1128 
   1129     def __setitem__(self, key, value):
-> 1130raise TypeError("Index does not support mutable operations")
   1131 
   1132     def __getitem__(self, key):

TypeError: Index does not support mutable operations
In [81]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index
Out[81]:
True
 

表5-2:pandas中主要的index对象

Iindex      最泛化的index对象,将轴标签表示为一个由Python对象组成的Numpy数组
Int64Index   针对整数的特殊index
MultiIndex   ‘层次化’索引对象,表示单个轴上的多层索引。可以看做由单元组组成的数组
DatetimeIndex 存储纳秒级的时间戳(用Numpy的datetime64类型表示)
PeriodIndex   针对Period数据(时间间隔)的特殊Index
In [82]:
'除了长得像数组,index的功能也类似一个固定大小的集合'
frame3
Out[82]:
stateNevadaOhio
year  
2000 NaN 1.5
2001 2.4 1.7
2002 2.9 3.6
In [83]:
'Ohio' in frame3.columns
Out[83]:
True
In [84]:
2003 in frame3.index
Out[84]:
False
每个索引都有一些方法和属性,他们可以用于设置逻辑并回答有关该索引所包含的数据的常见问题。表5-3列出了这些函数:
 

表5-3 index的属性和方法:

append      连接另一个Index对象,产生一个新的Index
diff       计算差集,并得到一个Index
intersection  计算交集
union       计算并集
isin       计算一个指示各值是否都包含在参数合中的布尔型数组
delete      删除索引i处的元素,并得到新的Index
drop       删除传入的值,并得到新的Index
insert      将元素插入到索引i处,并得到新的Index
is_monotonic  当各元素均大于等于前一个元素时,返回True
is_unique    当Index没有重复时,返回True
unique      计算Index中唯一值的数组
 

基本功能

 

重新索引

In [4]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[4]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [5]:
'reindex重建索引(改变索引的排序顺序),不存在对应索引值则为NaN'
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[5]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
In [7]:
'reindex(fill_value),NaN的填充值'
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
Out[7]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
In [9]:
'插值:method='
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')
Out[9]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
 

表5-4:reindex的插值(method)选项

ffill或pad    前后填充(或搬运)值
bfill或backfill 向后填充(或搬运)值
In [11]:
'reindex既可以修改行索引,也可以修改列'
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
frame
Out[11]:
 OhioTexasCalifornia
a 0 1 2
c 3 4 5
d 6 7 8
In [12]:
frame2 = frame.reindex(['a', 'b', 'c', 'd']) #默认行索引
frame2
Out[12]:
 OhioTexasCalifornia
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
In [13]:
states = ['Texas', 'Utah', 'California'] #列 重新排序
frame.reindex(columns=states)
Out[13]:
 TexasUtahCalifornia
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [14]:
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
              columns=states) #行索引,列同时排序
Out[14]:
 TexasUtahCalifornia
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
In [16]:
'ix 的标签索引功能,重新索引使用起来看起来更简单'
frame.ix[['a', 'b', 'c', 'd'], states]
Out[16]:
 TexasUtahCalifornia
a 1 NaN 2
b NaN NaN NaN
c 4 NaN 5
d 7 NaN 8
 

表5-5reindex函数的参数:

index      用作索引的新序列。既可以是Index实例,也可以是其他序列型的Python数据结构。Index会被完全使用,就像没有任何复制一样
method     插值(填充)方式,具体参数请参见表5-4
fill_value   在重新索引的过程中,需要引入缺失值时使用的替代值
limit      前向或后向填充时的最大填充量
level      在MultiIndex的指定级别上匹配简单索引,否则选取其子集
copy       默认为True,无论如何都复制;如果为False,则新旧相等就不复制
 

丢弃指定轴上的项

本质:先复制,在处理,返回处理后的结果。

In [17]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj
Out[17]:
a    0
b    1
d    3
e    4
dtype: float64
In [18]:
obj.drop(['d', 'c'])
Out[18]:
a    0
b    1
e    4
dtype: float64
In [19]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
In [20]:
data.drop(['Colorado', 'Ohio'])
Out[20]:
 onetwothreefour
Utah 8 9 10 11
New York 12 13 14 15
In [21]:
data.drop('two', axis=1)
Out[21]:
 onethreefour
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
In [22]:
data.drop(['two', 'four'], axis=1)
Out[22]:
 onethree
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
 

索引,选取和过滤

In [26]:
'Series索引的使用方法与array类似,只不过索引值还可以是其他形式'
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']
Out[26]:
1.0
In [27]:
obj[1]
Out[27]:
1.0
In [28]:
obj[2:4]
Out[28]:
c    2
d    3
dtype: float64
In [29]:
obj[['b', 'a', 'd']]
Out[29]:
b    1
a    0
d    3
dtype: float64
In [30]:
obj[[1, 3]]
Out[30]:
b    1
d    3
dtype: float64
In [31]:
obj[obj < 2]
Out[31]:
a    0
b    1
dtype: float64
In [32]:
'注意这里的索引值的切片与Python的切片的不同:'
obj['b':'c']
Out[32]:
b    1
c    2
dtype: float64
In [33]:
obj['b':'c'] = 5
obj
Out[33]:
a    0
b    5
c    5
d    3
dtype: float64
In [37]:
'DataFrame索引'
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data
Out[37]:
 onetwothreefour
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [38]:
data['two']
Out[38]:
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
In [39]:
data[['three', 'one']]
Out[39]:
 threeone
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
In [40]:
data[:2]
Out[40]:
 onetwothreefour
Ohio 0 1 2 3
Colorado 4 5 6 7
In [41]:
data[data['three'] > 5]
Out[41]:
 onetwothreefour
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [42]:
data < 5
Out[42]:
 onetwothreefour
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
In [43]:
data[data < 5] = 0
In [44]:
data
Out[44]:
 onetwothreefour
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [45]:
'ix 数值以及行列标签的(切片+索引)'
data.ix['Colorado', ['two', 'three']]
Out[45]:
two      5
three    6
Name: Colorado, dtype: int32
In [46]:
data.ix[['Colorado', 'Utah'], [3, 0, 1]]
Out[46]:
 fouronetwo
Colorado 7 0 5
Utah 11 8 9
In [47]:
data.ix[2]
Out[47]:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int32
In [48]:
data.ix[:'Utah', 'two']
Out[48]:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32
In [49]:
data.ix[data.three > 5, :3]
Out[49]:
 onetwothree
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
In [51]:
data.ix[2:4,1:3]
Out[51]:
 twothree
Utah 9 10
New York 13 14
 

表5-6:DataFrame的索引选项

obj[val]   选取DataFrame的单个列或一组列。在一些特殊情况下会比较便利:布尔型数组(过滤行)、切片(行切片)、布尔型DataFrame(根据条件设置值)
obj.ix[val]   选取DataFrame的单个行或一组行
obj.ix[:,val]  选取单个列或列子集
obj.ix[val1,val2]  同时选取行和列
reindex方法   将一个或多个轴匹配到新索引
xs方法       根据标签选取单行或单列,并返回一个Series
icol,irow方法  根据整数位置选取单列或单行,并返回一个Series
get_value,set_value方法  根据行标签和列标签选取单个值
 

算术运算和数据对齐

In [55]:
'针对Series'
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [56]:
s1
Out[56]:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
In [57]:
s2
Out[57]:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
In [58]:
s1 + s2
Out[58]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
In [61]:
'针对DataFrame:'
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
Out[61]:
 bcd
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
In [62]:
df2
Out[62]:
 bde
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [63]:
df1 + df2
Out[63]:
 bcde
Colorado NaN NaN NaN NaN
Ohio 3 NaN 6 NaN
Oregon NaN NaN NaN NaN
Texas 9 NaN 12 NaN
Utah NaN NaN NaN NaN
 

在算术方法中填充值

In [64]:
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1
Out[64]:
 abcd
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [65]:
df2
Out[65]:
 abcde
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
In [66]:
df1 + df2
Out[66]:
 abcde
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN
In [67]:
'算术方法中NaN的填充(这里只有一方有NaN,所以只填充一方)'
df1.add(df2, fill_value=0)
Out[67]:
 abcde
0 0 2 4 6 4
1 9 11 13 15 9
2 18 20 22 24 14
3 15 16 17 18 19
In [68]:
'reindex中的NaN填充'
df1.reindex(columns=df2.columns, fill_value=0)
Out[68]:
 abcde
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0
 

表5-7:灵活的算术方法

add   用于加法(+)的方法
sub          -      
div          /
mul          *
 

DataFrame 和 Series 之间的运算

In [69]:
'Numpy中的广播运算'
arr = np.arange(12.).reshape((3, 4))
arr
Out[69]:
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
In [70]:
arr[0]
Out[70]:
array([ 0.,  1.,  2.,  3.])
In [71]:
'这种运算称为广播(broadcasting)'
arr - arr[0]
Out[71]:
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
In [72]:
'Pandas中的广播运算:'
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
frame
Out[72]:
 bde
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [73]:
series
Out[73]:
b    0
d    1
e    2
Name: Utah, dtype: float64
In [74]:
'默认情况下,DataFrame和Series之间的运算会将Series的索引匹配到DataFrame的列,然后沿着行向下广播'
frame - series
Out[74]:
 bde
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
In [75]:
'缺省值处理'
series2 = Series(range(3), index=['b', 'e', 'f'])
frame + series2
Out[75]:
 bdef
Utah 0 NaN 3 NaN
Ohio 3 NaN 6 NaN
Texas 6 NaN 9 NaN
Oregon 9 NaN 12 NaN
In [76]:
'行匹配,列广播的方法:'
series3 = frame['d']
frame
Out[76]:
 bde
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
In [77]:
series3
Out[77]:
Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64
In [78]:
'传入的轴是希望匹配的轴'
frame.sub(series3, axis=0)
Out[78]:
 bde
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
 

函数应用和映射

In [79]:
'Numpy的ufuncs也可以用于操作pandas对象'
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [80]:
frame
Out[80]:
 bde
Utah -0.204708 0.478943 -0.519439
Ohio -0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 -1.296221
In [81]:
np.abs(frame)
Out[81]:
 bde
Utah 0.204708 0.478943 0.519439
Ohio 0.555730 1.965781 1.393406
Texas 0.092908 0.281746 0.769023
Oregon 1.246435 1.007189 1.296221
In [82]:
'DataFrame与函数的联合'
f = lambda x: x.max() - x.min()
In [83]:
frame.apply(f)
Out[83]:
b    1.802165
d    1.684034
e    2.689627
dtype: float64
In [84]:
frame.apply(f, axis=1) #自定义运用的轴
Out[84]:
Utah      0.998382
Ohio      2.521511
Texas     0.676115
Oregon    2.542656
dtype: float64
In [85]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)
#注:许多DataFrame的内置函数如sum,mean等功能可以直接使用,而不需要使用apply多此一举
Out[85]:
 bde
min -0.555730 0.281746 -1.296221
max 1.246435 1.965781 1.393406
In [86]:
'元素级的函数:applymap'
format = lambda x: '%.2f' % x
frame.applymap(format)
Out[86]:
 bde
Utah -0.20 0.48 -0.52
Ohio -0.56 1.97 1.39
Texas 0.09 0.28 0.77
Oregon 1.25 1.01 -1.30
In [87]:
frame['e'].map(format)
Out[87]:
Utah      -0.52
Ohio       1.39
Texas      0.77
Oregon    -1.30
Name: e, dtype: object
 

排序和排名

In [88]:
'sort_index按 行 或 列 索引排序'
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
Out[88]:
a    1
b    2
c    3
d    0
dtype: int32
In [89]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])
frame.sort_index()
Out[89]:
 dabc
one 4 5 6 7
three 0 1 2 3
In [90]:
frame.sort_index(axis=1)
Out[90]:
 abcd
three 1 2 3 0
one 5 6 7 4
In [91]:
frame.sort_index(axis=1, ascending=False)
Out[91]:
 dcba
three 0 3 2 1
one 4 7 6 5
In [92]:
'order按值进行排序(用法过时,应该为sort_values)'
obj = Series([4, 7, -3, 2])
obj.order()
 
D:\zwPython\py35\python-3.5.1.amd64\lib\site-packages\ipykernel\__main__.py:3: FutureWarning: order is deprecated, use sort_values(...)
  app.launch_new_instance()
Out[92]:
2   -3
3    2
0    4
1    7
dtype: int64
In [95]:
'sort_values按值进行排序'
obj = Series([4, 7, -3, 2])
obj.sort_values()
Out[95]:
2   -3
3    2
0    4
1    7
dtype: int64
In [96]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values() #NaN排在末尾
Out[96]:
4    -3
5     2
0     4
2     7
1   NaN
3   NaN
dtype: float64
In [97]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
Out[97]:
 ab
0 0 4
1 1 7
2 0 -3
3 1 2
In [98]:
'依据某一列(或者多列)的值进行排序:传递 by= 选项 (已经过时:应修改为sort_values(by=))'
frame.sort_index(by='b')
 
D:\zwPython\py35\python-3.5.1.amd64\lib\site-packages\ipykernel\__main__.py:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
  if __name__ == '__main__':
Out[98]:
 ab
2 0 -3
3 1 2
0 0 4
1 1 7
In [100]:
'正确做法:'
frame.sort_values(by='b')
Out[100]:
 ab
2 0 -3
3 1 2
0 0 4
1 1 7
In [102]:
'多列排序'
frame.sort_values(by=['a', 'b'])
Out[102]:
 ab
2 0 -3
0 0 4
3 1 2
1 1 7
 

不懂?????

In [103]:
'默认:method="average" 在相等分组中,为各个值分配平均排名'
obj = Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()
Out[103]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
In [104]:
'method="first" 按值在原始数据中的出现顺序分配排名'
obj.rank(method='first')
Out[104]:
0    6
1    1
2    7
3    4
4    3
5    2
6    5
dtype: float64
In [106]:
'ascending=False: 降序 '
'method="max": 使用整个分组的最大排名'
obj.rank(ascending=False, method='max')
Out[106]:
0    2
1    7
2    2
3    4
4    5
5    6
6    4
dtype: float64
In [107]:
frame = DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                   'c': [-2, 5, 8, -2.5]})
frame
Out[107]:
 abc
0 0 4.3 -2.0
1 1 7.0 5.0
2 0 -3.0 8.0
3 1 2.0 -2.5
In [108]:
frame.rank(axis=1)
Out[108]:
 abc
0 2 3 1
1 1 3 2
2 2 1 3
3 2 3 1
 

表5-8:排名时用于破坏平级关系的method=选项

'average'    默认:在相等分组中,为各个值分配平均排名
'min'       使用整个分组的最小排名
'max'       使用整个分组的最大排名          
'first'      按值在原始数据中的出现顺序分配排名
 

带有重复值的轴索引

In [109]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj #obj带有重复索引值
Out[109]:
a    0
a    1
b    2
b    3
c    4
dtype: int32
In [110]:
obj.index.is_unique #判断索引值是否唯一
Out[110]:
False
In [111]:
'重复索引的数据选取行为:'
obj['a']
Out[111]:
a    0
a    1
dtype: int32
In [112]:
obj['c']
Out[112]:
4
In [113]:
'DataFrame的重复索引情况:'
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df
Out[113]:
 012
a 0.274992 0.228913 1.352917
a 0.886429 -2.001637 -0.371843
b 1.669025 -0.438570 -0.539741
b 0.476985 3.248944 -1.021228
In [114]:
df.ix['b']
Out[114]:
 012
b 1.669025 -0.438570 -0.539741
b 0.476985 3.248944 -1.021228
 

汇总和计算描述统计

In [116]:
df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'],
               columns=['one', 'two'])
df
Out[116]:
 onetwo
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3
In [117]:
df.sum()
Out[117]:
one    9.25
two   -5.80
dtype: float64
In [118]:
df.sum(axis=1)
Out[118]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [119]:
df.mean(axis=1, skipna=False)
Out[119]:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
 

表5-9:约简方法的选项

axis= 约简的轴。DataFrame的行用0,列用1 skipna= 排除缺失值,默认值为True level= 如果轴是层次化索引的(即MultiIndex),则根据level分组约简

In [122]:
'其他情况:'
'idxmax反回间接统计:(到达最大值的索引)'
df.idxmax()
Out[122]:
one    b
two    d
dtype: object
In [123]:
'cumsum:累计型'
df.cumsum()
Out[123]:
 onetwo
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8
In [124]:
'describe:描述型'
df.describe()
Out[124]:
 onetwo
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000
In [125]:
'describe:非数值型的汇总方法:'
obj = Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
Out[125]:
count     16
unique     3
top        a
freq       8
dtype: object
 

表5-10:描述和汇总统计

count 非NA值的数量 describe 针对Series或各DataFrame列计算汇总统计 min,max 计算最小值和最大值 argmin,argmax 计算能够获取到最小值和最大值的索引位置(整数) idxmin,idxmax 计算能够获取到最小值和最大值的索引值 quantile 计算样本的分位数(0到1) sum 值的总和 mean 值的平均数 median 值的算术中位数(50%分位数) mad 根据平均值计算平均绝对离差 var 样本值的方差 std 样本值的标准差 skew 样本值的偏度(三阶矩) kurt 样本值的峰度(四阶矩) cumsum 样本值的累积和 cummin,cummax 样本值的累计最大值和累计最小值 cumprod 样本值的累计积 diff 计算一阶差分(对时间序列很有用) pct_change 计算百分数变化

 

相关系数与协方差

x.corr() 与 x.cov()
x.corrwith()

In [135]:
import pandas_datareader.data as web

all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker)

price = DataFrame({tic: data['Adj Close']
                   for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume']
                    for tic, data in all_data.items()})
In [136]:
'x.pct_change(): 计算百分数变化'
returns = price.pct_change()
returns.tail()
Out[136]:
 AAPLGOOGIBMMSFT
Date    
2016-03-11 0.010774 0.019640 0.015479 0.019597
2016-03-14 0.002542 0.005049 0.002950 0.001884
2016-03-15 0.020094 -0.002957 0.001261 0.007899
2016-03-16 0.013291 0.010655 0.012801 0.014182
2016-03-17 -0.001604 0.002296 0.015540 0.005704
In [137]:
'计算两个Series中的重叠的、非NA的、按索引对齐的值的相关系数'
returns.MSFT.corr(returns.IBM)
Out[137]:
0.50131900203209123
In [138]:
'计算协方差'
returns.MSFT.cov(returns.IBM)
Out[138]:
9.0385250591459499e-05
In [139]:
'完整的相关系数矩阵'
returns.corr()
Out[139]:
 AAPLGOOGIBMMSFT
AAPL 1.000000 0.409171 0.394914 0.396870
GOOG 0.409171 1.000000 0.400706 0.455242
IBM 0.394914 0.400706 1.000000 0.501319
MSFT 0.396870 0.455242 0.501319 1.000000
In [140]:
'完整的相关系数协方差'
returns.cov()
Out[140]:
 AAPLGOOGIBMMSFT
AAPL 0.000286 0.000113 0.000081 0.000099
GOOG 0.000113 0.000265 0.000079 0.000110
IBM 0.000081 0.000079 0.000148 0.000090
MSFT 0.000099 0.000110 0.000090 0.000219
In [141]:
'returns.corrwith(): 返回相关系数的Series'
returns.corrwith(returns.IBM)
Out[141]:
AAPL    0.394914
GOOG    0.400706
IBM     1.000000
MSFT    0.501319
dtype: float64
In [142]:
'传入DataFrame则会计算按列名配对的相关系数。这里计算百分比变化与成交量的相关系数'
returns.corrwith(volume)
Out[142]:
AAPL   -0.082436
GOOG   -0.003506
IBM    -0.203408
MSFT   -0.083732
dtype: float64
 

唯一值,值计数以及成员资格

In [143]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [144]:
uniques = obj.unique()
uniques
Out[144]:
array(['c', 'a', 'd', 'b'], dtype=object)
In [145]:
'value_counts的实例化方法'
obj.value_counts()
Out[145]:
a    3
c    3
b    2
d    1
dtype: int64
In [154]:
'注意obj.values与obj的不同:'
obj.values
Out[154]:
array(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'], dtype=object)
In [155]:
obj
Out[155]:
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
In [146]:
'value_counts的顶级pandas的方法'
pd.value_counts(obj.values, sort=False)
Out[146]:
c    3
d    1
b    2
a    3
dtype: int64
In [147]:
'isin:成员资格测试'
mask = obj.isin(['b', 'c'])
mask
Out[147]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [148]:
obj[mask]
Out[148]:
0    c
5    b
6    b
7    c
8    c
dtype: object
 

表5-11:唯一值、值计数、成员资格方法

isin 计算一个表示‘Series各值是否包含于传入的值序列中’的布尔型数组 unique 计算Series中的唯一值数组,按发现的顺序返回 value_counts 返回一个Series,其索引为唯一值,其值为频率,按计数值降序排列

In [149]:
data = DataFrame({'Qu1': [1, 3, 4, 3, 4],
                  'Qu2': [2, 3, 1, 2, 3],
                  'Qu3': [1, 5, 2, 4, 4]})
data
Out[149]:
 Qu1Qu2Qu3
0 1 2 1
1 3 3 5
2 4 1 2
3 3 2 4
4 4 3 4
In [156]:
'pd.value_counts传递给DataFrame的apply函数:'
result = data.apply(pd.value_counts).fillna(0)
result
Out[156]:
 Qu1Qu2Qu3
1 1 1 1
2 0 2 1
3 2 2 0
4 2 0 2
5 0 0 1
 

处理缺失的数据

In [157]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
Out[157]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In [158]:
string_data.isnull()
Out[158]:
0    False
1    False
2     True
3    False
dtype: bool
In [159]:
'None值被视为NaN值处理'
string_data[0] = None
string_data.isnull()
Out[159]:
0     True
1    False
2     True
3    False
dtype: bool
 

表5-12:NA处理方法

dropna 根据各标签的值中是否存在缺失数据对轴标签进行过滤,可通过阔值调节对缺失值的容忍度 fillna 用指定值或插值方法(如ffill或bfill)填充缺失数据 isnull 返回一个含有布尔值的对象,这些布尔值表示哪些值是缺失值N/A,该对象的类型与源类型一样 notnull isnull的否定式

 

滤除缺失数据

x.dropna()

In [160]:
'对于一个Series,dropna返回一个仅含非空数据和索引值的Series'
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[160]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [161]:
'也可通过布尔型索引过滤NA'
data[data.notnull()]
Out[161]:
0    1.0
2    3.5
4    7.0
dtype: float64
In [164]:
'对于一个DataFrame,dropna默认丢弃任何含有缺失值的行'
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
Out[164]:
 012
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3
In [165]:
cleaned
Out[165]:
 012
0 1 6.5 3
In [167]:
'x.dropna(how="all"): 将只丢弃全为NA的那些行'
data.dropna(how='all')
Out[167]:
 012
0 1 6.5 3
1 1 NaN NaN
3 NaN 6.5 3
In [168]:
data[4] = NA
data
Out[168]:
 0124
0 1 6.5 3 NaN
1 1 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3 NaN
In [169]:
'传入axis=1,可以指定对列进行丢弃'
data.dropna(axis=1, how='all')
Out[169]:
 012
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3
In [172]:
'另一个滤除DataFrame行的问题设计时间序列数据。假设你只想留下一部分观测数据,可以用thresh参数实现此目的'
df = DataFrame(np.random.randn(7, 3))
df.ix[:4, 1] = NA; df.ix[:2, 2] = NA
df
Out[172]:
 012
0 0.286350 NaN NaN
1 0.331286 NaN NaN
2 0.246674 NaN NaN
3 1.327195 NaN -1.549106
4 0.022185 NaN -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
In [171]:
'x.dropna(thresh=n): 剔除非参数个数<n的行或者列'
df.dropna(thresh=3)
Out[171]:
 012
5 0.332883 -2.359419 -0.199543
6 -1.541996 -0.970736 -1.307030
In [173]:
df.dropna(thresh=2)
Out[173]:
 012
3 1.327195 NaN -1.549106
4 0.022185 NaN -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
In [174]:
df.dropna(thresh=1)
Out[174]:
 012
0 0.286350 NaN NaN
1 0.331286 NaN NaN
2 0.246674 NaN NaN
3 1.327195 NaN -1.549106
4 0.022185 NaN -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
 

填充缺失数据

fillna():默认返回新对象,传入inplace会修改原有对象

In [175]:
'一般性的用法'
df.fillna(0)
Out[175]:
 012
0 0.286350 0.000000 0.000000
1 0.331286 0.000000 0.000000
2 0.246674 0.000000 0.000000
3 1.327195 0.000000 -1.549106
4 0.022185 0.000000 -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
In [176]:
'传入字典:对列(key)进行自定义填充(value)'
df.fillna({1: 0.5, 3: -1})
Out[176]:
 012
0 0.286350 0.500000 NaN
1 0.331286 0.500000 NaN
2 0.246674 0.500000 NaN
3 1.327195 0.500000 -1.549106
4 0.022185 0.500000 -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
In [177]:
'传入inplace会修改原对象'
_ = df.fillna(0, inplace=True)
df
Out[177]:
 012
0 0.286350 0.000000 0.000000
1 0.331286 0.000000 0.000000
2 0.246674 0.000000 0.000000
3 1.327195 0.000000 -1.549106
4 0.022185 0.000000 -0.660524
5 0.862580 -0.010032 0.050009
6 0.670216 0.852965 -0.955869
In [178]:
'对reindex有效的那些插值方法也可以用于fillna:'
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA; df.ix[4:, 2] = NA
df
Out[178]:
 012
0 -0.023493 -2.304234 -0.652469
1 -1.218302 -1.332610 1.074623
2 0.723642 NaN 1.001543
3 -0.503087 NaN -0.921169
4 -0.726213 NaN NaN
5 -1.157719 NaN NaN
In [179]:
df.fillna(method='ffill')
Out[179]:
 012
0 -0.023493 -2.304234 -0.652469
1 -1.218302 -1.332610 1.074623
2 0.723642 -1.332610 1.001543
3 -0.503087 -1.332610 -0.921169
4 -0.726213 -1.332610 -0.921169
5 -1.157719 -1.332610 -0.921169
In [180]:
df.fillna(method='ffill', limit=2)
Out[180]:
 012
0 -0.023493 -2.304234 -0.652469
1 -1.218302 -1.332610 1.074623
2 0.723642 -1.332610 1.001543
3 -0.503087 -1.332610 -0.921169
4 -0.726213 NaN -0.921169
5 -1.157719 NaN -0.921169
In [181]:
'利用fillna实现其他的功能:例,传入平均值'
data = Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
Out[181]:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
 

表5-13:fillna函数的参数

value 用于填充缺失值的标量值或字典对象 method 插值方式。如果函数调用时未指定其他参数的话,默认为‘ffill' axis 带填充的轴,默认axis=0 inplace 修改调用者对象而不产生副本 limit (对于前向和后向填充)可以连续填充的最大数量

 

层次化的索引

In [182]:
data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
Out[182]:
a  1    1.010737
   2    1.824875
   3   -0.997518
b  1    0.850591
   2   -0.131578
   3    0.912414
c  1    0.188211
   2    2.169461
d  2   -0.114928
   3    2.003697
dtype: float64
In [184]:
data.index
Out[184]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
In [185]:
data['b']
Out[185]:
1    0.850591
2   -0.131578
3    0.912414
dtype: float64
In [186]:
'注意DataFrame的索引值切片与Python的索引切片的区别:前者两端都包括,后者只包括前端'
data['b':'c']
Out[186]:
b  1    0.850591
   2   -0.131578
   3    0.912414
c  1    0.188211
   2    2.169461
dtype: float64
In [187]:
data.ix[['b', 'd']]
Out[187]:
b  1    0.850591
   2   -0.131578
   3    0.912414
d  2   -0.114928
   3    2.003697
dtype: float64
In [188]:
'在内层中选取:'
data[:, 2]
Out[188]:
a    1.824875
b   -0.131578
c    2.169461
d   -0.114928
dtype: float64
In [189]:
'unstack: stack(堆,这里指层次化索引数据)转化成DataFrame'
data.unstack()
Out[189]:
 123
a 1.010737 1.824875 -0.997518
b 0.850591 -0.131578 0.912414
c 0.188211 2.169461 NaN
d NaN -0.114928 2.003697
In [190]:
data.unstack().stack()
Out[190]:
a  1    1.010737
   2    1.824875
   3   -0.997518
b  1    0.850591
   2   -0.131578
   3    0.912414
c  1    0.188211
   2    2.169461
d  2   -0.114928
   3    2.003697
dtype: float64
In [191]:
'对于一个DataFrame,每条轴都有可以分层索引'
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
frame
Out[191]:
  OhioColorado
  GreenRedGreen
a1 0 1 2
2 3 4 5
b1 6 7 8
2 9 10 11
In [192]:
'可以指定各层的名称:'
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
Out[192]:
 stateOhioColorado
 colorGreenRedGreen
key1key2   
a1 0 1 2
2 3 4 5
b1 6 7 8
2 9 10 11
In [193]:
frame['Ohio']
Out[193]:
 colorGreenRed
key1key2  
a1 0 1
2 3 4
b1 6 7
2 9 10
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']], names=['state', 'color'])
 

重排分级顺序

In [194]:
'x.swaplevel():层次级别互换'
frame.swaplevel('key1', 'key2')
Out[194]:
 stateOhioColorado
 colorGreenRedGreen
key2key1   
1a 0 1 2
2a 3 4 5
1b 6 7 8
2b 9 10 11
In [195]:
'sortlevel:单层排序'
frame.sortlevel(1)
Out[195]:
 stateOhioColorado
 colorGreenRedGreen
key1key2   
a1 0 1 2
b1 6 7 8
a2 3 4 5
b2 9 10 11
In [196]:
frame.swaplevel(0, 1).sortlevel(0)
Out[196]:
 stateOhioColorado
 colorGreenRedGreen
key2key1   
1a 0 1 2
b 6 7 8
2a 3 4 5
b 9 10 11
 

根据级别汇总统计

其实利用了pandas的groupby功能

In [197]:
'sum: 对DataFrame分层求和:'
frame.sum(level='key2')
Out[197]:
stateOhioColorado
colorGreenRedGreen
key2   
1 6 8 10
2 12 14 16
In [198]:
frame.sum(level='color', axis=1)
Out[198]:
 colorGreenRed
key1key2  
a1 2 1
2 8 4
b1 14 7
2 20 10
 

使用DataFrame的列

In [199]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
frame
Out[199]:
 abcd
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3
In [200]:
'set_index:将其中一个或者多个列转换为行索引,并移除转换的列'
frame2 = frame.set_index(['c', 'd'])
frame2
Out[200]:
  ab
cd  
one0 0 7
1 1 6
2 2 5
two0 3 4
1 4 3
2 5 2
3 6 1
In [201]:
'set_index(drop= ):列转换成索引,并保留列数据数据'
frame.set_index(['c', 'd'], drop=False)
Out[201]:
  abcd
cd    
one0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
two0 3 4 two 0
1 4 3 two 1
2 5 2 two 2
3 6 1 two 3
In [202]:
'reset_index:与set_index功能相反,还原后,层次化索引会被转换到列的里面'
frame2.reset_index()
Out[202]:
 cdab
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1
 

其他pandas话题

 

整数索引

In [207]:
ser = Series(np.arange(3.))
ser
Out[207]:
0    0
1    1
2    2
dtype: float64
In [ ]:
'对于整数索引,下面会出错,因为Series的轴索引含有索引器,根据整数进行数据选取的操作将总是面向标签的。 -1 不在标签里面,所以下面会出错!'
#ser[-1] #error
In [212]:
'但是下面使用没问题,因为标签里面有 1 '
ser[1]
Out[212]:
1.0
In [215]:
'以上规则同样适用于用ix进行切片'
'注意标签切片与索引切片的不同:标签切片两端都包括,索引切片只包括端首'
ser.ix[:1]
Out[215]:
0    0
1    1
dtype: float64
In [216]:
'对于非整数索引,则不存在这样的问题,因为此时根据整数选取数据实际是在进行切片操作'
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]
Out[216]:
2.0
 

总结:loc,at,ix,iloc,iat的区别:

loc和at传递标签参数,iloc和iat传递数值参数,用来索引和切片
loc和iloc可以返回一个区域或者一个值,但是at和iat只能返回固定值
ix既可以传递标签,也可以传递数值,只是传递数值的时候要遵守本节上面说的规则
他们都是pandas获取数据的内置方法。与传统的方法(直接[])相比,更推荐这样操作

In [213]:
'如上所述:使用pandas的iloc或者iat方法会默认把传入的整数作为索引处理'
ser.iloc[-1]
Out[213]:
2.0
In [217]:
ser3 = Series(range(3), index=[-5, 1, 3])
ser3.iloc[2]
Out[217]:
2
In [221]:
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
frame
Out[221]:
 01
2 0 1
0 2 3
1 4 5
In [218]:
frame.iloc[0]
Out[218]:
0    0
1    1
Name: 2, dtype: int32
In [228]:
frame.iloc[0,:]
Out[228]:
0    0
1    1
Name: 2, dtype: int32
 

面板数据

In [ ]:
import pandas_datareader.data as web

pdata = pd.Panel(dict((stk, web.get_data_yahoo(stk))
                       for stk in ['AAPL', 'GOOG', 'MSFT', 'DELL']))
In [237]:
pdata
Out[237]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 1582 (major_axis) x 6 (minor_axis)
Items axis: AAPL to MSFT
Major_axis axis: 2010-01-04 00:00:00 to 2016-03-17 00:00:00
Minor_axis axis: Open to Adj Close
In [238]:
pdata = pdata.swapaxes('items', 'minor')
pdata['Adj Close']
Out[238]:
 AAPLDELLGOOGMSFT
Date    
2010-01-04 28.313195 14.06528 313.062468 26.227603
2010-01-05 28.362145 14.38450 311.683844 26.236076
2010-01-06 27.911008 14.10397 303.826685 26.075067
2010-01-07 27.859412 14.23940 296.753749 25.803894
2010-01-08 28.044630 14.36516 300.709808 25.981851
2010-01-11 27.797232 14.37483 300.255255 25.651358
2010-01-12 27.481038 14.56830 294.945572 25.481874
2010-01-13 27.868673 14.57797 293.252243 25.719151
2010-01-14 27.707269 14.22005 294.630868 26.236076
2010-01-15 27.244224 13.92985 289.710772 26.151335
2010-01-19 28.449462 14.32646 293.516976 26.354715
2010-01-20 28.011555 14.03626 289.915587 25.922532
2010-01-21 27.527342 13.92017 291.199286 25.431029
2010-01-22 26.162022 13.18982 274.730736 24.541239
2010-01-25 26.865850 13.43650 269.730740 24.846310
2010-01-26 27.245547 13.13662 270.939526 24.998845
2010-01-27 27.502206 13.08825 270.779695 25.142907
2010-01-28 26.365761 12.84641 266.878565 24.710723
2010-01-29 25.409244 12.47882 264.705742 23.880253
2010-02-01 25.762481 12.78837 266.244198 24.075159
2010-02-02 25.911978 12.86576 265.295156 24.117529
2010-02-03 26.357823 12.92380 270.140309 24.261591
2010-02-04 25.407921 12.58523 263.127321 23.592131
2010-02-05 25.859059 12.80772 265.380075 23.744666
2010-02-08 25.681779 12.95282 266.468996 23.490440
2010-02-09 25.955637 13.10760 267.952522 23.736192
2010-02-10 25.814078 13.30107 266.958496 23.719243
2010-02-11 26.283736 13.49454 267.932539 23.829409
2010-02-12 26.509966 13.38813 266.294170 23.668399
2010-02-16 26.909508 13.67834 270.380071 24.136659
... ... ... ... ...
2016-02-04 96.599998 NaN 708.010010 51.629307
2016-02-05 94.019997 NaN 683.570007 49.802423
2016-02-08 95.010002 NaN 682.739990 49.057770
2016-02-09 94.989998 NaN 678.109985 48.928696
2016-02-10 94.269997 NaN 684.119995 49.355631
2016-02-11 93.699997 NaN 683.109985 49.335773
2016-02-12 93.989998 NaN 682.400024 50.140000
2016-02-16 96.639999 NaN 691.000000 51.090000
2016-02-17 98.120003 NaN 708.400024 52.419998
2016-02-18 96.260002 NaN 697.349976 52.189999
2016-02-19 96.040001 NaN 700.909973 51.820000
2016-02-22 96.879997 NaN 706.460022 52.650002
2016-02-23 94.690002 NaN 695.849976 51.180000
2016-02-24 96.099998 NaN 699.559998 51.360001
2016-02-25 96.760002 NaN 705.750000 52.099998
2016-02-26 96.910004 NaN 705.070007 51.299999
2016-02-29 96.690002 NaN 697.770020 50.880001
2016-03-01 100.529999 NaN 718.809998 52.580002
2016-03-02 100.750000 NaN 718.849976 52.950001
2016-03-03 101.500000 NaN 712.419983 52.349998
2016-03-04 103.010002 NaN 710.890015 52.029999
2016-03-07 101.870003 NaN 695.159973 51.029999
2016-03-08 101.029999 NaN 693.969971 51.650002
2016-03-09 101.120003 NaN 705.239990 52.840000
2016-03-10 101.169998 NaN 712.820007 52.049999
2016-03-11 102.260002 NaN 726.820007 53.070000
2016-03-14 102.519997 NaN 730.489990 53.169998
2016-03-15 104.580002 NaN 728.330017 53.590000
2016-03-16 105.970001 NaN 736.090027 54.349998
2016-03-17 105.800003 NaN 737.780029 54.660000

1582 rows × 4 columns

In [239]:
pdata.ix[:, '6/1/2012', :]
Out[239]:
 OpenHighLowCloseVolumeAdj Close
AAPL 569.159996 572.650009 560.520012 560.989983 130246900 74.218116
DELL 12.150000 12.300000 12.045000 12.070000 19397600 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300 25.598227
In [240]:
'这里不用.ix也是一样的,两者实际中的性能一样。'
pdata[:, '6/1/2012', :]
Out[240]:
 OpenHighLowCloseVolumeAdj Close
AAPL 569.159996 572.650009 560.520012 560.989983 130246900 74.218116
DELL 12.150000 12.300000 12.045000 12.070000 19397600 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300 25.598227
In [243]:
%timeit pdata[:, '6/1/2012', :]
 
1000 loops, best of 3: 532 µs per loop
In [244]:
%timeit pdata.ix[:, '6/1/2012', :]
 
1000 loops, best of 3: 518 µs per loop
In [245]:
pdata.ix['Adj Close', '5/22/2012':, :]
Out[245]:
 AAPLDELLGOOGMSFT
Date    
2012-05-22 73.686282 14.58765 300.100412 26.776915
2012-05-23 75.484211 12.08221 304.426106 26.192070
2012-05-24 74.790973 12.04351 301.528978 26.156079
2012-05-25 74.390105 12.05319 295.470050 26.147081
2012-05-28 NaN 12.05319 NaN NaN
2012-05-29 75.710442 12.24666 296.873645 26.596962
2012-05-30 76.623304 12.14992 293.821674 26.399015
2012-05-31 76.432797 11.92743 290.140354 26.264051
2012-06-01 74.218116 11.67592 285.205295 25.598227
2012-06-04 74.654700 11.60821 289.006480 25.688202
2012-06-05 74.461551 11.76298 284.920579 25.652212
2012-06-06 75.603286 11.81619 289.995487 26.408013
2012-06-07 75.637681 11.73396 288.826666 26.300040
2012-06-08 76.775446 11.72429 289.935570 26.677940
2012-06-11 75.564914 11.47278 283.966519 26.003119
2012-06-12 76.225086 11.57919 282.268201 26.354027
2012-06-13 75.695894 11.87423 280.265216 26.210064
2012-06-14 75.612542 11.93711 279.246220 26.399015
2012-06-15 75.956519 11.89841 281.973510 27.010853
2012-06-18 77.497794 12.01449 285.140358 26.848896
2012-06-19 77.713448 11.78233 290.475012 27.622691
2012-06-20 77.492502 11.89841 288.467038 27.829636
2012-06-21 76.424856 11.60821 282.323161 27.118824
2012-06-22 77.010939 11.80168 285.455033 27.622691
2012-06-25 75.512000 11.55500 280.070407 26.875889
2012-06-26 75.678696 11.53565 282.058429 27.010853
2012-06-27 76.005469 11.92743 284.366112 27.145817
2012-06-28 75.284441 11.55984 281.873596 26.911879
2012-06-29 77.262308 12.10155 289.745749 27.523717
2012-07-02 78.389489 11.98064 289.945546 27.496724
... ... ... ... ...
2016-02-04 96.599998 NaN 708.010010 51.629307
2016-02-05 94.019997 NaN 683.570007 49.802423
2016-02-08 95.010002 NaN 682.739990 49.057770
2016-02-09 94.989998 NaN 678.109985 48.928696
2016-02-10 94.269997 NaN 684.119995 49.355631
2016-02-11 93.699997 NaN 683.109985 49.335773
2016-02-12 93.989998 NaN 682.400024 50.140000
2016-02-16 96.639999 NaN 691.000000 51.090000
2016-02-17 98.120003 NaN 708.400024 52.419998
2016-02-18 96.260002 NaN 697.349976 52.189999
2016-02-19 96.040001 NaN 700.909973 51.820000
2016-02-22 96.879997 NaN 706.460022 52.650002
2016-02-23 94.690002 NaN 695.849976 51.180000
2016-02-24 96.099998 NaN 699.559998 51.360001
2016-02-25 96.760002 NaN 705.750000 52.099998
2016-02-26 96.910004 NaN 705.070007 51.299999
2016-02-29 96.690002 NaN 697.770020 50.880001
2016-03-01 100.529999 NaN 718.809998 52.580002
2016-03-02 100.750000 NaN 718.849976 52.950001
2016-03-03 101.500000 NaN 712.419983 52.349998
2016-03-04 103.010002 NaN 710.890015 52.029999
2016-03-07 101.870003 NaN 695.159973 51.029999
2016-03-08 101.029999 NaN 693.969971 51.650002
2016-03-09 101.120003 NaN 705.239990 52.840000
2016-03-10 101.169998 NaN 712.820007 52.049999
2016-03-11 102.260002 NaN 726.820007 53.070000
2016-03-14 102.519997 NaN 730.489990 53.169998
2016-03-15 104.580002 NaN 728.330017 53.590000
2016-03-16 105.970001 NaN 736.090027 54.349998
2016-03-17 105.800003 NaN 737.780029 54.660000

975 rows × 4 columns

In [246]:
'to_frame: 层叠式堆积DataFrame来显示部分Panel'
stacked = pdata.ix[:, '5/30/2012':, :].to_frame()
stacked
Out[246]:
  OpenHighLowCloseVolumeAdj Close
Dateminor      
2012-05-30AAPL 569.199997 579.989990 566.559990 579.169998 132357400 76.623304
DELL 12.590000 12.700000 12.460000 12.560000 19787800 12.149920
GOOG 588.161028 591.901014 583.530999 588.230992 3827600 293.821674
MSFT 29.350000 29.480000 29.120001 29.340000 41585500 26.399015
2012-05-31AAPL 580.740021 581.499985 571.460022 577.730019 122918600 76.432797
DELL 12.530000 12.540000 12.330000 12.330000 19955600 11.927430
GOOG 588.720982 590.001032 579.001013 580.860990 5958800 290.140354
MSFT 29.299999 29.420000 28.940001 29.190001 39134000 26.264051
2012-06-01AAPL 569.159996 572.650009 560.520012 560.989983 130246900 74.218116
DELL 12.150000 12.300000 12.045000 12.070000 19397600 11.675920
GOOG 571.790972 572.650996 568.350996 570.981000 6138700 285.205295
MSFT 28.760000 28.959999 28.440001 28.450001 56634300 25.598227
2012-06-04AAPL 561.500008 567.499985 548.499977 564.289978 139248900 74.654700
DELL 12.110000 12.112500 11.800000 12.000000 17015700 11.608210
GOOG 570.220958 580.491016 570.011006 578.590973 4883500 289.006480
MSFT 28.620001 28.780001 28.320000 28.549999 47926300 25.688202
2012-06-05AAPL 561.269989 566.470001 558.330002 562.830025 97053600 74.461551
DELL 11.950000 12.240000 11.950000 12.160000 15620900 11.762980
GOOG 575.451008 578.131003 566.470986 570.410999 4697200 284.920579
MSFT 28.510000 28.750000 28.389999 28.510000 45715400 25.652212
2012-06-06AAPL 567.770004 573.849983 565.499992 571.460022 100363900 75.603286
DELL 12.210000 12.280000 12.090000 12.215000 20779900 11.816190
GOOG 576.480979 581.970971 573.611004 580.570966 4207200 289.995487
MSFT 28.879999 29.370001 28.809999 29.350000 46860500 26.408013
2012-06-07AAPL 577.290009 577.320023 570.500000 571.720001 94941700 75.637681
DELL 12.320000 12.410000 12.120000 12.130000 20074000 11.733960
GOOG 587.601014 587.891038 577.251006 578.230986 3530100 288.826666
MSFT 29.639999 29.700001 29.170000 29.230000 37792800 26.300040
2012-06-08AAPL 571.599998 580.580017 568.999992 580.319984 86879100 76.775446
DELL 12.130000 12.225000 12.020000 12.120000 18155600 11.724290
...... ... ... ... ... ... ...
2016-03-04AAPL 102.370003 103.750000 101.370003 103.010002 45936500 103.010002
GOOG 714.989990 716.489990 706.020020 710.890015 1967900 710.890015
MSFT 52.400002 52.450001 51.709999 52.029999 32280400 52.029999
2016-03-07AAPL 102.389999 102.830002 100.959999 101.870003 35828900 101.870003
GOOG 706.900024 708.091003 686.900024 695.159973 2985100 695.159973
MSFT 51.560001 51.799999 50.580002 51.029999 38407800 51.029999
2016-03-08AAPL 100.779999 101.760002 100.400002 101.029999 31274200 101.029999
GOOG 688.590027 703.789978 685.340027 693.969971 2063400 693.969971
MSFT 50.799999 52.130001 50.599998 51.650002 33352200 51.650002
2016-03-09AAPL 101.309998 101.580002 100.269997 101.120003 27130700 101.120003
GOOG 698.469971 705.679993 694.000000 705.239990 1418700 705.239990
MSFT 51.889999 52.849998 51.860001 52.840000 28145200 52.840000
2016-03-10AAPL 101.410004 102.239998 100.150002 101.169998 33470400 101.169998
GOOG 708.119995 716.440002 703.359985 712.820007 2829400 712.820007
MSFT 52.930000 52.939999 51.160000 52.049999 38384200 52.049999
2016-03-11AAPL 102.239998 102.279999 101.500000 102.260002 27200800 102.260002
GOOG 720.000000 726.919983 717.125000 726.820007 1963900 726.820007
MSFT 53.000000 53.070000 52.380001 53.070000 32174300 53.070000
2016-03-14AAPL 101.910004 102.910004 101.779999 102.519997 25027400 102.519997
GOOG 726.809998 735.500000 725.150024 730.489990 1716900 730.489990
MSFT 52.709999 53.590000 52.630001 53.169998 23751100 53.169998
2016-03-15AAPL 103.959999 105.180000 103.849998 104.580002 39982500 104.580002
GOOG 726.919983 732.289978 724.770020 728.330017 1720100 728.330017
MSFT 52.750000 53.590000 52.740002 53.590000 20967600 53.590000
2016-03-16AAPL 104.610001 106.309998 104.589996 105.970001 37893800 105.970001
GOOG 726.369995 737.469971 724.510010 736.090027 1572300 736.090027
MSFT 53.450001 54.599998 53.400002 54.349998 31297900 54.349998
2016-03-17AAPL 105.519997 106.470001 104.959999 105.800003 34244600 105.800003
GOOG 736.450012 743.070007 736.000000 737.780029 1856800 737.780029
MSFT 54.209999 55.000000 54.000000 54.660000 28186500 54.660000

3238 rows × 6 columns

In [247]:
'to_panel: to_frame的逆运算'
stacked.to_panel()
Out[247]:
<class 'pandas.core.panel.Panel'>
Dimensions: 6 (items) x 969 (major_axis) x 4 (minor_axis)
Items axis: Open to Adj Close
Major_axis axis: 2012-05-30 00:00:00 to 2016-03-17 00:00:00
Minor_axis axis: AAPL to MSFT
In [ ]:
 
posted @ 2017-04-02 17:56  she35  阅读(159)  评论(0)    收藏  举报