pandas in 30 minutes
Reference: https://pandas.pydata.org/docs/user_guide/10min.html
Object creation
Create a Series object (the session below assumes the usual imports, import numpy as np and import pandas as pd):
In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8])

In [151]: s
Out[151]:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
Create a DataFrame from a NumPy array:
In [152]: dates = pd.date_range('20130101', periods=6)
In [153]: dates
Out[153]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD'))
In [155]: df
Out[155]:
A B C D
2013-01-01 0.911951 0.119077 2.244598 -1.524168
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
2013-01-04 1.222247 -0.651681 1.764630 0.500507
2013-01-05 -0.332192 -1.424871 -0.680254 0.370109
2013-01-06 -0.724522 -1.238991 0.223425 -1.775954
Create a DataFrame from a dict of objects:
In [179]: df2 = pd.DataFrame(
    ...:     {
    ...:         "A": 1.0,
    ...:         "B": pd.Timestamp("20130102"),
    ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    ...:         "D": np.array([3] * 4, dtype="int32"),
    ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
    ...:         "F": "foo",
    ...:     }
    ...: )
In [180]: df2
Out[180]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [181]: df2.dtypes
Out[181]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
The dtypes attribute shows the data type of each column.
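As a minimal sketch (with made-up column names x, y, z): dtypes is a per-column mapping, and astype returns a converted copy you can assign back:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each column keeps its own dtype.
df_ex = pd.DataFrame({"x": [1, 2, 3], "y": [1.5, 2.5, 3.5], "z": ["a", "b", "c"]})

# dtypes is a Series mapping column name -> dtype (x: int64, y: float64, z: object)
col_types = df_ex.dtypes

# astype converts a single column; assign the result back to change the frame
df_ex["x"] = df_ex["x"].astype("float32")
```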
Viewing data
head and tail show the first and last rows of the data; index and columns expose the row and column labels.
In [183]: df.head()
Out[183]:
A B C D
2013-01-01 0.911951 0.119077 2.244598 -1.524168
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
2013-01-04 1.222247 -0.651681 1.764630 0.500507
2013-01-05 -0.332192 -1.424871 -0.680254 0.370109
In [184]: df.tail(2)
Out[184]:
A B C D
2013-01-05 -0.332192 -1.424871 -0.680254 0.370109
2013-01-06 -0.724522 -1.238991 0.223425 -1.775954
In [185]: df.head(1)
Out[185]:
A B C D
2013-01-01 0.911951 0.119077 2.244598 -1.524168
In [186]: df.index
Out[186]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [187]: df.columns
Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives you a NumPy array of the underlying data. Note that this can be an expensive operation: pandas stores one dtype per column, while a NumPy array has a single dtype for the whole array, so a mixed-dtype frame must be upcast to a common dtype.
In [194]: df.to_numpy()
Out[194]:
array([[ 0.9119509 , 0.11907694, 2.24459767, -1.52416844],
[-0.71159066, 1.81432742, 0.85962346, -0.24911614],
[-0.0414173 , -1.15847237, 1.03775241, 1.12435552],
[ 1.22224697, -0.65168145, 1.76462966, 0.50050719],
[-0.33219183, -1.42487132, -0.68025439, 0.37010889],
[-0.72452176, -1.23899146, 0.22342519, -1.77595409]])
In [195]: df2.to_numpy()
Out[195]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Both calls above return NumPy arrays, but the second is far more expensive: df2 has mixed dtypes, so every value must be boxed into a Python object (dtype=object).
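The cost difference shows up directly in the dtype of the resulting array; a sketch with small made-up frames:

```python
import numpy as np
import pandas as pd

# Homogeneous float columns -> a fast, native float64 array.
df_num = pd.DataFrame(np.random.randn(3, 2), columns=["A", "B"])
arr_num = df_num.to_numpy()

# Mixed dtypes -> everything is upcast to Python objects (dtype=object),
# which copies and boxes every value; this is the expensive case.
df_mix = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})
arr_mix = df_mix.to_numpy()
```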
describe() shows a quick statistical summary of your data: count, mean, standard deviation, min, max and quartiles.
In [202]: df.describe()
Out[202]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.054079 -0.423435 0.908296 -0.259045
std 0.830825 1.229820 1.051730 1.165178
min -0.724522 -1.424871 -0.680254 -1.775954
25% -0.616741 -1.218862 0.382475 -1.205405
50% -0.186805 -0.905077 0.948688 0.060496
75% 0.673609 -0.073613 1.582910 0.467908
max 1.222247 1.814327 2.244598 1.124356
T transposes the data, swapping the row and column labels.
In [204]: df.T
Out[204]:
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.911951   -0.711591   -0.041417    1.222247   -0.332192   -0.724522
B    0.119077    1.814327   -1.158472   -0.651681   -1.424871   -1.238991
C    2.244598    0.859623    1.037752    1.764630   -0.680254    0.223425
D   -1.524168   -0.249116    1.124356    0.500507    0.370109   -1.775954
Sorting by index
sort_index sorts by the index labels; ascending controls the direction, and axis=1 sorts the column labels instead.
In [207]: df.sort_index(ascending=False)
Out[207]:
A B C D
2013-01-06 -0.724522 -1.238991 0.223425 -1.775954
2013-01-05 -0.332192 -1.424871 -0.680254 0.370109
2013-01-04 1.222247 -0.651681 1.764630 0.500507
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-01 0.911951 0.119077 2.244598 -1.524168
In [208]: df.sort_index?
In [209]: df.sort_index(axis=1, ascending=False)
Out[209]:
D C B A
2013-01-01 -1.524168 2.244598 0.119077 0.911951
2013-01-02 -0.249116 0.859623 1.814327 -0.711591
2013-01-03 1.124356 1.037752 -1.158472 -0.041417
2013-01-04 0.500507 1.764630 -0.651681 1.222247
2013-01-05 0.370109 -0.680254 -1.424871 -0.332192
2013-01-06 -1.775954 0.223425 -1.238991 -0.724522
Finally, you can also sort by the values of a particular column:
In [212]: df.sort_values(by='B')
Out[212]:
A B C D
2013-01-05 -0.332192 -1.424871 -0.680254 0.370109
2013-01-06 -0.724522 -1.238991 0.223425 -1.775954
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
2013-01-04 1.222247 -0.651681 1.764630 0.500507
2013-01-01 0.911951 0.119077 2.244598 -1.524168
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
Selection
pandas recommends the optimized access methods .at, .iat, .loc and .iloc for getting and setting values.
Selecting a single column returns a Series:
In [226]: df['A']
Out[226]:
2013-01-01    0.911951
2013-01-02   -0.711591
2013-01-03   -0.041417
2013-01-04    1.222247
2013-01-05   -0.332192
2013-01-06   -0.724522
Freq: D, Name: A, dtype: float64
Slicing with [] selects rows and returns a (sliced) DataFrame. Note the difference below: a positional slice excludes its end point, while a label slice includes both endpoints.
In [229]: df[0:3]
Out[229]:
A B C D
2013-01-01 0.911951 0.119077 2.244598 -1.524168
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
In [230]: df["20130102":'20130103']
Out[230]:
A B C D
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
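The two slicing rules can be checked directly; a small sketch with deterministic made-up data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df_s = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Positional slice: the end is excluded, so this keeps 3 rows (like Python lists).
by_pos = df_s[0:3]

# Label slice: BOTH endpoints are included, so this keeps 2 rows.
by_label = df_s["20130102":"20130103"]
```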
Selection by label (some Chinese books call this explicit indexing) uses .loc. Note that .loc is an indexer rather than a method: follow it with [] instead of ().
In [238]: df.loc[dates[0]]
Out[238]:
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64
Selecting several columns (all rows) by label:
In [239]: df.loc[:,['A','B']]
Out[239]:
A B
2013-01-01 0.911951 0.119077
2013-01-02 -0.711591 1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04 1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
2013-01-06 -0.724522 -1.238991
Selecting rows with a label slice together with specific columns:
In [240]: df.loc["20130102":"20130104",["A","B"]]
Out[240]:
A B
2013-01-02 -0.711591 1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04 1.222247 -0.651681
Selecting a single row for the given columns (the result reduces to a Series):
In [241]: df.loc["20130102",["A","B"]]
Out[241]:
A   -0.711591
B    1.814327
Name: 2013-01-02 00:00:00, dtype: float64
Finally, two ways to get a single scalar value: loc, and the faster at.
In [242]: df.loc[dates[0],"A"]
Out[242]: 0.91195089904327

In [243]: df.at[dates[0],"A"]
Out[243]: 0.91195089904327
Selection by position, i.e. implicit positional indexing, uses iloc.
Its usage is similar to loc: it accepts a single integer, lists of integers, and slices.
A single integer returns that row as a Series:
In [244]: df.iloc[0]
Out[244]:
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

In [245]: df.iloc[3]
Out[245]:
A    1.222247
B   -0.651681
C    1.764630
D    0.500507
Name: 2013-01-04 00:00:00, dtype: float64
Integer slices behave like native Python slicing: the start is included, the end excluded:
In [246]: df.iloc[3:5,0:2]
Out[246]:
A B
2013-01-04 1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
You can also pass lists of integer positions, separated by a comma, for the rows and columns:
In [247]: df.iloc[[1,2,4],[0,2]]
Out[247]:
A C
2013-01-02 -0.711591 0.859623
2013-01-03 -0.041417 1.037752
2013-01-05 -0.332192 -0.680254
A bare : selects an entire axis:
In [248]: df.iloc[1:3]
Out[248]:
A B C D
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
In [249]: df.iloc[1:3,:]
Out[249]:
A B C D
2013-01-02 -0.711591 1.814327 0.859623 -0.249116
2013-01-03 -0.041417 -1.158472 1.037752 1.124356
In [250]: df.iloc[:,1:3]
Out[250]:
B C
2013-01-01 0.119077 2.244598
2013-01-02 1.814327 0.859623
2013-01-03 -1.158472 1.037752
2013-01-04 -0.651681 1.764630
2013-01-05 -1.424871 -0.680254
2013-01-06 -1.238991 0.223425
To get a single scalar, pass two integer coordinates; iat is the fast equivalent:
In [251]: df.iloc[1,1]
Out[251]: 1.8143274155708045

In [252]: df.iat[1,1]
Out[252]: 1.8143274155708045
Boolean indexing (some books call this masking).
Using a single column's values to select rows:
In [8]: df[df['A'] > 0]
Out[8]:
A B C D
2013-01-02 0.314732 1.187635 -0.290305 -1.091247
2013-01-03 0.535155 0.453357 0.308055 -0.533740
2013-01-04 0.688215 -1.474558 -2.561580 -1.211188
2013-01-06 1.104539 -1.272097 1.423444 0.417609
Selecting values from the whole DataFrame where a boolean condition is met (non-matching cells become NaN):
In [9]: df[df>0]
Out[9]:
A B C D
2013-01-01 NaN 0.954575 NaN 0.070635
2013-01-02 0.314732 1.187635 NaN NaN
2013-01-03 0.535155 0.453357 0.308055 NaN
2013-01-04 0.688215 NaN NaN NaN
2013-01-05 NaN NaN 1.522803 NaN
2013-01-06 1.104539 NaN 1.423444 0.417609
Using the isin() method for filtering:
In [14]: df2
Out[14]:
A B C D E
2013-01-01 -1.629264 0.954575 -0.102982 0.070635 one
2013-01-02 0.314732 1.187635 -0.290305 -1.091247 one
2013-01-03 0.535155 0.453357 0.308055 -0.533740 two
2013-01-04 0.688215 -1.474558 -2.561580 -1.211188 three
2013-01-05 -1.579950 -0.317184 1.522803 -0.702557 four
2013-01-06 1.104539 -1.272097 1.423444 0.417609 three
In [15]: df2[df2['E'].isin(['two','four'])]
Out[15]:
A B C D E
2013-01-03 0.535155 0.453357 0.308055 -0.533740 two
2013-01-05 -1.579950 -0.317184 1.522803 -0.702557 four
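isin returns a boolean mask, so it composes with other conditions via the element-wise operators & and | (not Python's `and`/`or`); a sketch with a hypothetical frame df_e:

```python
import pandas as pd

df_e = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"],
                     "v": [1, 2, 3, 4, 5, 6]})

# isin builds a boolean mask; combine masks with & and |, parenthesizing comparisons.
mask = df_e["E"].isin(["two", "four"]) & (df_e["v"] > 3)
picked = df_e[mask]  # only the row where E == "four" and v == 5 survives
```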
Setting
Setting a new column automatically aligns the data by index.
In [16]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))
In [17]: s1
Out[17]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [18]: df['F'] = s1
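"Aligns by index" means each value lands on the row with the matching label, not the matching position; labels the Series lacks become NaN. A sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)
df_a = pd.DataFrame({"A": [10, 20, 30, 40]}, index=dates)

# s_new starts one day later, so the first row of df_a has no matching label.
s_new = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))
df_a["F"] = s_new  # aligned by index label, not by position

first_f = df_a["F"].iloc[0]  # no label match for 2013-01-01 -> NaN
```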
Setting values by label (at) and by position (iat):
In [19]: df.at[dates[0],"A"] = 0

In [20]: df.iat[0,1] = 0
Setting by assigning a NumPy array:
In [26]: df.loc[:,"D"] = np.array([5] * len(df))
In [27]: df
Out[27]:
A B C D F
2013-01-01 0.000000 0.000000 -0.102982 5 NaN
2013-01-02 0.314732 1.187635 -0.290305 5 1.0
2013-01-03 0.535155 0.453357 0.308055 5 2.0
2013-01-04 0.688215 -1.474558 -2.561580 5 3.0
2013-01-05 -1.579950 -0.317184 1.522803 5 4.0
2013-01-06 1.104539 -1.272097 1.423444 5 5.0
A where operation with setting (every positive value is flipped negative):
In [30]: df2
Out[30]:
A B C D F
2013-01-01 0.000000 0.000000 -0.102982 5 NaN
2013-01-02 0.314732 1.187635 -0.290305 5 1.0
2013-01-03 0.535155 0.453357 0.308055 5 2.0
2013-01-04 0.688215 -1.474558 -2.561580 5 3.0
2013-01-05 -1.579950 -0.317184 1.522803 5 4.0
2013-01-06 1.104539 -1.272097 1.423444 5 5.0
In [31]: df2[df2>0] = -df2
In [32]: df2
Out[32]:
A B C D F
2013-01-01 0.000000 0.000000 -0.102982 -5 NaN
2013-01-02 -0.314732 -1.187635 -0.290305 -5 -1.0
2013-01-03 -0.535155 -0.453357 -0.308055 -5 -2.0
2013-01-04 -0.688215 -1.474558 -2.561580 -5 -3.0
2013-01-05 -1.579950 -0.317184 -1.522803 -5 -4.0
2013-01-06 -1.104539 -1.272097 -1.423444 -5 -5.0
Missing data
pandas primarily uses np.nan to represent missing data; by default it is excluded from computations.
Reindexing lets you change, add, or delete the index on a specified axis, returning a copy of the data:
In [17]: df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ["E"])
In [18]: df1
Out[18]:
A B C D F E
2013-01-01 0.000000 0.000000 0.009247 5 NaN NaN
2013-01-02 1.012606 0.986936 1.007574 5 1.0 NaN
2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 NaN
2013-01-04 1.957053 1.212591 0.639196 5 3.0 NaN
In [19]: df1.loc[dates[0]:dates[1],'E'] = 1
In [20]: df1
Out[20]:
A B C D F E
2013-01-01 0.000000 0.000000 0.009247 5 NaN 1.0
2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0
2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 NaN
2013-01-04 1.957053 1.212591 0.639196 5 3.0 NaN
Drop any rows that contain missing data:
In [21]: df1.dropna(how='any')
Out[21]:
A B C D F E
2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0
Fill in missing data:
In [22]: df1.fillna(value=5)
Out[22]:
A B C D F E
2013-01-01 0.000000 0.000000 0.009247 5 5.0 1.0
2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0
2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 5.0
2013-01-04 1.957053 1.212591 0.639196 5 3.0 5.0
Get a boolean mask of which values are nan:
In [23]: pd.isna(df1)
Out[23]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
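Besides filling with a constant, gaps can be filled by carrying the last valid observation forward with ffill(); a small sketch:

```python
import numpy as np
import pandas as pd

s_gap = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s_gap.fillna(0.0)  # replace every NaN with a constant
filled_ffill = s_gap.ffill()      # carry the last valid value forward
```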
Operations
Stats
Operations in general exclude missing data. Performing a descriptive statistic:
In [12]: df
Out[12]:
A B C D F
2013-01-01 0.000000 0.000000 0.913359 5 NaN
2013-01-02 -1.970074 0.380875 -0.648481 5 1.0
2013-01-03 0.281023 0.861409 0.084303 5 2.0
2013-01-04 -0.795286 1.961307 1.569313 5 3.0
2013-01-05 0.974005 0.648449 -1.928545 5 4.0
2013-01-06 0.000812 -0.274185 -0.469916 5 5.0
In [13]: df.mean()
Out[13]:
A -0.251587
B 0.596309
C -0.079994
D 5.000000
F 3.000000
dtype: float64
In [14]: df.mean(1)
Out[14]:
2013-01-01 1.478340
2013-01-02 0.752464
2013-01-03 1.645347
2013-01-04 2.147067
2013-01-05 1.738782
2013-01-06 1.851342
Freq: D, dtype: float64
The same operation on the other axis: mean(1) averages across each row.
Operating with objects of different dimensionality that need alignment: pandas automatically broadcasts along the specified dimension.
In [27]: s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)
In [28]: s
Out[28]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [29]: df.sub?
In [30]: df.sub(s,axis='index')
Out[30]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -0.718977 -0.138591 -0.915697 4.0 1.0
2013-01-04 -3.795286 -1.038693 -1.430687 2.0 0.0
2013-01-05 -4.025995 -4.351551 -6.928545 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
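A deterministic sketch of the same sub/alignment behavior (hypothetical frame df_b): the Series is aligned with the row labels and subtracted from every column, and rows where it is NaN come out NaN.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("20130101", periods=3)
df_b = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]}, index=idx)
s_row = pd.Series([1.0, np.nan, 2.0], index=idx)

# axis='index' broadcasts s_row down the rows: each row's value is
# subtracted from every column of that row.
diff = df_b.sub(s_row, axis="index")
```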
Apply
Applying functions to the data:
In [31]: df
Out[31]:
A B C D F
2013-01-01 0.000000 0.000000 0.913359 5 NaN
2013-01-02 -1.970074 0.380875 -0.648481 5 1.0
2013-01-03 0.281023 0.861409 0.084303 5 2.0
2013-01-04 -0.795286 1.961307 1.569313 5 3.0
2013-01-05 0.974005 0.648449 -1.928545 5 4.0
2013-01-06 0.000812 -0.274185 -0.469916 5 5.0
In [32]: df.apply(np.cumsum)
Out[32]:
A B C D F
2013-01-01 0.000000 0.000000 0.913359 5 NaN
2013-01-02 -1.970074 0.380875 0.264878 10 1.0
2013-01-03 -1.689051 1.242284 0.349182 15 3.0
2013-01-04 -2.484337 3.203590 1.918495 20 6.0
2013-01-05 -1.510332 3.852039 -0.010051 25 10.0
2013-01-06 -1.509519 3.577854 -0.479966 30 15.0
In [33]: df.apply?
In [34]: df.apply(lambda x: x.max() - x.min())
Out[34]:
A 2.944080
B 2.235492
C 3.497859
D 0.000000
F 4.000000
dtype: float64
By default, apply operates on each column, passing the column to the function you supply (above, np.cumsum and a lambda computing each column's range).
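To operate on rows instead, pass axis=1; a sketch contrasting the two axes with made-up data:

```python
import pandas as pd

df_c = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Default (axis=0): the function receives each COLUMN as a Series.
col_range = df_c.apply(lambda col: col.max() - col.min())  # A -> 2, B -> 20

# axis=1: the function receives each ROW instead.
row_sum = df_c.apply(lambda row: row.sum(), axis=1)  # 11, 22, 33
```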
Histogramming
See more at Histogramming and Discretization.
In [35]: s = pd.Series(np.random.randint(0,7,size=10))

In [36]: s
Out[36]:
0    1
1    0
2    1
3    3
4    4
5    0
6    0
7    5
8    0
9    1
dtype: int64

In [37]: s.value_counts()
Out[37]:
0    4
1    3
5    1
4    1
3    1
dtype: int64

In [38]: s.value_counts().__class__
Out[38]: pandas.core.series.Series

In [39]: s.value_counts().index
Out[39]: Int64Index([0, 1, 5, 4, 3], dtype='int64')
In the result, the index holds the distinct values and the values are their counts.
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
In [40]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

In [41]: s.str.lower()
Out[41]:
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
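A sketch of another str method: contains treats its pattern as a regular expression by default, and NaN simply passes through:

```python
import numpy as np
import pandas as pd

s_str = pd.Series(["Aaba", "Baca", np.nan, "CABA", "dog"])

lowered = s_str.str.lower()                    # NaN passes through untouched
has_ab = s_str.str.contains("ab", case=False)  # pattern is a regex by default
```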
Merge
Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [6]: df = pd.DataFrame(np.random.randn(10, 4),index=[chr(i) for i in range(65,75)])
In [7]: df
Out[7]:
0 1 2 3
A -0.352374 1.278078 0.543435 0.085902
B -0.572048 -0.053377 1.278151 -0.143507
C 0.058471 -0.678905 -0.928697 -0.275203
D -0.338097 2.616637 -1.541948 0.127263
E -0.460966 -1.571678 -1.958119 -0.533291
F -0.832041 -0.103917 1.635657 0.887229
G -0.171535 -2.181684 -1.682033 0.462375
H -1.001720 0.925746 -0.205054 -1.493562
I -0.213802 -1.214741 -0.864377 0.735728
J -0.375842 0.406692 0.207853 0.626410
In [8]: prices = [df[:3],df[3:7],df[7:]]
In [9]: prices
Out[9]:
[ 0 1 2 3
A -0.352374 1.278078 0.543435 0.085902
B -0.572048 -0.053377 1.278151 -0.143507
C 0.058471 -0.678905 -0.928697 -0.275203,
0 1 2 3
D -0.338097 2.616637 -1.541948 0.127263
E -0.460966 -1.571678 -1.958119 -0.533291
F -0.832041 -0.103917 1.635657 0.887229
G -0.171535 -2.181684 -1.682033 0.462375,
0 1 2 3
H -1.001720 0.925746 -0.205054 -1.493562
I -0.213802 -1.214741 -0.864377 0.735728
J -0.375842 0.406692 0.207853 0.626410]
In [10]: pd.concat(prices)
Out[10]:
0 1 2 3
A -0.352374 1.278078 0.543435 0.085902
B -0.572048 -0.053377 1.278151 -0.143507
C 0.058471 -0.678905 -0.928697 -0.275203
D -0.338097 2.616637 -1.541948 0.127263
E -0.460966 -1.571678 -1.958119 -0.533291
F -0.832041 -0.103917 1.635657 0.887229
G -0.171535 -2.181684 -1.682033 0.462375
H -1.001720 0.925746 -0.205054 -1.493562
I -0.213802 -1.214741 -0.864377 0.735728
J -0.375842 0.406692 0.207853 0.626410
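concat can also label each piece with keys=, which builds a MultiIndex on the rows; a sketch with a small made-up frame:

```python
import numpy as np
import pandas as pd

df_k = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["x", "y"])
pieces = [df_k[:2], df_k[2:]]

# Plain concat restores the original frame...
whole = pd.concat(pieces)

# ...and keys= labels each piece, producing a MultiIndex on the rows,
# so each piece can be recovered with .loc.
labeled = pd.concat(pieces, keys=["first", "second"])
```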
Join
SQL style merges. See the Database style joining section.
In [21]: left
Out[21]:
   key  lval
0  foo     1
1  foo     2

In [22]: right
Out[22]:
   key  rval
0  foo     4
1  foo     5

In [23]: pd.merge(left, right, on="key")
Out[23]:
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
This works like an inner join in SQL; because both frames repeat the key 'foo', the result is the Cartesian product of the matching rows.
Another example, this time with unique keys:
In [24]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

In [25]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
In [26]: pd.merge(left, right, on="key")
Out[26]:
key lval rval
0 foo 1 4
1 bar 2 5
In [27]: left
Out[27]:
key lval
0 foo 1
1 bar 2
In [28]: right
Out[28]:
key rval
0 foo 4
1 bar 5
From my own testing: merge needs a shared column name and, by default (an inner join), keeps only the key values present in both frames.
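That "intersection" behavior is the default how="inner"; other how values keep more keys. A sketch with hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

inner = pd.merge(left, right, on="key", how="inner")  # intersection: foo only
outer = pd.merge(left, right, on="key", how="outer")  # union: foo, bar, baz
```

With how="outer", keys missing on one side get NaN in that side's value columns; how="left" and how="right" keep exactly one frame's keys.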
Grouping
By “group by” we are referring to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
See the Grouping section.
In [33]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    ...:         "C": np.random.randn(8),
    ...:         "D": np.random.randn(8),
    ...:     }
    ...: )
In [34]: df
Out[34]:
A B C D
0 foo one 0.964528 -0.464514
1 bar one 2.223520 1.743989
2 foo two -0.311332 0.242599
3 bar three -1.119041 -1.223514
4 foo two -0.879870 -1.140529
5 bar two -1.158119 -0.577413
6 foo one 0.349084 -0.770265
7 foo three -0.063142 -2.306210
In [35]: df.groupby('A')
Out[35]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87e4b64d60>
In [36]: df.groupby('A').count()
Out[36]:
B C D
A
bar 3 3 3
foo 5 5 5
In [37]: df.groupby('A').sum()
Out[37]:
C D
A
bar -0.053639 -0.056937
foo 0.059268 -4.438920
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
In [38]: df.groupby(["A", "B"]).sum()
Out[38]:
C D
A B
bar one 2.223520 1.743989
three -1.119041 -1.223514
two -1.158119 -0.577413
foo one 1.313612 -1.234779
three -0.063142 -2.306210
two -1.191203 -0.897931
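Beyond count() and sum(), agg applies several reductions per group at once; a sketch with a smaller hypothetical frame:

```python
import pandas as pd

df_g = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                     "C": [1.0, 2.0, 3.0, 4.0]})

# agg lets each group be reduced by several functions at once;
# the result has one column per function.
summary = df_g.groupby("A")["C"].agg(["sum", "mean"])
```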
Reshaping
See the sections on Hierarchical Indexing and Reshaping.
In [49]: tuples
Out[49]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [50]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
In [52]: df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])
In [53]: df
Out[53]:
A B
first second
bar one 1.468431 1.239477
two 1.134492 0.163859
baz one -0.646760 -1.188458
two 1.389973 -1.437824
foo one 0.438599 0.053046
two 0.588275 0.074167
qux one -0.197791 -1.983309
two 0.725947 0.305324
In [54]: df = df[:4]
In [55]: df
Out[55]:
A B
first second
bar one 1.468431 1.239477
two 1.134492 0.163859
baz one -0.646760 -1.188458
two 1.389973 -1.437824
We built a DataFrame with a two-level MultiIndex and kept the first four rows.
The stack() method “compresses” a level in the DataFrame’s columns.
My understanding: stack() moves the column labels into the row index, so the row index gains a level while the several columns collapse into a single column of values; unstack() moves a row index level back out into the columns.
In [76]: stacked = df.stack()
In [77]: stacked
Out[77]:
first second
bar one A 1.468431
B 1.239477
two A 1.134492
B 0.163859
baz one A -0.646760
B -1.188458
two A 1.389973
B -1.437824
dtype: float64
In [78]: stacked.index
Out[78]:
MultiIndex([('bar', 'one', 'A'),
('bar', 'one', 'B'),
('bar', 'two', 'A'),
('bar', 'two', 'B'),
('baz', 'one', 'A'),
('baz', 'one', 'B'),
('baz', 'two', 'A'),
('baz', 'two', 'B')],
names=['first', 'second', None])
In [79]: stacked.__class__
Out[79]: pandas.core.series.Series
As the output shows, stack() turned the DataFrame into a Series, and the original two-level index became three levels.
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:
In [80]: stacked
Out[80]:
first second
bar one A 1.468431
B 1.239477
two A 1.134492
B 0.163859
baz one A -0.646760
B -1.188458
two A 1.389973
B -1.437824
dtype: float64
In [81]: stacked.unstack()
Out[81]:
A B
first second
bar one 1.468431 1.239477
two 1.134492 0.163859
baz one -0.646760 -1.188458
two 1.389973 -1.437824
In [82]: stacked.unstack().__class__
Out[82]: pandas.core.frame.DataFrame
In [83]: stacked.unstack(0)
Out[83]:
first bar baz
second
one A 1.468431 -0.646760
B 1.239477 -1.188458
two A 1.134492 1.389973
B 0.163859 -1.437824
With stack() understood, unstack() follows naturally: by default it moves the last index level back into the columns, but you can also pass the level to move.
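A deterministic sketch of the round trip (stack then unstack returns the original frame) and of unstacking a chosen level:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([("bar", "one"), ("bar", "two"),
                                 ("baz", "one"), ("baz", "two")],
                                names=["first", "second"])
df_m = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=idx, columns=["A", "B"])

stacked = df_m.stack()          # columns A/B become a third index level
round_trip = stacked.unstack()  # default: unstack the LAST level -> back to df_m
by_first = stacked.unstack(0)   # unstack level 0 ('first') into the columns
```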
Pivot tables
See the section on Pivot Tables.
Pivot tables: as I understand them, you pick suitable columns of the existing data to serve as the index, as the columns, and as the values.
In [84]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["one", "one", "two", "three"] * 3,
    ...:         "B": ["A", "B", "C"] * 4,
    ...:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
    ...:         "D": np.random.randn(12),
    ...:         "E": np.random.randn(12),
    ...:     }
    ...: )
In [85]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
Out[85]:
C bar foo
A B
one A -1.549085 0.117403
B -0.343714 -0.231218
C -0.890250 -1.735391
three A -0.590521 NaN
B NaN -2.155207
C -0.771405 NaN
two A NaN 1.263717
B 0.061399 NaN
C NaN 0.912017
In [86]: pd.pivot_table(df, values="E", index=["A", "B"], columns=["C"])
Out[86]:
C bar foo
A B
one A -0.908049 0.088130
B -0.661966 1.091079
C -1.096862 0.875454
three A -0.425327 NaN
B NaN 0.474478
C -0.928048 NaN
two A NaN -0.663361
B -1.199722 NaN
C NaN -0.270919
In [87]: pd.pivot_table(df, values=["E",'D'], index=["A", "B"], columns=["C"])
Out[87]:
D E
C bar foo bar foo
A B
one A -1.549085 0.117403 -0.908049 0.088130
B -0.343714 -0.231218 -0.661966 1.091079
C -0.890250 -1.735391 -1.096862 0.875454
three A -0.590521 NaN -0.425327 NaN
B NaN -2.155207 NaN 0.474478
C -0.771405 NaN -0.928048 NaN
two A NaN 1.263717 NaN -0.663361
B 0.061399 NaN -1.199722 NaN
C NaN 0.912017 NaN -0.270919
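pivot_table also takes an aggfunc (the default is the mean) that decides how duplicate (index, columns) pairs are combined; a sketch with made-up data:

```python
import numpy as np
import pandas as pd

df_p = pd.DataFrame({"A": ["one", "one", "two", "two"],
                     "C": ["bar", "foo", "bar", "bar"],
                     "D": [1.0, 2.0, 3.0, 5.0]})

# The two (two, bar) rows are combined by aggfunc; pairs that never
# occur (here (two, foo)) come out as NaN.
table = pd.pivot_table(df_p, values="D", index="A", columns="C", aggfunc="sum")
```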
Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
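A minimal resampling sketch, downsampling per-second data to one-minute sums (the frequency aliases "s" and "1min" are assumed to be supported by your pandas version):

```python
import numpy as np
import pandas as pd

# 120 seconds of per-second data: two full minutes of ones.
rng = pd.date_range("2012-01-01", periods=120, freq="s")
ts = pd.Series(np.ones(120), index=rng)

# Downsample to 1-minute bins, summing the values in each bin.
per_minute = ts.resample("1min").sum()
```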