An introduction to using pandas.DataFrame.reindex
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html#pandas.DataFrame.reindex
DataFrame.reindex(labels=None, index=None, columns=None, axis=None, method=None, copy=True, level=None, fill_value=nan, limit=None, tolerance=None)
Conform Series/DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Parameters
- keywords for axes : array-like, optional
  New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.
- method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}
  Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
  - None (default): don't fill gaps
  - pad / ffill: Propagate last valid observation forward to next valid.
  - backfill / bfill: Use next valid observation to fill gap.
  - nearest: Use nearest valid observations to fill gap.
- copy : bool, default True
  Return a new object, even if the passed indexes are the same.
- level : int or name
  Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value : scalar, default np.NaN
  Value to use for missing values. Defaults to NaN, but can be any "compatible" value.
- limit : int, default None
  Maximum number of consecutive elements to forward or backward fill.
- tolerance : optional
  Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
  Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
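As a minimal sketch of how method, limit, and tolerance interact, the snippet below uses small made-up Series with numeric indexes. The names `s`, `unordered`, `near`, and `capped` and all of the values are illustrative assumptions, not part of the documentation quoted above.

```python
import pandas as pd

# Hypothetical data: a monotonically increasing numeric index.
s = pd.Series([1.0, 2.0, 3.0], index=[0, 10, 20])

# 'nearest' with a tolerance: a new label farther than 2 from every
# original label gets no match and stays NaN (here 14 and 26).
near = s.reindex(index=[1, 9, 14, 26], method="nearest", tolerance=2)

# 'ffill' with limit=1: only the first new label after an original
# label is filled; the following ones (2 and 3) stay NaN.
capped = s.reindex(index=[0, 1, 2, 3, 10], method="ffill", limit=1)

# method only works on a monotonic index; an unordered one raises ValueError.
unordered = pd.Series([1.0, 2.0, 3.0], index=[10, 0, 20])
try:
    unordered.reindex(index=[1, 9], method="ffill")
    raised = False
except ValueError:
    raised = True

print(near)
print(capped)
print(raised)
```

Labels that match the original index exactly are always kept; limit and tolerance only restrict the inexact matches.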
DataFrame.reindex supports two calling conventions:
- (index=index_labels, columns=column_labels, ...)
- (labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
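The two calling conventions can be compared side by side. This is a small sketch on a made-up two-row frame (the data and the variable names are assumptions for illustration); both forms should give the same result for rows and for columns.

```python
import pandas as pd

# Hypothetical two-row frame, just large enough to compare the two styles.
df = pd.DataFrame(
    {"http_status": [200, 404], "response_time": [0.04, 0.07]},
    index=["Firefox", "Safari"],
)

# Convention 1: name the axis through the index= / columns= keywords.
rows_kw = df.reindex(index=["Safari", "Chrome"])
cols_kw = df.reindex(columns=["http_status", "user_agent"])

# Convention 2: pass the labels first and select the axis with axis=.
rows_axis = df.reindex(["Safari", "Chrome"], axis="index")
cols_axis = df.reindex(["http_status", "user_agent"], axis="columns")

# DataFrame.equals treats NaNs in the same positions as equal.
print(rows_kw.equals(rows_axis))
print(cols_kw.equals(cols_axis))
```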
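From reading the documentation: reindex takes an externally defined index and returns a new df object; labels in the new index that have no data in the original can be given default values.
There are two ways to pass the arguments; the first is recommended.
In the version I tested, the col_level parameter has been renamed to level.
The example code from the book follows. The method is mainly used to reset the index and to supply default values for the entries of the new index.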
In [122]: import numpy as np
     ...: import pandas as pd
In [123]: index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
     ...: df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
     ...:                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
     ...:                   index=index)
In [124]: df
Out[124]:
http_status response_time
Firefox 200 0.04
Chrome 200 0.02
Safari 404 0.07
IE10 404 0.08
Konqueror 301 1.00
This defines a df object with an explicit index.
Next we define a new index object and call reindex with the default parameters.
In [130]: new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...: 'Chrome']
In [131]: df
Out[131]:
http_status response_time
Firefox 200 0.04
Chrome 200 0.02
Safari 404 0.07
IE10 404 0.08
Konqueror 301 1.00
In [132]: df.reindex(index=new_index)
Out[132]:
http_status response_time
Safari 404.0 0.07
Iceweasel NaN NaN
Comodo Dragon NaN NaN
IE10 404.0 0.08
Chrome 200.0 0.02
This produces a new df object containing the newly added index labels.
We can also use the fill_value option to set the default value.
In [133]: df.reindex(index=new_index, fill_value='missing')
Out[133]:
http_status response_time
Safari 404 0.07
Iceweasel missing missing
Comodo Dragon missing missing
IE10 404 0.08
Chrome 200 0.02
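One side effect worth checking is the column dtype. The sketch below, on a made-up integer column, suggests that filling with NaN forces the column to float (hence 404.0 in the earlier output), while a numeric fill_value such as 0 lets it stay an integer. The value 0 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical integer-only frame.
df = pd.DataFrame(
    {"http_status": [200, 200, 404]},
    index=["Firefox", "Chrome", "Safari"],
)
new_index = ["Safari", "Iceweasel", "Chrome"]

# Default fill: the missing row becomes NaN, so the column is upcast to float.
with_nan = df.reindex(index=new_index)

# Integer fill value: no NaN is introduced, so the integer dtype is preserved.
with_zero = df.reindex(index=new_index, fill_value=0)

print(with_nan["http_status"].dtype)
print(with_zero["http_status"].dtype)
```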
The column index can also be reset in either of the following two ways.
In [134]: df.reindex(columns=['http_status', 'user_agent'])
Out[134]:
http_status user_agent
Firefox 200 NaN
Chrome 200 NaN
Safari 404 NaN
IE10 404 NaN
Konqueror 301 NaN
In [135]: df.reindex(['http_status', 'user_agent'], axis="columns")
Out[135]:
http_status user_agent
Firefox 200 NaN
Chrome 200 NaN
Safari 404 NaN
IE10 404 NaN
Konqueror 301 NaN
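To illustrate reindex further, the next example uses the method parameter to fill in values; this only applies to an ordered (monotonic) index.
First, create a df object with a date index.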
In [137]: date_index = pd.date_range('1/1/2010', periods=6, freq='D')
...: df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...: index=date_index)
...:
In [138]: df2
Out[138]:
prices
2010-01-01 100.0
2010-01-02 101.0
2010-01-03 NaN
2010-01-04 100.0
2010-01-05 89.0
2010-01-06 88.0
Then reindex onto a longer date range, first with the default parameters and then with the method parameter.
In [139]: date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
In [140]: df2.reindex(index=date_index2)
Out[140]:
prices
2009-12-29 NaN
2009-12-30 NaN
2009-12-31 NaN
2010-01-01 100.0
2010-01-02 101.0
2010-01-03 NaN
2010-01-04 100.0
2010-01-05 89.0
2010-01-06 88.0
2010-01-07 NaN
In [141]: df2.reindex(index=date_index2, method='bfill')
Out[141]:
prices
2009-12-29 100.0
2009-12-30 100.0
2009-12-31 100.0
2010-01-01 100.0
2010-01-02 101.0
2010-01-03 NaN
2010-01-04 100.0
2010-01-05 89.0
2010-01-06 88.0
2010-01-07 NaN
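The output shows that the default fill is still NaN. With method='bfill', each newly added label takes the next valid value, so the new dates before 2010-01-01 are filled with 100.0, while 2010-01-07 stays NaN because no later observation exists. The NaN that was already present under the old index (2010-01-03) is not modified.
To change those values as well, use the fillna method.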
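A minimal sketch of that follow-up step, rebuilding the df2 example and then calling fillna on the reindexed result. The replacement value 0 is an arbitrary placeholder chosen for illustration, not something the source prescribes.

```python
import numpy as np
import pandas as pd

# Rebuild the date-indexed frame from the example above.
date_index = pd.date_range("1/1/2010", periods=6, freq="D")
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range("12/29/2009", periods=10, freq="D")

# bfill only fills labels that are new to the index; the NaN already stored
# at 2010-01-03 and the trailing 2010-01-07 both remain NaN.
reindexed = df2.reindex(index=date_index2, method="bfill")

# fillna replaces every remaining NaN, including the pre-existing one.
filled = reindexed.fillna(0)  # 0 is an arbitrary placeholder

print(reindexed["prices"].isna().sum())
print(filled["prices"].isna().sum())
```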