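Python Data Analysis: Setting the Index
1. Overview
- Resetting the index
- Setting a column as the row index
- Resetting to a consecutive row index after data cleaning
2. Resetting the index
The main tool is the reindex() method, available as pandas.Series.reindex() and pandas.DataFrame.reindex(). Their signatures are: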
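Series.reindex(index=None, *, axis=None, method=None, copy=None, level=None, fill_value=None, limit=None, tolerance=None)
DataFrame.reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)
Parameters (as documented for Series.reindex(); DataFrame.reindex() additionally accepts labels and columns, and uses axis to choose which axis labels targets):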
- index:array-like, optional
New labels for the index. Preferably an Index object to avoid duplicating data.
- axis:int or str, optional
Unused.
- method:{None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
- None (default): don’t fill gaps
- pad / ffill: Propagate last valid observation forward to next valid.
- backfill / bfill: Use next valid observation to fill gap.
- nearest: Use nearest valid observations to fill gap.
- copy:bool, default True
Return a new object, even if the passed indexes are the same.
- level:int or name
Broadcast across a level, matching Index values on the passed MultiIndex level.
- fill_value:scalar, default np.nan
Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
- limit:int, default None
Maximum number of consecutive elements to forward or backward fill.
- tolerance:optional
Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
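The interaction between method, limit and tolerance is easier to see on a small case. The following is a minimal sketch, assuming a made-up Series with a monotonically increasing integer index; the variable names and values are illustrative only and do not come from the pandas documentation.

```python
import pandas as pd

# Hypothetical data: three readings on a monotonically increasing index.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 5, 10])

# New labels: 0 and 10 already exist, 1, 4 and 7 do not.
target = [0, 1, 4, 7, 10]

# method="ffill": every missing label takes the value of the last earlier label.
ffilled = s.reindex(target, method="ffill")

# limit=1: at most one consecutive missing label is filled after each
# existing label, so label 4 (the second gap after 0) stays NaN.
limited = s.reindex(target, method="ffill", limit=1)

# method="nearest" with tolerance=1: a label is filled only if the closest
# existing label is at most 1 away, so label 7 (2 away from 5) stays NaN.
nearest = s.reindex(target, method="nearest", tolerance=1)

print(ffilled)
print(limited)
print(nearest)
```

Filling only works here because both the original index and the target labels are sorted; on an unsorted index, reindex() with a method raises an error instead of guessing.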
Code examples:
pandas.DataFrame.reindex()
    import pandas as pd

    index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
    df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                       'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                      index=index)
    print(df)

    ### Result
    #            http_status  response_time
    # Firefox            200           0.04
    # Chrome             200           0.02
    # Safari             404           0.07
    # IE10               404           0.08
    # Konqueror          301           1.00

    #--------------------------------------------#

    # Labels that are missing from the original index get NaN
    new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
                 'Chrome']
    df1 = df.reindex(new_index)
    print(df1)

    ### Result
    #                http_status  response_time
    # Safari               404.0           0.07
    # Iceweasel              NaN            NaN
    # Comodo Dragon          NaN            NaN
    # IE10                 404.0           0.08
    # Chrome               200.0           0.02

    #--------------------------------------------#

    # Fill the missing rows with 0 instead of NaN
    df1 = df.reindex(new_index, fill_value=0)
    print(df1)

    ### Result
    #                http_status  response_time
    # Safari                 404           0.07
    # Iceweasel                0           0.00
    # Comodo Dragon            0           0.00
    # IE10                   404           0.08
    # Chrome                 200           0.02

    #--------------------------------------------#

    # fill_value can be any compatible value, for example a string
    df1 = df.reindex(new_index, fill_value='missing')
    print(df1)

    ### Result
    #               http_status response_time
    # Safari                404          0.07
    # Iceweasel         missing       missing
    # Comodo Dragon     missing       missing
    # IE10                  404          0.08
    # Chrome                200          0.02

    #--------------------------------------------#

    # Reindex the columns with the columns keyword
    df1 = df.reindex(columns=['http_status', 'user_agent'])
    print(df1)

    ### Result
    #            http_status  user_agent
    # Firefox            200         NaN
    # Chrome             200         NaN
    # Safari             404         NaN
    # IE10               404         NaN
    # Konqueror          301         NaN

    #--------------------------------------------#

    # Equivalent call using axis="columns"
    df1 = df.reindex(['http_status', 'user_agent'], axis="columns")
    print(df1)

    ### Result
    #            http_status  user_agent
    # Firefox            200         NaN
    # Chrome             200         NaN
    # Safari             404         NaN
    # IE10               404         NaN
    # Konqueror          301         NaN
pandas.Series.reindex()
    import pandas as pd

    s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
    print(s1)
    print(s1.reindex([1, 2, 3, 4, 5]))

    ### Result
    # 1    88
    # 2    60
    # 3    75
    # dtype: int64
    # 1    88.0
    # 2    60.0
    # 3    75.0
    # 4     NaN
    # 5     NaN
    # dtype: float64

    #--------------------------------------------#

    # Reindex and fill the missing entries with 0 instead of NaN
    print(s1.reindex([1, 2, 3, 4, 5], fill_value=0))

    ### Result
    # 1    88
    # 2    60
    # 3    75
    # 4     0
    # 5     0
    # dtype: int64

    #--------------------------------------------#

    s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
    print(s1)
    print(s1.reindex([1, 2, 3, 4, 5], method='ffill'))  # forward fill
    print(s1.reindex([1, 2, 3, 4, 5], method='bfill'))  # backward fill: nothing follows 4 and 5, so they stay NaN

    ### Result
    # 1    88
    # 2    60
    # 3    75
    # dtype: int64
    # 1    88
    # 2    60
    # 3    75
    # 4    75
    # 5    75
    # dtype: int64
    # 1    88.0
    # 2    60.0
    # 3    75.0
    # 4     NaN
    # 5     NaN
    # dtype: float64
Combined example:
    import pandas as pd

    data = [[110, 105, 99], [105, 88, 115], [109, 120, 130]]
    index = ['Student1', 'Student2', 'Student3']
    columns = ['Chinese', 'Math', 'English']
    df = pd.DataFrame(data=data, index=index, columns=columns)
    print(df)

    # Use reindex() to reset the row index, the column index, and both at once
    print(df.reindex(['mr001', 'mr002', 'mr003', 'mr004', 'mr005']))
    print(df.reindex(columns=['Chinese', 'Physics', 'Math', 'English']))
    print(df.reindex(index=['Student1', 'Student2', 'Student3', 'Student4', 'Student5'],
                     columns=['Chinese', 'Physics', 'Math', 'English']))

    ### Result
    #           Chinese  Math  English
    # Student1      110   105       99
    # Student2      105    88      115
    # Student3      109   120      130

    #        Chinese  Math  English
    # mr001      NaN   NaN      NaN
    # mr002      NaN   NaN      NaN
    # mr003      NaN   NaN      NaN
    # mr004      NaN   NaN      NaN
    # mr005      NaN   NaN      NaN

    #           Chinese  Physics  Math  English
    # Student1      110      NaN   105       99
    # Student2      105      NaN    88      115
    # Student3      109      NaN   120      130

    #           Chinese  Physics   Math  English
    # Student1    110.0      NaN  105.0     99.0
    # Student2    105.0      NaN   88.0    115.0
    # Student3    109.0      NaN  120.0    130.0
    # Student4      NaN      NaN    NaN      NaN
    # Student5      NaN      NaN    NaN      NaN
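The all-NaN result of the first reindex() call above is expected: reindex() looks up existing labels and does not rename them. If the goal is to relabel the rows while keeping the data, rename() or set_axis() is probably what is wanted. The sketch below assumes a small score table like the one above; the label mapping is hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    [[110, 105, 99], [105, 88, 115], [109, 120, 130]],
    index=["Student1", "Student2", "Student3"],
    columns=["Chinese", "Math", "English"],
)

# reindex() looks labels up; none of these exist, so every value is NaN.
looked_up = df.reindex(["mr001", "mr002", "mr003"])

# rename() maps old labels to new ones and keeps the data
# (the mapping itself is a made-up example).
mapping = {"Student1": "mr001", "Student2": "mr002", "Student3": "mr003"}
renamed = df.rename(index=mapping)

# set_axis() replaces all row labels by position and returns a new object.
relabeled = df.set_axis(["mr001", "mr002", "mr003"], axis=0)

print(looked_up)
print(renamed)
print(relabeled)
```

rename() is the safer choice when only some labels change, since labels missing from the mapping are left untouched; set_axis() needs exactly one new label per row.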
3. Setting a column as the row index
The main tool is the df.set_index() method. Its signature is:
DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False)
Parameters:
- keys:label or array-like or list of labels/arrays
This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.
- drop:bool, default True
Delete columns to be used as the new index.
- append:bool, default False
Whether to append columns to existing index.
- inplace:bool, default False
Whether to modify the DataFrame rather than creating a new one.
- verify_integrity:bool, default False
Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
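As a rough illustration of the drop, append and verify_integrity parameters described above, here is a minimal sketch on a small sales table. The variable names (kept, appended, duplicate_error) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 4, 7, 10],
                   "year": [2012, 2014, 2013, 2014],
                   "sale": [55, 40, 84, 31]})

# drop=False: 'month' becomes the index but also stays as a regular column.
kept = df.set_index("month", drop=False)

# append=True: 'year' is added as a second level next to the existing
# default integer index, producing a MultiIndex.
appended = df.set_index("year", append=True)

# verify_integrity=True: 'year' contains 2014 twice, so pandas raises
# a ValueError instead of silently creating a duplicated index.
try:
    df.set_index("year", verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(kept)
print(appended)
print(duplicate_error)
```

With the default verify_integrity=False the duplicated 'year' index would be accepted, and the problem would surface only later, for example when a label lookup unexpectedly returns several rows.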
Code examples:
    import pandas as pd

    df = pd.DataFrame({'month': [1, 4, 7, 10],
                       'year': [2012, 2014, 2013, 2014],
                       'sale': [55, 40, 84, 31]})
    print(df)

    ### Result
    #    month  year  sale
    # 0      1  2012    55
    # 1      4  2014    40
    # 2      7  2013    84
    # 3     10  2014    31

    #--------------------------------------------#

    # Use the 'month' column as the index
    df1 = df.set_index('month')
    print(df1)

    ### Result
    #        year  sale
    # month
    # 1      2012    55
    # 4      2014    40
    # 7      2013    84
    # 10     2014    31

    #--------------------------------------------#

    # Build a MultiIndex from the 'year' and 'month' columns
    df1 = df.set_index(['year', 'month'])
    print(df1)

    ### Result
    #             sale
    # year  month
    # 2012  1     55
    # 2014  4     40
    # 2013  7     84
    # 2014  10    31

    #--------------------------------------------#

    # Build a MultiIndex from an Index object and a column
    df1 = df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
    print(df1)

    ### Result
    #          month  sale
    #    year
    # 1  2012  1      55
    # 2  2014  4      40
    # 3  2013  7      84
    # 4  2014  10     31

    #--------------------------------------------#

    # Build a MultiIndex from two Series
    s = pd.Series([1, 2, 3, 4])
    df1 = df.set_index([s, s**2])
    print(df1)

    ### Result
    #       month  year  sale
    # 1 1       1  2012    55
    # 2 4       4  2014    40
    # 3 9       7  2013    84
    # 4 16     10  2014    31
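4. Resetting to a consecutive row index after data cleaning
Dropping rows during cleaning leaves gaps in the row index. The usual way to renumber the rows from 0 is:
    df.dropna().reset_index(drop=True)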
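To see what this one-liner does, the sketch below runs each step separately on a small made-up table; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two incomplete rows (positions 1 and 3).
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "score": [90.0, np.nan, 75.0, np.nan]})

# dropna() removes the incomplete rows but keeps the original labels,
# so the remaining index is 0, 2.
cleaned = df.dropna()

# reset_index(drop=True) discards the old labels and renumbers from 0.
renumbered = cleaned.reset_index(drop=True)

# Without drop=True the old labels are preserved in a new 'index' column.
with_old = cleaned.reset_index()

print(cleaned)
print(renumbered)
print(with_old)
```

Keeping the old labels (the with_old variant) can be useful when the cleaned rows need to be traced back to their positions in the raw data; otherwise drop=True avoids carrying an extra column around.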
Date: February 4, 2024
