Python Data Analysis: Setting Indexes

1. Main Topics

Reindexing
Setting a column as the row index
Restoring a contiguous row index after data cleaning

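2. Reindexing

Reindexing is done mainly with the reindex() method, available as pandas.Series.reindex() and pandas.DataFrame.reindex(). The signatures are:

Series.reindex(index=None, *, axis=None, method=None, copy=None, level=None, fill_value=None, limit=None, tolerance=None)

DataFrame.reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)

Parameters (as documented for Series.reindex(); DataFrame.reindex() additionally accepts labels and columns, and its axis chooses which axis labels applies to):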

index: array-like, optional

    New labels for the index. Preferably an Index object to avoid duplicating data.

axis: int or str, optional

    Unused.

method: {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}

    Method to use for filling holes in the reindexed object. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

    None (default): don't fill gaps
    pad / ffill: Propagate last valid observation forward to next valid.
    backfill / bfill: Use next valid observation to fill gap.
    nearest: Use nearest valid observations to fill gap.

copy: bool, default True

    Return a new object, even if the passed indexes are the same.

level: int or name

    Broadcast across a level, matching Index values on the passed MultiIndex level.

fill_value: scalar, default np.nan

    Value to use for missing values. Defaults to NaN, but can be any "compatible" value.

limit: int, default None

    Maximum number of consecutive elements to forward or backward fill.

tolerance: optional

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
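The interaction of method, limit, and tolerance is easier to see on a small case. The following is a minimal sketch, not taken from the original post: the Series values, the target labels, and the variable names are illustrative assumptions. It uses a monotonically increasing integer index, which the fill methods require.

```python
import pandas as pd

# Illustrative data (assumed): a monotonically increasing index with gaps.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 10, 20])
target = [0, 3, 8, 12, 19, 25]

# Forward fill: each new label takes the value of the closest earlier label.
ffilled = s.reindex(target, method="ffill")

# Nearest match, accepted only if the original label is within 2 of the target.
nearest = s.reindex(target, method="nearest", tolerance=2)

# Forward fill, but at most one consecutive new label is filled per gap.
limited = s.reindex(target, method="ffill", limit=1)

print(ffilled)
print(nearest)
print(limited)
```

Labels that fall outside the tolerance, or beyond the limit, are left as NaN rather than filled.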

Code examples:

pandas.DataFrame.reindex()

import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=index)
print(df)

### Output
#            http_status  response_time
# Firefox            200           0.04
# Chrome             200           0.02
# Safari             404           0.07
# IE10               404           0.08
# Konqueror          301           1.00

#--------------------------------------------#

# Labels that are not in the original index get NaN
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
             'Chrome']
df1 = df.reindex(new_index)
print(df1)

### Output
#                http_status  response_time
# Safari               404.0           0.07
# Iceweasel              NaN            NaN
# Comodo Dragon          NaN            NaN
# IE10                 404.0           0.08
# Chrome               200.0           0.02

#--------------------------------------------#

# Fill missing entries with 0 instead of NaN
df1 = df.reindex(new_index, fill_value=0)
print(df1)

### Output
#                http_status  response_time
# Safari                 404           0.07
# Iceweasel                0           0.00
# Comodo Dragon            0           0.00
# IE10                   404           0.08
# Chrome                 200           0.02

#--------------------------------------------#

# fill_value can also be a string
df1 = df.reindex(new_index, fill_value='missing')
print(df1)

### Output
#               http_status response_time
# Safari                404          0.07
# Iceweasel         missing       missing
# Comodo Dragon     missing       missing
# IE10                  404          0.08
# Chrome                200          0.02

#--------------------------------------------#

# Reindex the columns with the columns keyword
df1 = df.reindex(columns=['http_status', 'user_agent'])
print(df1)

### Output
#            http_status  user_agent
# Firefox            200         NaN
# Chrome             200         NaN
# Safari             404         NaN
# IE10               404         NaN
# Konqueror          301         NaN

#--------------------------------------------#

# Same result, selecting the axis explicitly
df1 = df.reindex(['http_status', 'user_agent'], axis="columns")
print(df1)

### Output
#            http_status  user_agent
# Firefox            200         NaN
# Chrome             200         NaN
# Safari             404         NaN
# IE10               404         NaN
# Konqueror          301         NaN

pandas.Series.reindex()

s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
print(s1)
print(s1.reindex([1, 2, 3, 4, 5]))

### Output
# 1    88
# 2    60
# 3    75
# dtype: int64
# 1    88.0
# 2    60.0
# 3    75.0
# 4     NaN
# 5     NaN
# dtype: float64

#--------------------------------------------#

# Reindex, filling missing entries with 0 instead of NaN
print(s1.reindex([1, 2, 3, 4, 5], fill_value=0))

### Output
# 1    88
# 2    60
# 3    75
# 4     0
# 5     0
# dtype: int64

#--------------------------------------------#

s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
print(s1)
print(s1.reindex([1, 2, 3, 4, 5], method='ffill'))  # forward fill
print(s1.reindex([1, 2, 3, 4, 5], method='bfill'))  # backward fill

### Output
# 1    88
# 2    60
# 3    75
# dtype: int64
# 1    88
# 2    60
# 3    75
# 4    75
# 5    75
# dtype: int64
# 1    88.0
# 2    60.0
# 3    75.0
# 4     NaN
# 5     NaN
# dtype: float64
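The outputs above switch from int64 to float64 whenever NaN is introduced, because the default integer dtype has no way to represent a missing value. As a hedged aside that is not part of the original post, one way to keep integers is to convert to the nullable "Int64" extension dtype first. The variable names here are illustrative.

```python
import pandas as pd

s1 = pd.Series([88, 60, 75], index=[1, 2, 3])

# Plain int64 cannot hold NaN, so the result is upcast to float64.
upcast = s1.reindex([1, 2, 3, 4, 5])

# With the nullable "Int64" dtype, missing entries become <NA>
# and the integer dtype is preserved.
nullable = s1.astype("Int64").reindex([1, 2, 3, 4, 5])

print(upcast.dtype)    # float64
print(nullable.dtype)  # Int64
```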

A combined example:

data = [[110, 105, 99], [105, 88, 115], [109, 120, 130]]
index = ['Student1', 'Student2', 'Student3']
columns = ['Chinese', 'Math', 'English']
df = pd.DataFrame(data=data, index=index, columns=columns)
print(df)

# Use reindex() on the rows, on the columns, and on both at once.
# Labels that do not exist in the original get NaN; reindex() does not rename.
print(df.reindex(['mr001', 'mr002', 'mr003', 'mr004', 'mr005']))
print(df.reindex(columns=['Chinese', 'Physics', 'Math', 'English']))
print(df.reindex(index=['Student1', 'Student2', 'Student3', 'Student4', 'Student5'], columns=['Chinese', 'Physics', 'Math', 'English']))

### Output
#           Chinese  Math  English
# Student1      110   105       99
# Student2      105    88      115
# Student3      109   120      130

#        Chinese  Math  English
# mr001      NaN   NaN      NaN
# mr002      NaN   NaN      NaN
# mr003      NaN   NaN      NaN
# mr004      NaN   NaN      NaN
# mr005      NaN   NaN      NaN

#           Chinese  Physics  Math  English
# Student1      110      NaN   105       99
# Student2      105      NaN    88      115
# Student3      109      NaN   120      130

#           Chinese  Physics   Math  English
# Student1    110.0      NaN  105.0     99.0
# Student2    105.0      NaN   88.0    115.0
# Student3    109.0      NaN  120.0    130.0
# Student4      NaN      NaN    NaN      NaN
# Student5      NaN      NaN    NaN      NaN
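The all-NaN table for the mr001 to mr005 labels shows that reindex() aligns data to labels and does not relabel rows. If the goal is to give existing rows new names, rename() or assignment to .index are the usual tools. The sketch below is an illustration under assumed data (a score table like the one above with English column names), not something the original post covers.

```python
import pandas as pd

df = pd.DataFrame(
    [[110, 105, 99], [105, 88, 115], [109, 120, 130]],
    index=["Student1", "Student2", "Student3"],
    columns=["Chinese", "Math", "English"],
)

# reindex() looks up the requested labels; none exist, so every cell is NaN.
aligned = df.reindex(["mr001", "mr002", "mr003"])

# rename() maps old labels to new ones and keeps the data.
renamed = df.rename(
    index={"Student1": "mr001", "Student2": "mr002", "Student3": "mr003"}
)

# Assigning to .index relabels rows by position (lengths must match).
relabeled = df.copy()
relabeled.index = ["mr001", "mr002", "mr003"]

print(aligned)
print(renamed)
```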

3. Setting a Column as the Row Index

This is done mainly with the DataFrame.set_index() method. The signature is:

DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False)

Parameters:

keys: label or array-like or list of labels/arrays

    This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, "array" encompasses Series, Index, np.ndarray, and instances of Iterator.

drop: bool, default True

    Delete columns to be used as the new index.

append: bool, default False

    Whether to append columns to existing index.

inplace: bool, default False

    Whether to modify the DataFrame rather than creating a new one.

verify_integrity: bool, default False

    Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
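The drop, append, and verify_integrity flags are not exercised by the examples in this post, so here is a small sketch of what each one changes. The DataFrame reuses the month/year/sale data from the examples; the variable names kept, appended, and duplicate_error are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 4, 7, 10],
                   "year": [2012, 2014, 2013, 2014],
                   "sale": [55, 40, 84, 31]})

# drop=False keeps 'month' as a regular column as well as the index.
kept = df.set_index("month", drop=False)

# append=True adds 'year' as an extra level next to the existing index.
appended = df.set_index("year", append=True)

# verify_integrity=True raises ValueError because 2014 appears twice.
try:
    df.set_index("year", verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(kept.columns.tolist())
print(appended.index.names)
print(duplicate_error)
```

Leaving verify_integrity at its default of False skips the duplicate check, which is why duplicated index values such as 2014 are accepted silently in the other examples.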

Code examples:

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
print(df)

### Output
#    month  year  sale
# 0      1  2012    55
# 1      4  2014    40
# 2      7  2013    84
# 3     10  2014    31

#--------------------------------------------#

# Use a single column as the index
df1 = df.set_index('month')
print(df1)

### Output
#        year  sale
# month
# 1      2012    55
# 4      2014    40
# 7      2013    84
# 10     2014    31

#--------------------------------------------#

# Use several columns to build a MultiIndex
df1 = df.set_index(['year', 'month'])
print(df1)

### Output
#             sale
# year month
# 2012 1        55
# 2014 4        40
# 2013 7        84
# 2014 10       31

#--------------------------------------------#

# Combine an Index object with a column
df1 = df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
print(df1)

### Output
#         month  sale
#   year
# 1 2012      1    55
# 2 2014      4    40
# 3 2013      7    84
# 4 2014     10    31

#--------------------------------------------#

# Build a MultiIndex from two Series
s = pd.Series([1, 2, 3, 4])
df1 = df.set_index([s, s**2])
print(df1)

### Output
#       month  year  sale
# 1 1       1  2012    55
# 2 4       4  2014    40
# 3 9       7  2013    84
# 4 16     10  2014    31

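4. Restoring a Contiguous Row Index After Data Cleaning

Dropping rows during cleaning leaves gaps in the row labels. The usual fix is to chain reset_index(drop=True), which renumbers the rows from 0 and discards the old labels:

df.dropna().reset_index(drop=True)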
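To make the effect visible, here is a minimal sketch with made-up data (the name and score columns are assumptions for illustration). It shows the gapped index after dropna(), the renumbered index after reset_index(drop=True), and what happens when drop=True is left out.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): two rows have a missing score.
df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "score": [90.0, np.nan, 75.0, np.nan]})

cleaned = df.dropna()
print(cleaned.index.tolist())      # [0, 2]

renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]

# Without drop=True the old labels are kept as a new 'index' column.
with_old = cleaned.reset_index()
print(with_old.columns.tolist())   # ['index', 'name', 'score']
```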

Date: February 4, 2024

posted @ 2024-02-04 16:38 一路狂奔的乌龟
