
Python Data Analysis: Setting Indexes

1. Contents

  1. Reindexing
  2. Setting a column as the row index
  3. Resetting to a continuous row index after data cleaning

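2. Reindexing

  Reindexing is done mainly with the reindex() method, available as pandas.Series.reindex() and pandas.DataFrame.reindex(). The syntax is:

Series.reindex(index=None, *, axis=None, method=None, copy=None, level=None, fill_value=None, limit=None, tolerance=None)
DataFrame.reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)

Parameters (descriptions follow Series.reindex(); DataFrame.reindex() additionally accepts labels and columns, and its axis argument selects which axis labels applies to):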

  • index:array-like, optional

  New labels for the index. Preferably an Index object to avoid duplicating data.

  • axis:int or str, optional

  Unused.

  • method:{None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}

  Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

    • None (default): don’t fill gaps
    • pad / ffill: Propagate last valid observation forward to next valid.
    • backfill / bfill: Use next valid observation to fill gap.
    • nearest: Use nearest valid observations to fill gap.
  • copy:bool, default True

  Return a new object, even if the passed indexes are the same.

  • level:int or name

  Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value:scalar, default np.nan

  Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • limit:int, default None

  Maximum number of consecutive elements to forward or backward fill.

  • tolerance:optional

  Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

  Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.
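The interaction between method, tolerance and limit is easier to see on a small case. The following is a minimal sketch, assuming a made-up Series with integer labels 0, 5 and 10; the data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: values recorded at labels 0, 5 and 10.
s = pd.Series([10, 20, 30], index=[0, 5, 10])
target = [0, 1, 4, 6, 9, 12]

# method="nearest" takes the value of the closest existing label.
nearest = s.reindex(target, method="nearest")

# tolerance=1 keeps a match only if the existing label is at most 1 away;
# label 12 is 2 away from label 10, so it stays NaN.
within_one = s.reindex(target, method="nearest", tolerance=1)

# limit=2 forward-fills at most two consecutive new labels after each
# existing label; the remaining new labels in a gap stay NaN.
limited = s.reindex(range(0, 11), method="ffill", limit=2)

print(nearest)
print(within_one)
print(limited)
```

Without tolerance, every new label receives a value no matter how far away its nearest neighbour is, so tolerance and limit are the usual ways to keep filled values from spreading too far.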

Code examples:

  pandas.DataFrame.reindex()

import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=index)
print(df)

### Output
#            http_status  response_time
# Firefox            200           0.04
# Chrome             200           0.02
# Safari             404           0.07
# IE10               404           0.08
# Konqueror          301           1.00

#--------------------------------------------#

# Labels that are not in the original index get NaN,
# so the integer column is upcast to float.
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
             'Chrome']
df1 = df.reindex(new_index)
print(df1)

### Output
#                http_status  response_time
# Safari               404.0           0.07
# Iceweasel              NaN            NaN
# Comodo Dragon          NaN            NaN
# IE10                 404.0           0.08
# Chrome               200.0           0.02

#--------------------------------------------#

# Fill the missing rows with 0 instead of NaN
df1 = df.reindex(new_index, fill_value=0)
print(df1)

### Output
#                http_status  response_time
# Safari                 404           0.07
# Iceweasel                0           0.00
# Comodo Dragon            0           0.00
# IE10                   404           0.08
# Chrome                 200           0.02

#--------------------------------------------#

# fill_value can also be a string
df1 = df.reindex(new_index, fill_value='missing')
print(df1)

### Output
#               http_status response_time
# Safari                404          0.07
# Iceweasel         missing       missing
# Comodo Dragon     missing       missing
# IE10                  404          0.08
# Chrome                200          0.02

#--------------------------------------------#

# Reindex the columns with the columns keyword
df1 = df.reindex(columns=['http_status', 'user_agent'])
print(df1)

### Output
#            http_status  user_agent
# Firefox            200         NaN
# Chrome             200         NaN
# Safari             404         NaN
# IE10               404         NaN
# Konqueror          301         NaN

#--------------------------------------------#

# Equivalent form: pass the labels and name the axis
df1 = df.reindex(['http_status', 'user_agent'], axis="columns")
print(df1)

### Output
#            http_status  user_agent
# Firefox            200         NaN
# Chrome             200         NaN
# Safari             404         NaN
# IE10               404         NaN
# Konqueror          301         NaN

   pandas.Series.reindex()

import pandas as pd

s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
print(s1)
print(s1.reindex([1, 2, 3, 4, 5]))

### Output
# 1    88
# 2    60
# 3    75
# dtype: int64
# 1    88.0
# 2    60.0
# 3    75.0
# 4     NaN
# 5     NaN
# dtype: float64

#--------------------------------------------#

# Reindex and fill the missing entries with 0 instead of NaN
print(s1.reindex([1, 2, 3, 4, 5], fill_value=0))

### Output
# 1    88
# 2    60
# 3    75
# 4     0
# 5     0
# dtype: int64

#--------------------------------------------#

s1 = pd.Series([88, 60, 75], index=[1, 2, 3])
print(s1)
print(s1.reindex([1, 2, 3, 4, 5], method='ffill'))  # forward fill
# backward fill: labels 4 and 5 have no later value, so they stay NaN
print(s1.reindex([1, 2, 3, 4, 5], method='bfill'))

### Output
# 1    88
# 2    60
# 3    75
# dtype: int64
# 1    88
# 2    60
# 3    75
# 4    75
# 5    75
# dtype: int64
# 1    88.0
# 2    60.0
# 3    75.0
# 4     NaN
# 5     NaN
# dtype: float64
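The method argument only works when the existing index is monotonically increasing or decreasing, as the parameter description notes. The sketch below illustrates this with the same scores stored under an unsorted index (an assumption made for the illustration); sorting first with sort_index() is one way to make the fill possible.

```python
import pandas as pd

# Same values as above, but stored under an unsorted index (hypothetical).
unsorted = pd.Series([88, 60, 75], index=[3, 1, 2])

try:
    unsorted.reindex([1, 2, 3, 4], method="ffill")
    error = None
except ValueError as exc:
    # pandas refuses to fill because the index is not monotonic.
    error = exc

# Sorting the index first makes forward filling well defined.
filled = unsorted.sort_index().reindex([1, 2, 3, 4], method="ffill")

print(error)
print(filled)
```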

  Combined example:

import pandas as pd

data = [[110, 105, 99], [105, 88, 115], [109, 120, 130]]
index = ['Student1', 'Student2', 'Student3']
columns = ['Chinese', 'Math', 'English']
df = pd.DataFrame(data=data, index=index, columns=columns)
print(df)

# Use reindex() to reset the row index, the column index, and both at once.
# None of the labels mr001..mr005 exist in the original index, so every row is NaN.
print(df.reindex(['mr001', 'mr002', 'mr003', 'mr004', 'mr005']))
print(df.reindex(columns=['Chinese', 'Physics', 'Math', 'English']))
print(df.reindex(index=['Student1', 'Student2', 'Student3', 'Student4', 'Student5'], columns=['Chinese', 'Physics', 'Math', 'English']))

### Output
#           Chinese  Math  English
# Student1      110   105       99
# Student2      105    88      115
# Student3      109   120      130

#        Chinese  Math  English
# mr001      NaN   NaN      NaN
# mr002      NaN   NaN      NaN
# mr003      NaN   NaN      NaN
# mr004      NaN   NaN      NaN
# mr005      NaN   NaN      NaN

#           Chinese  Physics  Math  English
# Student1      110      NaN   105       99
# Student2      105      NaN    88      115
# Student3      109      NaN   120      130

#           Chinese  Physics   Math  English
# Student1    110.0      NaN  105.0     99.0
# Student2    105.0      NaN   88.0    115.0
# Student3    109.0      NaN  120.0    130.0
# Student4      NaN      NaN    NaN      NaN
# Student5      NaN      NaN    NaN      NaN
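The all-NaN result for the mr001 labels shows that reindex() aligns data to labels rather than renaming them. If the goal is only to relabel existing rows, set_axis() or rename() may be a better fit. A minimal sketch, assuming a score table like the one above (column names written in English here):

```python
import pandas as pd

df = pd.DataFrame(
    [[110, 105, 99], [105, 88, 115], [109, 120, 130]],
    index=["Student1", "Student2", "Student3"],
    columns=["Chinese", "Math", "English"],
)
new_labels = ["mr001", "mr002", "mr003"]

# reindex() looks the new labels up in the old index; none exist, so all NaN.
aligned = df.reindex(new_labels)

# set_axis() keeps the data and replaces the labels position by position.
relabeled = df.set_axis(new_labels, axis=0)

# rename() maps selected old labels to new ones explicitly.
renamed = df.rename(index={"Student1": "mr001"})

print(aligned)
print(relabeled)
print(renamed)
```

All three calls return new objects, so df itself keeps its original labels.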

3. Setting a column as the row index

  This is done mainly with the DataFrame.set_index() method. The syntax is:

DataFrame.set_index(keys, *, drop=True, append=False, inplace=False, verify_integrity=False)

Parameters:

  • keys:label or array-like or list of labels/arrays

  This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.

  • drop:bool, default True

  Delete columns to be used as the new index.

  • append:bool, default False

  Whether to append columns to existing index.

  • inplace:bool, default False

  Whether to modify the DataFrame rather than creating a new one.

  • verify_integrity:bool, default False

  Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.
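The drop, append and verify_integrity options can be compared side by side. The following sketch uses a small sales table invented for the illustration; the variable names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"month": [1, 4, 7, 10],
                   "year": [2012, 2014, 2013, 2014],
                   "sale": [55, 40, 84, 31]})

# drop=False keeps "month" as a regular column as well as the index.
kept = df.set_index("month", drop=False)

# append=True adds "year" as a second level next to the existing index.
appended = df.set_index("year", append=True)

# verify_integrity=True raises ValueError because 2014 appears twice.
try:
    df.set_index("year", verify_integrity=True)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(kept)
print(appended)
print(duplicate_error)
```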

Code examples:

import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
print(df)

### Output
#    month  year  sale
# 0      1  2012    55
# 1      4  2014    40
# 2      7  2013    84
# 3     10  2014    31

#--------------------------------------------#

# Use a single column as the index
df1 = df.set_index('month')
print(df1)

### Output
#        year  sale
# month
# 1      2012    55
# 4      2014    40
# 7      2013    84
# 10     2014    31

#--------------------------------------------#

# Use two columns to build a MultiIndex
df1 = df.set_index(['year', 'month'])
print(df1)

### Output
#             sale
# year month
# 2012 1        55
# 2014 4        40
# 2013 7        84
# 2014 10       31

#--------------------------------------------#

# Combine an Index object with a column
df1 = df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
print(df1)

### Output
#         month  sale
#   year
# 1 2012      1    55
# 2 2014      4    40
# 3 2013      7    84
# 4 2014     10    31

#--------------------------------------------#

# Build a MultiIndex from two Series
s = pd.Series([1, 2, 3, 4])
df1 = df.set_index([s, s**2])
print(df1)

### Output
#       month  year  sale
# 1 1       1  2012    55
# 2 4       4  2014    40
# 3 9       7  2013    84
# 4 16     10  2014    31

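4. Resetting to a continuous row index after data cleaning

Dropping rows leaves gaps in the row index. The usual way to restore a continuous index is:

df.dropna().reset_index(drop=True)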
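To see why the reset is needed, the sketch below drops rows with missing values from a small made-up table and compares the index before and after. The data and variable names are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "score": [90.0, np.nan, 75.0, np.nan]})

# dropna() removes rows 1 and 3, leaving the gapped index 0, 2.
cleaned = df.dropna()

# drop=True discards the old labels and renumbers from 0.
renumbered = cleaned.reset_index(drop=True)

# Without drop=True the old labels are kept in a new "index" column.
with_old = cleaned.reset_index()

print(cleaned)
print(renumbered)
print(with_old)
```

Keeping the old labels with the default drop=False can be useful when the cleaned rows later need to be traced back to the original data.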

 

 Date: February 4, 2024

 

posted @ 2024-02-04 16:38  一路狂奔的乌龟