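Python Data Analysis: Sorting and Ranking Data
1. Sorting data
Data can be sorted in ascending order, in descending order, or by a custom condition. The usual method is df.sort_values(), whose signature is: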
DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
Parameters:
- by: str or list of str
Name or list of names to sort by.
  - if axis is 0 or ‘index’ then by may contain index levels and/or column labels.
  - if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0
Axis to be sorted.
- ascending: bool or list of bool, default True
Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, it must match the length of by.
- inplace: bool, default False
If True, perform operation in-place.
- kind: {‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’
Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
- na_position: {‘first’, ‘last’}, default ‘last’
Puts NaNs at the beginning if first; last puts NaNs at the end.
- ignore_index: bool, default False
If True, the resulting axis will be labeled 0, 1, …, n - 1.
- key: callable, optional
Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
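The note on kind is easiest to see with tied keys. The following is a minimal sketch, assuming pandas is installed; the frame and its column names (team, score) are made up for illustration. With kind='stable', rows that share a key keep their original relative order.

```python
import pandas as pd

# Hypothetical data: two teams whose rows are deliberately interleaved.
scores = pd.DataFrame({
    'team': ['b', 'a', 'b', 'a'],
    'score': [1, 2, 3, 4],
})

# A stable sort keeps tied rows in their original relative order,
# so within each team the original index order is preserved.
stable = scores.sort_values(by='team', kind='stable')
stable_order = stable.index.tolist()
print(stable_order)  # [1, 3, 0, 2]
```

With the default quicksort, the order of tied rows is not guaranteed, so asking for a stable sort is the safer choice when that order matters.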
Code examples:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
print(df)

# Result
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
# Sort by col1
df1 = df.sort_values(by=['col1'])
print(df1)

# Result
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D
# Sort by multiple columns
df1 = df.sort_values(by=['col1', 'col2'])
print(df1)

# Result
#   col1  col2  col3 col4
# 1    A     1     1    B
# 0    A     2     0    a
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D
# Sort in descending order
df1 = df.sort_values(by='col1', ascending=False)
print(df1)

# Result
#   col1  col2  col3 col4
# 4    D     7     2    e
# 5    C     4     3    F
# 2    B     9     9    c
# 0    A     2     0    a
# 1    A     1     1    B
# 3  NaN     8     4    D
# Put NaN values first
df1 = df.sort_values(by='col1', ascending=False, na_position='first')
print(df1)

# Result
#   col1  col2  col3 col4
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
# 2    B     9     9    c
# 0    A     2     0    a
# 1    A     1     1    B
# Sort with a key function (here: case-insensitive comparison)
df1 = df.sort_values(by='col4', key=lambda col: col.str.lower())
print(df1)

# Result
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
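The examples above do not cover passing a list to ascending, or ignore_index. The sketch below rebuilds the same frame and combines the two; the variable name mixed is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})

# col1 ascending, then col2 descending within each col1 group.
# The ascending list must have one entry per column in `by`.
# ignore_index=True relabels the sorted rows 0, 1, ..., n - 1.
mixed = df.sort_values(by=['col1', 'col2'],
                       ascending=[True, False],
                       ignore_index=True)
print(mixed)
```

Here the two 'A' rows come out with col2 values 2 then 1, the NaN row still lands last, and the index of the result runs from 0 to 5 instead of keeping the original labels.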
2. Ranking data
Ranking is usually done with df.rank(); its signature is:
DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)
By default, tied values are assigned the average of the ranks they would otherwise occupy.
Parameters:
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0
Index to direct ranking. For Series this parameter is unused and defaults to 0.
- method: {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}, default ‘average’
How to rank the group of records that have the same value (i.e. ties):
  - average: average rank of the group
  - min: lowest rank in the group
  - max: highest rank in the group
  - first: ranks assigned in order they appear in the array
  - dense: like ‘min’, but rank always increases by 1 between groups.
- numeric_only: bool, default False
For DataFrame objects, rank only numeric columns if set to True.
- na_option: {‘keep’, ‘top’, ‘bottom’}, default ‘keep’
How to rank NaN values:
  - keep: assign NaN rank to NaN values
  - top: assign lowest rank to NaN values
  - bottom: assign highest rank to NaN values
- ascending: bool, default True
Whether or not the elements should be ranked in ascending order.
- pct: bool, default False
Whether or not to display the returned rankings in percentile form.
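To make the five tie-breaking methods concrete, the sketch below ranks one small Series with each of them. The Series values and the names s, methods and ranks are chosen only for illustration.

```python
import pandas as pd

# The two 4s are tied, so each method treats them differently.
s = pd.Series([4, 2, 4, 8])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column of ranks per method, side by side with the original values.
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})
ranks.insert(0, 'value', s)
print(ranks)
```

The tied 4s get 2.5 under average, 2 under min, 3 under max, 2 and 3 (in order of appearance) under first, and 2 under dense. The value 8 is ranked 4 everywhere except under dense, where it is ranked 3 because dense ranks never skip a number.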
Code examples:
df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})
print(df)

# Result
#     Animal  Number_legs
# 0      cat          4.0
# 1  penguin          2.0
# 2      dog          4.0
# 3   spider          8.0
# 4    snake          NaN
# Default behaviour, with no arguments
df['default_rank'] = df['Number_legs'].rank()
print(df)

# Result
#     Animal  Number_legs  default_rank
# 0      cat          4.0           2.5
# 1  penguin          2.0           1.0
# 2      dog          4.0           2.5
# 3   spider          8.0           4.0
# 4    snake          NaN           NaN
# Tied records all receive the highest rank in their group
# (method='min' works the same way with the lowest rank)
df['max_rank'] = df['Number_legs'].rank(method='max')
print(df[['Animal', 'Number_legs', 'max_rank']])

# Result
#     Animal  Number_legs  max_rank
# 0      cat          4.0       3.0
# 1  penguin          2.0       1.0
# 2      dog          4.0       3.0
# 3   spider          8.0       4.0
# 4    snake          NaN       NaN
# With na_option='bottom', records with NaN values are ranked last
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
print(df[['Animal', 'Number_legs', 'NA_bottom']])

# Result
#     Animal  Number_legs  NA_bottom
# 0      cat          4.0        2.5
# 1  penguin          2.0        1.0
# 2      dog          4.0        2.5
# 3   spider          8.0        4.0
# 4    snake          NaN        5.0
# Express the ranking as a percentile rank
df['pct_rank'] = df['Number_legs'].rank(pct=True)
print(df[['Animal', 'Number_legs', 'pct_rank']])

# Result
#     Animal  Number_legs  pct_rank
# 0      cat          4.0     0.625
# 1  penguin          2.0     0.250
# 2      dog          4.0     0.625
# 3   spider          8.0     1.000
# 4    snake          NaN       NaN
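Ranking and sorting are often used together. As a rough sketch, the snippet below ranks the same data in descending order (ascending=False, which the examples above do not show) and then sorts by that rank. The column name desc_rank and the variable leaderboard are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})

# Rank 1 goes to the largest value; tied values share the lowest
# rank in their group, and NaN keeps a NaN rank (na_option='keep').
df['desc_rank'] = df['Number_legs'].rank(ascending=False, method='min')

# Sort by the rank; a stable sort keeps tied rows in their original
# order, and NaN ranks are placed last by default.
leaderboard = df.sort_values(by='desc_rank', kind='stable')
print(leaderboard['Animal'].tolist())
# ['spider', 'cat', 'dog', 'penguin', 'snake']
```

Because cat and dog tie at rank 2, the next rank assigned is 4 (penguin); method='dense' would give 3 instead.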
Date: February 4, 2024
