Python Data Analysis: Sorting and Ranking Data

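1. Sorting data

　　Data can be sorted in ascending order, in descending order, or by a custom criterion. The usual method is df.sort_values(), whose signature is: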

DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Parameters:

by: str or list of str

　　Name or list of names to sort by.

If axis is 0 or 'index', then by may contain index levels and/or column labels.
If axis is 1 or 'columns', then by may contain column levels and/or index labels.

axis: {0 or 'index', 1 or 'columns'}, default 0

　　Axis to be sorted.

ascending: bool or list of bool, default True

　　Sort ascending vs. descending. Specify a list for multiple sort orders. If this is a list of bools, it must match the length of by.

inplace: bool, default False

　　If True, perform the operation in place.

kind: {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'

　　Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.

na_position: {'first', 'last'}, default 'last'

　　'first' puts NaNs at the beginning; 'last' puts NaNs at the end.

ignore_index: bool, default False

　　If True, the resulting axis will be labeled 0, 1, …, n - 1.

key: callable, optional

　　Apply the key function to the values before sorting. This is similar to the key argument of the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It is applied to each column in by independently.

Code examples:

import numpy as np
import pandas as pd

# Example frame: col1 contains a missing value, col4 mixes upper and lower case
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']
})
print(df)

# Output
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F

# Sort by col1 (ascending by default; NaN goes last)
df1 = df.sort_values(by=['col1'])
print(df1)

# Output
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D

# Sort by multiple columns: col1 first, then col2 to break ties
df1 = df.sort_values(by=['col1', 'col2'])
print(df1)

# Output
#   col1  col2  col3 col4
# 1    A     1     1    B
# 0    A     2     0    a
# 2    B     9     9    c
# 5    C     4     3    F
# 4    D     7     2    e
# 3  NaN     8     4    D
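
The ascending parameter also accepts a list with one flag per entry in by. The following is a minimal sketch of that, rebuilding the same example frame so it runs on its own; the variable name mixed is only illustrative. It sorts col1 ascending and breaks ties by col2 descending.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
})

# One ascending flag per column in `by`:
# col1 ascending, then col2 descending within equal col1 values.
mixed = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(mixed)
```

Compared with the plain multi-column sort, the two A rows swap places (col2 = 2 now comes before col2 = 1). The row whose col1 is NaN still goes last, because na_position defaults to 'last'.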

# Sort in descending order
df1 = df.sort_values(by='col1', ascending=False)
print(df1)

# Output
#   col1  col2  col3 col4
# 4    D     7     2    e
# 5    C     4     3    F
# 2    B     9     9    c
# 0    A     2     0    a
# 1    A     1     1    B
# 3  NaN     8     4    D

# Put NaN values first
df1 = df.sort_values(by='col1', ascending=False, na_position='first')
print(df1)

# Output
#   col1  col2  col3 col4
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
# 2    B     9     9    c
# 0    A     2     0    a
# 1    A     1     1    B

# Sort with a key function: compare col4 case-insensitively
df1 = df.sort_values(by='col4', key=lambda col: col.str.lower())
print(df1)

# Output
#   col1  col2  col3 col4
# 0    A     2     0    a
# 1    A     1     1    B
# 2    B     9     9    c
# 3  NaN     8     4    D
# 4    D     7     2    e
# 5    C     4     3    F
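
The examples above keep the original index labels and use the default algorithm. As a small sketch of two of the remaining parameters (the variable name stable is just for illustration), kind='stable' keeps tied rows in their original relative order, and ignore_index=True relabels the result 0, 1, …, n - 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
    'col4': ['a', 'B', 'c', 'D', 'e', 'F'],
})

# Stable sort on a single column: the two 'A' rows keep their original
# relative order. ignore_index=True replaces the old labels with 0..n-1.
stable = df.sort_values(by='col1', kind='stable', ignore_index=True)
print(stable)
```

Resetting the index this way is convenient when the old row labels carry no meaning after sorting. Note that, as stated in the parameter description, kind only takes effect when sorting on a single column or label.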

2. Ranking data

　　Ranking is usually done with the df.rank() method, whose signature is:

DataFrame.rank(axis=0, method='average', numeric_only=False, na_option='keep', ascending=True, pct=False)

　　By default, equal values (ties) are assigned the average of the ranks they would otherwise occupy.

Parameters:

axis: {0 or 'index', 1 or 'columns'}, default 0

　　Index to direct ranking. For Series this parameter is unused and defaults to 0.

method: {'average', 'min', 'max', 'first', 'dense'}, default 'average'

　　How to rank the group of records that have the same value (i.e. ties):

average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in the order they appear in the array
dense: like 'min', but rank always increases by 1 between groups.

numeric_only: bool, default False

　　For DataFrame objects, rank only numeric columns if set to True.

na_option: {'keep', 'top', 'bottom'}, default 'keep'

　　How to rank NaN values:

keep: assign NaN rank to NaN values
top: assign lowest rank to NaN values
bottom: assign highest rank to NaN values

ascending: bool, default True

　　Whether or not the elements should be ranked in ascending order.

pct: bool, default False

　　Whether or not to display the returned rankings in percentile form.

Code examples:

import numpy as np
import pandas as pd

# Example frame: two animals tie on 4 legs, and one value is missing
df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
                                   'spider', 'snake'],
                        'Number_legs': [4, 2, 4, 8, np.nan]})
print(df)

# Output
#     Animal  Number_legs
# 0      cat          4.0
# 1  penguin          2.0
# 2      dog          4.0
# 3   spider          8.0
# 4    snake          NaN

# Default behaviour, with no arguments: ties get the average rank, NaN stays NaN
df['default_rank'] = df['Number_legs'].rank()
print(df)

# Output
#     Animal  Number_legs  default_rank
# 0      cat          4.0           2.5
# 1  penguin          2.0           1.0
# 2      dog          4.0           2.5
# 3   spider          8.0           4.0
# 4    snake          NaN           NaN
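
The ascending parameter reverses the direction of the ranking. A minimal sketch on the same leg counts (the names legs and desc_rank are illustrative only): with ascending=False the largest value receives rank 1.

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan], name='Number_legs')

# Rank from largest to smallest: 8 -> 1.0, the two 4s share (2 + 3) / 2 = 2.5,
# 2 -> 4.0, and NaN stays NaN because na_option defaults to 'keep'.
desc_rank = legs.rank(ascending=False)
print(desc_rank)
```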

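# Tied records all receive the highest rank in their group (method='max');
# method='min' works the same way with the lowest rank
df['max_rank'] = df['Number_legs'].rank(method='max')
print(df[['Animal', 'Number_legs', 'max_rank']])

# Output
#     Animal  Number_legs  max_rank
# 0      cat          4.0       3.0
# 1  penguin          2.0       1.0
# 2      dog          4.0       3.0
# 3   spider          8.0       4.0
# 4    snake          NaN       NaN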
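
To see how the tie-breaking methods differ, it can help to put them side by side. The sketch below does that for the same leg counts; the names legs and methods are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan], name='Number_legs')

# One column per tie-breaking method; the two 4s are the tied group.
methods = pd.DataFrame({
    'value': legs,
    'average': legs.rank(method='average'),
    'min': legs.rank(method='min'),
    'max': legs.rank(method='max'),
    'first': legs.rank(method='first'),
    'dense': legs.rank(method='dense'),
})
print(methods)
```

The tied values get 2.5 under 'average', 2 under 'min', 3 under 'max', and 2 then 3 under 'first' (order of appearance). Under 'dense' they get 2, and the next value (8) gets 3 rather than 4, since dense ranks never skip a number.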

# With na_option='bottom', records with NaN values are placed at the bottom
# of the ranking, i.e. they receive the highest rank
df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
print(df[['Animal', 'Number_legs', 'NA_bottom']])

# Output
#     Animal  Number_legs  NA_bottom
# 0      cat          4.0        2.5
# 1  penguin          2.0        1.0
# 2      dog          4.0        2.5
# 3   spider          8.0        4.0
# 4    snake          NaN        5.0
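
The opposite setting, na_option='top', gives NaN the lowest rank and shifts every other rank up by one. A short sketch under the same data (the names legs and na_top are illustrative):

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan], name='Number_legs')

# NaN is ranked first (1.0); 2 -> 2.0, the two 4s share (3 + 4) / 2 = 3.5,
# and 8 -> 5.0.
na_top = legs.rank(na_option='top')
print(na_top)
```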

# Express the ranking as a percentile rank
df['pct_rank'] = df['Number_legs'].rank(pct=True)
print(df[['Animal', 'Number_legs', 'pct_rank']])

# Output
#     Animal  Number_legs  pct_rank
# 0      cat          4.0     0.625
# 1  penguin          2.0     0.250
# 2      dog          4.0     0.625
# 3   spider          8.0     1.000
# 4    snake          NaN       NaN
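
The percentile values above can be reproduced by hand. As a sketch, assuming the default method='average' and na_option='keep', the percentile rank equals the ordinary rank divided by the number of non-missing values (here 4). The names legs, pct_rank and manual are illustrative.

```python
import numpy as np
import pandas as pd

legs = pd.Series([4, 2, 4, 8, np.nan], name='Number_legs')

pct_rank = legs.rank(pct=True)

# Manual equivalent under the defaults: average rank / count of non-NaN values,
# e.g. 2.5 / 4 = 0.625 for the two tied 4s.
manual = legs.rank() / legs.count()

print(pd.DataFrame({'pct_rank': pct_rank, 'manual': manual}))
```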

Date: February 4, 2024

Posted 2024-02-04 18:52 by 一路狂奔的乌龟
