4-Pandas Data Preprocessing: Discretization and Binning (Equal-Width pd.cut(), Equal-Frequency pd.qcut())

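  When working with continuous data, it is often convenient for analysis to discretize it, that is, to split it into "bins" so that each value is assigned to a small interval.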

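  In Pandas, cut()---> equal-width discretization (bins are defined by value ranges)

        qcut()--> equal-frequency discretization (bins are defined by sample quantiles)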
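
  To make the difference concrete, here is a minimal sketch on a small, made-up Series (hypothetical data, not the COVID-19 dataset used in this post). Passing an integer to pd.cut() asks for that many equal-width bins, while pd.qcut() places the edges at quantiles. With skewed data, most rows end up in the first equal-width bin, whereas the quantile-based bins stay balanced.

```python
import pandas as pd

# Hypothetical skewed data, used only for illustration.
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: 4 bins that each span the same range of values.
equal_width = pd.cut(s, 4)
# Equal-frequency: 4 bins whose edges are the sample quartiles.
equal_freq = pd.qcut(s, 4)

# Count rows per bin, listed in bin order (empty bins are kept as 0).
width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(width_counts)
print(freq_counts)
```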

1. cut(): equal-width discretization, where every interval defined by bins has the same width

  This section uses the same example as the post on sorting and random permutation, namely the COVID-19 data.

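  We operate on the cumulative confirmed cases column. First check its maximum and minimum to get a sense of how many groups to use. Here the data is split into 7 groups, as shown below. Note that the last bin edge is 70000, so any value above it falls outside every bin and is returned as NaN (row 2 in the output):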

>>> df['total_confirm'].max()
677146
>>> df['total_confirm'].min()
1
>>> bins = [0,10000,20000,30000,40000,50000,60000,70000]
>>> pd.cut(df['total_confirm'],bins)[:8]
0        (0.0, 10000.0]
1        (0.0, 10000.0]
2                   NaN
3    (10000.0, 20000.0]
4        (0.0, 10000.0]
5        (0.0, 10000.0]
6    (10000.0, 20000.0]
7        (0.0, 10000.0]
Name: total_confirm, dtype: category
Categories (7, interval[int64]): [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000] <
                                  (40000, 50000] < (50000, 60000] < (60000, 70000]]
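
  The NaN in row 2 comes from a value that lies outside the range covered by bins. The sketch below reproduces this on three made-up values (hypothetical stand-ins for df['total_confirm']) and shows one possible way to keep such rows: appending np.inf as a final edge, which adds an open-ended top bin. Whether that is appropriate depends on the analysis; it is only one option.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for df['total_confirm'].
s = pd.Series([5, 15000, 677146])

bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000]

# 677146 is greater than the last edge (70000), so it matches no bin.
missing_before = pd.cut(s, bins).isna().tolist()
print(missing_before)

# Appending np.inf creates an extra bin (70000, inf] for large values.
open_bins = bins + [np.inf]
binned = pd.cut(s, open_bins)
missing_after = binned.isna().tolist()
print(missing_after)
```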

  The labels parameter replaces these intervals with custom strings (one label per bin):

>>> pd.cut(df['total_confirm'],bins=bins,labels=['A','B','C','D','E','F','G'])[:8]
0      A
1      A
2    NaN
3      B
4      A
5      A
6      B
7      A
Name: total_confirm, dtype: category
Categories (7, object): [A < B < C < D < E < F < G]
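
  As a small complement, the sketch below (again on made-up values) shows two uses of labels: a list of names, which must have exactly one entry per bin, and labels=False, which returns the integer position of each bin instead of a categorical.

```python
import pandas as pd

# Hypothetical values, one per bin.
s = pd.Series([500, 12000, 25000])
bins = [0, 10000, 20000, 30000]  # 4 edges define 3 bins

# Custom names: the list length must equal the number of bins (3).
named = pd.cut(s, bins=bins, labels=['A', 'B', 'C'])
# labels=False: return the 0-based bin index instead of a label.
codes = pd.cut(s, bins=bins, labels=False)

print(named.tolist())
print(codes.tolist())
```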

2. qcut(): equal-frequency discretization, where every interval holds (roughly) the same number of samples

# Split into 8 equal-frequency bins
>>> bs = pd.qcut(df['total_confirm'],8)
>>> bs[:5]
0         (380.5, 979.5]
1     (2720.75, 8321.25]
2    (8321.25, 677146.0]
3    (8321.25, 677146.0]
4       (979.5, 2720.75]
Name: total_confirm, dtype: category
Categories (8, interval[float64]): [(0.999, 12.0] < (12.0, 35.0] < (35.0, 122.375] <
                                    (122.375, 380.5] < (380.5, 979.5] < (979.5, 2720.75] <
                                    (2720.75, 8321.25] < (8321.25, 677146.0]]

# Check the number of samples in each bin
>>> bs.value_counts()
(0.999, 12.0]          28
(8321.25, 677146.0]    26
(979.5, 2720.75]       26
(2720.75, 8321.25]     25
(380.5, 979.5]         25
(122.375, 380.5]       25
(12.0, 35.0]           25
(35.0, 122.375]        24
Name: total_confirm, dtype: int64

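The counts show that the bins do not hold exactly the same number of samples. The 204 samples cannot be divided evenly into 8 groups, and values tied at a bin edge all land in the same bin. So "equal frequency" here means approximately equal counts, not perfectly equal ones.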
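
One common source of uneven counts is repeated values. The following sketch uses made-up data with many ties (an assumption for illustration, not the dataset above). With 2 quantile bins the median edge falls on the repeated value, so all of its copies go into the first bin. With 4 bins, two quantile edges coincide; qcut() raises a ValueError in that case unless duplicates='drop' is passed, which merges the duplicate edges and returns fewer bins than requested.

```python
import pandas as pd

# Hypothetical data in which the value 1 is repeated five times.
s = pd.Series([0, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Two quantile bins: the median is 1, so every 1 lands in the first bin.
halves = pd.qcut(s, 2)
half_counts = halves.value_counts().sort_index().tolist()
print(half_counts)

# Four quantile bins: the 25% and 50% edges are both 1.
# duplicates='drop' removes the repeated edge instead of raising.
quarters = pd.qcut(s, 4, duplicates='drop')
n_bins = len(quarters.cat.categories)
quarter_counts = quarters.value_counts().sort_index().tolist()
print(n_bins, quarter_counts)
```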

posted @ 2020-07-30 18:15  大脸猫12581