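Data Cleaning and Preparation
Discretization and Binning
6 8 9 4 ===> 2 3 4 1 # replace each value with its rank among the values
# The idea behind discretization: when the values are too large to allocate an array indexed by them, but there are only a few distinct values, we can map them to small integers instead.
# Discretization preserves the relative order of the values; only their absolute magnitudes change.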
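The rank mapping in the 6 8 9 4 example can be reproduced with NumPy. This is a minimal sketch, assuming the values are distinct or that equal values should share a rank; the variable names are illustrative and not part of the original notes.

```python
import numpy as np

values = np.array([6, 8, 9, 4])
# np.unique returns the sorted distinct values; return_inverse gives, for each
# original element, its index in that sorted array.
sorted_unique, inverse = np.unique(values, return_inverse=True)
ranks = inverse + 1  # shift to 1-based ranks to match the example
print(ranks)  # [2 3 4 1]
```

The order of the values is preserved (9 is still the largest), while the magnitudes shrink to the range 1..4.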
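# Binning: think of it as grouping values into ranges (stages).
import numpy as np
import pandas as pd
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# bins
bins = [18,25,35,60,100] # bin edges used to group the ages
cats = pd.cut(ages,bins) # pd.cut assigns each value to a bin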
cats
# Output:
[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
cats.codes # the integer bin code of each value; the Categorical stores these codes plus an array of the distinct categories
# Output:
array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories # the bin labels (categories)
# Output:
IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]]
closed='right',
dtype='interval[int64]')
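To make the relationship between codes and categories concrete: each code is a position in the categories index. The sketch below rebuilds the per-value intervals from the two pieces; the name `rebuilt` is illustrative only.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# Look up the interval for each value by using its code as a position
# in the categories index.
rebuilt = cats.categories.take(cats.codes)

first = rebuilt[0]
print(first.left, first.right, first.closed)  # 18 25 right
print(all(age in interval for age, interval in zip(ages, rebuilt)))  # True
```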
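pd.value_counts(cats) # count the values in each bin
# Counts how many values fall into each range; the result is sorted by count in descending order
# Note: pd.value_counts is deprecated in recent pandas releases; pd.Series(cats).value_counts() is the equivalent call
# Output: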
(18, 25] 5
(35, 60] 3
(25, 35] 3
(60, 100] 1
dtype: int64
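pd.cut(ages,[18,26,36,61,100],right=False) # right controls which side of each interval is closed; right=False gives intervals closed on the left and open on the right
# Output: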
[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]
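The effect of `right` is easiest to see on a value that sits exactly on an edge. A small sketch, using a shortened edge list chosen only for illustration:

```python
import pandas as pd

edges = [18, 25, 35]
right_closed = pd.cut([25], edges)               # intervals (18, 25], (25, 35]
left_closed = pd.cut([25], edges, right=False)   # intervals [18, 25), [25, 35)

# With right=True (the default) 25 belongs to the first bin;
# with right=False it moves to the second bin.
print(right_closed.codes[0], left_closed.codes[0])  # 0 1
```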
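names = ['Youth','YoungAdult','MiddleAged','Senior'] # you can supply your own bin names
pd.cut(ages,bins,labels=names)
# Output:
[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]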
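As a sketch of how named bins might be used downstream, the block below counts the values per label and also shows `labels=False`, which returns only the integer bin codes. The label strings here are illustrative English names.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

labeled = pd.cut(ages, bins, labels=names)
# Count per label, then reorder the result to follow the bin order.
counts = pd.Series(labeled).value_counts().reindex(names)
print(counts.tolist())  # [5, 3, 3, 1]

# labels=False skips the labels and returns the integer bin codes.
codes_only = pd.cut(ages, bins, labels=False)
print(codes_only.tolist())  # [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1]
```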
data = np.random.rand(20)
data
# Output:
array([ 0.04996272, 0.09751859, 0.93201166, 0.52240638, 0.02292138,
0.93153349, 0.51292955, 0.04350894, 0.58364788, 0.44534584,
0.31083907, 0.9286763 , 0.66816617, 0.77377502, 0.18961133,
0.66365819, 0.23383481, 0.53767344, 0.64420233, 0.67658029])
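pd.cut(data,4,precision = 2) # precision=2 sets the decimal precision used in the bin labels
# Passing an integer sets the number of bins; pandas computes equal-width bins from the minimum and maximum of the data
# Output: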
[(0.022, 0.25], (0.022, 0.25], (0.7, 0.93], (0.48, 0.7], (0.022, 0.25], ..., (0.48, 0.7], (0.022, 0.25], (0.48, 0.7], (0.48, 0.7], (0.48, 0.7]]
Length: 20
Categories (4, interval[float64]): [(0.022, 0.25] < (0.25, 0.48] < (0.48, 0.7] < (0.7, 0.93]]
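To check the equal-width claim, `retbins=True` makes `pd.cut` return the computed edges as well. A minimal sketch with a seeded generator (the seed and names are assumptions for reproducibility). In the pandas versions I am aware of, the lowest edge is nudged slightly below the minimum so that the minimum falls inside the first bin, which is why the first width is compared separately.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.random(20)

cats, edges = pd.cut(data, 4, retbins=True)
widths = np.diff(edges)
expected = (data.max() - data.min()) / 4

print(len(edges))                          # 5 edges for 4 bins
print(np.allclose(widths[1:], expected))   # bins 2..4 have equal width
print(edges[0] < data.min())               # first edge sits just below the minimum
```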
# qcut bins the data by sample quantiles, so each bin ends up with (roughly) the same number of values.
data = np.random.randn(1000)
data
# Output:
array([ 4.85634729e-02, -1.34054158e+00, 2.72231862e-02,
2.46942122e-01, 7.44757369e-01, 2.04112537e+00,
2.88554056e-01, -1.59376789e-01, 1.03820893e+00,
2.34362566e-01, 1.26033030e-01, -5.30489341e-01,
3.35935612e-01, -1.28030309e+00, -1.82161864e+00,
1.24622137e+00, 1.79109860e+00, -1.11492088e+00,
-2.72757886e-01, 2.00095126e+00, 6.77932950e-02,
-1.61718635e+00, 8.86037558e-01, -3.68608873e-01,
-4.87571678e-01, 1.07434758e-02, 9.03368472e-01,
-2.00666200e+00, -4.62522278e-01, -3.19588645e-01,
......
cats = pd.qcut(data,4)
cats
# Output:
[(-0.0313, 0.704], (-3.031, -0.727], (-0.0313, 0.704], (-0.0313, 0.704], (0.704, 3.386], ..., (-0.727, -0.0313], (-3.031, -0.727], (-0.727, -0.0313], (-3.031, -0.727], (0.704, 3.386]]
Length: 1000
Categories (4, interval[float64]): [(-3.031, -0.727] < (-0.727, -0.0313] < (-0.0313, 0.704] < (0.704, 3.386]]
pd.value_counts(cats)
# Output:
(0.704, 3.386] 250
(-0.0313, 0.704] 250
(-0.727, -0.0313] 250
(-3.031, -0.727] 250
dtype: int64
cats = pd.qcut(data,[0,0.1,0.5,0.9,1.])
pd.value_counts(cats)
# Output:
(-0.0313, 1.309] 400
(-1.382, -0.0313] 400
(1.309, 3.386] 100
(-3.031, -1.382] 100
dtype: int64
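The custom-quantile call can be cross-checked against NumPy. This sketch, assuming a seeded generator in place of the unseeded data above, confirms that the edges returned by `pd.qcut` match `np.quantile` and that the bin sizes follow the 10% / 40% / 40% / 10% split.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
quantiles = [0, 0.1, 0.5, 0.9, 1.0]

cats, edges = pd.qcut(data, quantiles, retbins=True)
# Count per bin, ordered by interval rather than by count.
counts = pd.Series(cats).value_counts().sort_index()

print(np.allclose(edges, np.quantile(data, quantiles)))
print(counts.tolist())
```

With 1000 distinct values the counts come out as 100, 400, 400 and 100; data with many tied values may split less evenly.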