DataJam

Data Preprocessing and Feature Engineering: Handling Missing Values

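1. The missing-value handling module in sklearn

Module used: sklearn.impute.SimpleImputer

Note: a feature matrix in sklearn must be two-dimensional, so before operating on a single column you need to convert it to a 2-D array with reshape(-1, 1).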
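The reshape step can be illustrated with a minimal sketch. The array values below are made up for illustration and are not taken from the data set used in this post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single feature stored as a 1-D array (illustrative values only)
age = np.array([22.0, np.nan, 30.0, 38.0])
print(age.shape)       # 1-D: (4,)

# reshape(-1, 1) keeps the values but arranges them as one column
age_2d = age.reshape(-1, 1)
print(age_2d.shape)    # 2-D: (4, 1)

# The 2-D column can now be passed to SimpleImputer (default strategy: mean)
filled = SimpleImputer().fit_transform(age_2d)
print(filled.ravel())  # the NaN is replaced by the mean of 22, 30 and 38, i.e. 30
```

The -1 tells NumPy to infer the number of rows, so the same call works for a column of any length.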

Example from the official documentation:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
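One detail of the example above is worth spelling out: the fill values come from the data passed to fit, not from the data passed to transform. A short sketch that inspects the fitted statistics_ attribute shows this.

```python
import numpy as np
from sklearn.impute import SimpleImputer

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# statistics_ holds the per-column fill values learned during fit:
# column means 7, 3.5 and 6 (NaN entries are ignored when averaging)
print(imp_mean.statistics_)

# transform reuses those learned means, whatever the new data looks like
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
X_filled = imp_mean.transform(X)
print(X_filled)
```

This is why the missing value in the first column of X becomes 7, even though the other values in that column of X are 4 and 10.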

Titanic example:

import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv(r'F:\Python\Narrativedata.csv')

# Fill in Age
Age = data.loc[:, "Age"].values.reshape(-1, 1)  # the feature matrix in sklearn must be 2-D
Age[:20]

imp_mean = SimpleImputer()  # instantiate; fills with the mean by default
imp_median = SimpleImputer(strategy="median")  # fill with the median
imp_0 = SimpleImputer(strategy="constant", fill_value=0)  # fill with 0

# fit_transform fits the imputer and returns the filled result in one step
age_mean = imp_mean.fit_transform(Age)
age_median = imp_median.fit_transform(Age)
age_0 = imp_0.fit_transform(Age)
age_mean[:20]
age_median[:20]
age_0[:20]

# Here we use the median to fill in Age (ravel turns the result back into 1-D)
data.loc[:, "Age"] = age_median.ravel()
data.info()

# Fill in Embarked with the mode (most frequent value)
Embarked = data.loc[:, "Embarked"].values.reshape(-1, 1)
imp_mode = SimpleImputer(strategy="most_frequent")
data.loc[:, "Embarked"] = imp_mode.fit_transform(Embarked).ravel()
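Because the CSV file is not included with this post, here is a small self-contained sketch of the same two steps on a toy DataFrame. The column names match the Titanic example, but the values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the Titanic data (hypothetical values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Numeric column: fill with the median of the observed ages (22, 26, 35 -> 26)
age = toy["Age"].to_numpy().reshape(-1, 1)
toy["Age"] = SimpleImputer(strategy="median").fit_transform(age).ravel()

# Categorical column: fill with the most frequent value ("S" appears twice)
embarked = toy["Embarked"].to_numpy(dtype=object).reshape(-1, 1)
toy["Embarked"] = SimpleImputer(strategy="most_frequent").fit_transform(embarked).ravel()

print(toy)
```

The mean and median strategies only make sense for numeric columns; most_frequent and constant also work on string columns such as Embarked.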

 

(Image source: the instructor 菜菜)

2. Filling missing values with random forest regression

(To be added)
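Until this section is written up, the general idea can be sketched as follows. This is only one common approach, not necessarily the one the post will end up using: treat the column with missing values as the regression target, train on the rows where it is known, and predict it for the rows where it is missing. The synthetic data and the column names x1, x2 and target are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic data (hypothetical): target depends on two complete features
df = pd.DataFrame({"x1": rng.rand(100), "x2": rng.rand(100)})
df["target"] = 3 * df["x1"] + 2 * df["x2"]

# Knock out 20 target values to simulate missing data
missing_idx = rng.choice(100, size=20, replace=False)
df.loc[missing_idx, "target"] = np.nan

# Split rows by whether the target is known
known = df[df["target"].notna()]
unknown = df[df["target"].isna()]

# Train on the known rows, using the complete columns as features
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(known[["x1", "x2"]], known["target"])

# Predict the missing values and write them back
df.loc[unknown.index, "target"] = rfr.predict(unknown[["x1", "x2"]])
print(df["target"].isna().sum())
```

When several columns have missing values, a typical extension is to fill them one at a time, starting with the column that has the fewest gaps and temporarily filling the others with a constant.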

3. Filling missing values with pandas

import pandas as pd

data = pd.read_csv(r"C:\work\learnbetter\micro-class\week 3 Preprocessing\Narrativedata.csv", index_col=0)
data.head()

# .fillna fills missing values directly on the DataFrame column
data.loc[:, "Age"] = data.loc[:, "Age"].fillna(data.loc[:, "Age"].median())

# .dropna(axis=0) drops every row that contains a missing value;
# .dropna(axis=1) drops every column that contains a missing value.
# inplace=True modifies the original data set; inplace=False (the default)
# returns a modified copy and leaves the original data unchanged.
data.dropna(axis=0, inplace=True)
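The effect of axis and inplace is easier to see on a tiny DataFrame. The sketch below uses made-up values rather than the CSV above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Fill Age with its median (median of 22, 26, 35 is 26)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Without inplace=True, dropna returns a new object and leaves toy untouched
rows_dropped = toy.dropna(axis=0)  # drops the one row with a missing Embarked
cols_dropped = toy.dropna(axis=1)  # drops the Embarked column entirely
print(rows_dropped.shape, cols_dropped.shape, toy.shape)

# With inplace=True, toy itself is modified and the call returns None
result = toy.dropna(axis=0, inplace=True)
print(result, toy.shape)
```

Dropping columns (axis=1) is usually reserved for features with so many missing values that filling them would add little information.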

 

Posted on 2021-06-08 16:13 by DataJam
