Data Preprocessing and Feature Engineering: Handling Missing Values
1. The missing-value handling module in sklearn
Module used: sklearn.impute.SimpleImputer
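Note: sklearn requires the feature matrix to be two-dimensional, so when you work on a single column you must first convert it to a 2D array, for example with reshape(-1, 1).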
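To illustrate the note above, here is a minimal sketch on a small made-up array (the values are hypothetical, not taken from the Titanic data). It shows how reshape(-1, 1) turns a 1D column into the 2D shape that SimpleImputer expects.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single column with one missing value
col = np.array([1.0, np.nan, 3.0])
print(col.shape)      # 1D: (3,)

# -1 lets NumPy infer the number of rows; 1 means a single column
col_2d = col.reshape(-1, 1)
print(col_2d.shape)   # 2D: (3, 1)

# The imputer accepts the 2D array and returns a 2D array of the same shape
filled = SimpleImputer(strategy="mean").fit_transform(col_2d)
print(filled.ravel())  # the NaN is replaced by the column mean, 2.0
```

Calling ravel() on the result flattens it back to 1D, which is convenient when writing the values back into a single DataFrame column.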
Example from the official documentation:
    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
    >>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
    SimpleImputer()
    >>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
    >>> print(imp_mean.transform(X))
    [[ 7.   2.   3. ]
     [ 4.   3.5  6. ]
     [10.   3.5  9. ]]
Titanic example:
    import pandas as pd
    from sklearn.impute import SimpleImputer

    data = pd.read_csv(r'F:\Python\Narrativedata.csv')

    # Fill in Age
    Age = data.loc[:, "Age"].values.reshape(-1, 1)  # sklearn requires a 2D feature matrix
    Age[:20]

    imp_mean = SimpleImputer()  # instantiate; fills with the mean by default
    imp_median = SimpleImputer(strategy="median")  # fill with the median
    imp_0 = SimpleImputer(strategy="constant", fill_value=0)  # fill with 0

    imp_mean = imp_mean.fit_transform(Age)  # fit_transform fits and returns the result in one step
    imp_median = imp_median.fit_transform(Age)
    imp_0 = imp_0.fit_transform(Age)

    imp_mean[:20]
    imp_median[:20]
    imp_0[:20]

    # Here we use the median to fill Age
    data.loc[:, "Age"] = imp_median.ravel()  # flatten the (n, 1) result back to 1D
    data.info()

    # Fill Embarked with the mode (most frequent value)
    Embarked = data.loc[:, "Embarked"].values.reshape(-1, 1)
    imp_mode = SimpleImputer(strategy="most_frequent")
    data.loc[:, "Embarked"] = imp_mode.fit_transform(Embarked).ravel()
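The Titanic example depends on a local CSV file, so the four strategies can also be compared on a tiny made-up frame. This is only a sketch: the toy DataFrame and its values are hypothetical stand-ins for the Age and Embarked columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the Titanic columns
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 30.0, np.nan, 70.0],
    "Embarked": ["S", "C", np.nan, "S", "Q"],
})

# Selecting with a list of column names keeps the 2D shape (5, 1)
age = toy[["Age"]].values
by_mean = SimpleImputer(strategy="mean").fit_transform(age).ravel()
by_median = SimpleImputer(strategy="median").fit_transform(age).ravel()
by_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(age).ravel()

print(by_mean)    # missing entries become 40.0, the mean of 20, 30, 70
print(by_median)  # missing entries become 30.0, the median
print(by_zero)    # missing entries become 0.0

# most_frequent also works on string columns held in an object array
embarked = toy[["Embarked"]].to_numpy(dtype=object)
by_mode = SimpleImputer(strategy="most_frequent").fit_transform(embarked).ravel()
print(by_mode)    # the missing entry becomes "S", the most frequent value
```

The mean is pulled upward by the single large value 70 while the median is not, which is one reason the median is a reasonable choice for a skewed column such as Age.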

(Image source: Teacher 菜菜)
2. Filling missing values with random forest regression
(To be added)
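Pending the full write-up, the general idea can be sketched as follows: treat the column with missing values as a regression target, train a RandomForestRegressor on the rows where it is known, and predict it for the rows where it is missing. This is only a minimal sketch on made-up data. The names df, a, b and y are hypothetical, and it assumes the other columns have no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: column "y" depends on "a" and "b" and has gaps
rng = np.random.RandomState(0)
df = pd.DataFrame({"a": rng.rand(50), "b": rng.rand(50)})
df["y"] = 3 * df["a"] + 2 * df["b"]
df.loc[[3, 10, 25], "y"] = np.nan

missing = df["y"].isna()   # boolean mask of rows to fill
features = ["a", "b"]      # assumed complete (no missing values)

# Train on rows where y is known
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(df.loc[~missing, features], df.loc[~missing, "y"])

# Predict y for the rows where it is missing and write the predictions back
filled = df.copy()
filled.loc[missing, "y"] = rfr.predict(df.loc[missing, features])
print(filled["y"].isna().sum())  # 0 missing values remain
```

When several columns have missing values, a common variant fills them one at a time, temporarily filling the other columns with a simple value such as 0 while each model is trained.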
3. Filling missing values in Python with pandas
    import pandas as pd

    data = pd.read_csv(r"C:\work\learnbetter\micro-class\week 3 Preprocessing\Narrativedata.csv", index_col=0)
    data.head()

    # .fillna fills missing values directly in the DataFrame
    data.loc[:, "Age"] = data.loc[:, "Age"].fillna(data.loc[:, "Age"].median())

    # .dropna(axis=0) drops every row that contains a missing value;
    # .dropna(axis=1) drops every column that contains a missing value.
    # inplace=True modifies the original dataset; inplace=False (the default)
    # returns a modified copy and leaves the original data unchanged.
    data.dropna(axis=0, inplace=True)
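The effect of axis and inplace is easiest to see on a small frame. The sketch below uses a made-up DataFrame (the name toy and its values are hypothetical) in place of the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 70.0],
    "Embarked": ["S", "C", np.nan],
})

# fillna returns a new Series; assign it back to keep the result
toy["Age"] = toy["Age"].fillna(toy["Age"].median())  # median of 20 and 70 is 45

# axis=1 drops columns that still contain missing values
print(toy.dropna(axis=1).columns.tolist())  # ['Age']

# inplace=False (the default): a new frame is returned, toy is untouched
dropped = toy.dropna(axis=0)
print(len(toy), len(dropped))  # 3 2

# inplace=True: toy itself is modified and the call returns None
toy.dropna(axis=0, inplace=True)
print(len(toy))  # 2
```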