Kaggle -- Machine Learning Competitions
Strategy 1:
1 X1 = home_data.drop(['Id', 'SalePrice'], axis=1)
2 Impute every object and int column in X1 with strategy='most_frequent'; store the result in imputed_X1
3 Filter the columns of imputed_X1: keep the object columns with fewer than 10 distinct values plus the numeric columns, and store the selection in candidate_X
4 One-hot encode candidate_X and store the result in one_hot_encoded_X
5 Use one_hot_encoded_X as X and evaluate model performance
Results:
Series([], dtype: int64)
Validation MAE when not specifying max_leaf_nodes: 32,137
Validation MAE for best value of max_leaf_nodes: 30,219
Validation MAE for Random Forest Model: 24,967
This strategy raised an error when the test set was passed to the model; see E:\kaggle\Exercise_Machine_Learning_Competitions for details.
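A minimal sketch of Strategy 1's preprocessing, assuming home_data has already been read with pd.read_csv; the variable names follow the notes. most_frequent is the only SimpleImputer strategy that accepts object columns, and one trap (a plausible source of the error above) is that imputing a mixed-dtype frame returns an all-object array, after which get_dummies would one-hot encode the numeric columns too; infer_objects() restores the numeric dtypes.

import pandas as pd
from sklearn.impute import SimpleImputer

X1 = home_data.drop(['Id', 'SalePrice'], axis=1)

# fit_transform returns a numpy array; wrap it back into a DataFrame to
# keep the column names, then restore the numeric dtypes.
imputed_X1 = pd.DataFrame(
    SimpleImputer(strategy='most_frequent').fit_transform(X1),
    columns=X1.columns).infer_objects()

# Keep object columns with fewer than 10 distinct values plus all numerics.
low_card_cols = [c for c in imputed_X1.columns
                 if imputed_X1[c].dtype == 'object'
                 and imputed_X1[c].nunique() < 10]
numeric_cols = [c for c in imputed_X1.columns
                if imputed_X1[c].dtype in ['int64', 'float64']]
candidate_X = imputed_X1[low_card_cols + numeric_cols]

one_hot_encoded_X = pd.get_dummies(candidate_X)

Another common failure at prediction time is that get_dummies, applied separately to train and test, produces different column sets; the align step in Strategy 8 below addresses exactly that.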
Strategy 2 (version 4):
1 Count how many columns are object-typed and how many are numeric
2 Drop all object columns and keep only the numeric ones; store them in X (X also drops the 'Id' and 'SalePrice' columns)
3 Check the number of nulls in each column of X, then impute with most_frequent and store the result in imputed_X
4 Use imputed_X as X and evaluate model performance
Results:
Validation MAE when not specifying max_leaf_nodes: 27,857
Validation MAE for best value of max_leaf_nodes: 24,477
Validation MAE for Random Forest Model: 18,383
Score: 18010.152, rank 803
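A minimal sketch of Strategy 2, again assuming home_data is already loaded; names follow the notes.

import pandas as pd
from sklearn.impute import SimpleImputer

y = home_data['SalePrice']
X = home_data.drop(['Id', 'SalePrice'], axis=1).select_dtypes(exclude=['object'])

# Inspect how many nulls each numeric column has before imputing.
null_counts = X.isnull().sum()
print(null_counts[null_counts > 0])

imputed_X = pd.DataFrame(
    SimpleImputer(strategy='most_frequent').fit_transform(X),
    columns=X.columns)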
Strategy 3 (version 6):
1 Count how many columns are object-typed and how many are numeric
2 Drop all object columns and keep only the numeric ones; store them in X (X also drops the 'Id' and 'SalePrice' columns)
3 Check the number of nulls in each column of X
4 Add NaN-indicator columns to X
5 Impute with most_frequent and store the result in imputed_X
6 Use imputed_X as X and evaluate model performance
7 Read the test data into test_data
8 Align the columns of test_X with those of X
9 Add X's NaN-indicator columns to test_X and store the result in test_X_plus
10 Impute test_X_plus and store the result in imputed_test_X
11 Predict on imputed_test_X to get test_preds
12 Submit test_preds
Validation MAE when not specifying max_leaf_nodes: 27,763
Validation MAE for best value of max_leaf_nodes: 25,628
Validation MAE for Random Forest Model: 18,588
Score: 18004.81481, rank 801
Weakness: X's NaN-indicator columns do not transfer to test_X, because the nulls sit in different columns in the two datasets.
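A sketch of the indicator step, assuming X and test_X are the frames from steps 2 and 8. Deriving the indicator list from X alone is exactly the weakness noted above; newer scikit-learn versions sidestep it with SimpleImputer(add_indicator=True), which always emits indicators for the columns seen at fit time.

# Columns with nulls are determined on the training frame only.
cols_with_missing = [c for c in X.columns if X[c].isnull().any()]

X_plus = X.copy()
test_X_plus = test_X.copy()
for col in cols_with_missing:
    X_plus[col + '_was_missing'] = X_plus[col].isnull()
    # test_X may have nulls in other columns, which get no indicator here
    test_X_plus[col + '_was_missing'] = test_X_plus[col].isnull()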
Strategy 4 (version 8):
1 Read home_data and test_data
2 home_data.drop(['Id', 'SalePrice'], axis=1); test_data.drop(['Id'], axis=1)
3 all_data_predictors = home_data_predictors.append(test_data_predictors)
4 Select only the object-typed columns from all_data_predictors and store them in object_all_data_predictors
5 Select only the numeric columns from all_data_predictors and store them in numeric_all_data_predictors
6 Impute numeric_all_data_predictors (mean) and store the result in inputed_numeric_all_data_predictors
7 Drop the object columns that contain nulls from object_all_data_predictors and store the rest in reduced_object_all_data_predictors
8 One-hot encode reduced_object_all_data_predictors and store the result in one_hot_object_predictors
9 Reset the index of one_hot_object_predictors; otherwise the concat fails because the indexes do not match
10 Concat one_hot_object_predictors and inputed_numeric_all_data_predictors into final_all_data
11 Split final_all_data back into final_train_data and final_test_data in the original proportions
12 Build the model on final_train_data
Results:
Validation MAE when not specifying max_leaf_nodes: 26,930
Validation MAE for best value of max_leaf_nodes: 25,834
Validation MAE for Random Forest Model: 18,049
Score: 17509.02235, rank 757
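A sketch of Strategy 4's combined preprocessing, with the notes' variable names (including the "inputed" spelling). DataFrame.append was removed in pandas 2.0, so pd.concat stands in for step 3.

import pandas as pd
from sklearn.impute import SimpleImputer

home_data_predictors = home_data.drop(['Id', 'SalePrice'], axis=1)
test_data_predictors = test_data.drop(['Id'], axis=1)
all_data_predictors = pd.concat([home_data_predictors, test_data_predictors])

object_all_data_predictors = all_data_predictors.select_dtypes(include=['object'])
numeric_all_data_predictors = all_data_predictors.select_dtypes(exclude=['object'])

inputed_numeric_all_data_predictors = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(numeric_all_data_predictors),
    columns=numeric_all_data_predictors.columns)

# Drop object columns with any nulls (in either dataset, since the frames
# were concatenated first), then one-hot encode what remains.
reduced_object_all_data_predictors = object_all_data_predictors.dropna(axis=1)
one_hot_object_predictors = pd.get_dummies(reduced_object_all_data_predictors)
# Reset the index so the final concat lines the rows up positionally.
one_hot_object_predictors = one_hot_object_predictors.reset_index(drop=True)

final_all_data = pd.concat(
    [one_hot_object_predictors, inputed_numeric_all_data_predictors], axis=1)

# Split back into train and test in the original proportions.
n_train = len(home_data_predictors)
final_train_data = final_all_data[:n_train]
final_test_data = final_all_data[n_train:]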
Strategy 5 (version 9):
1 Read home_data and test_data
2 home_data.drop(['Id', 'SalePrice'], axis=1); test_data.drop(['Id'], axis=1)
3 all_data_predictors = home_data_predictors.append(test_data_predictors)
4 Select only the object-typed columns from all_data_predictors and store them in object_all_data_predictors
5 Select only the numeric columns from all_data_predictors and store them in numeric_all_data_predictors
6 Impute numeric_all_data_predictors (mean) and store the result in inputed_numeric_all_data_predictors
7 Drop the object columns that contain nulls from object_all_data_predictors and store the rest in reduced_object_all_data_predictors
8 Select the object columns of reduced_object_all_data_predictors with fewer than 10 distinct values and store them in low_reduced_object_all_data_predictors (the difference from Strategy 4)
9 One-hot encode low_reduced_object_all_data_predictors and store the result in one_hot_object_predictors
10 Reset the index of one_hot_object_predictors; otherwise the concat fails because the indexes do not match
11 Concat one_hot_object_predictors and inputed_numeric_all_data_predictors into final_all_data
12 Split final_all_data back into final_train_data and final_test_data in the original proportions
13 Build the model on final_train_data
Results:
Validation MAE when not specifying max_leaf_nodes: 26,713
Validation MAE for best value of max_leaf_nodes: 24,391
Validation MAE for Random Forest Model: 17,688
Score: 17415.19931, rank 750
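The only new step is the cardinality filter (step 8); a sketch, with the threshold of 10 from the notes:

low_card_cols = [c for c in reduced_object_all_data_predictors.columns
                 if reduced_object_all_data_predictors[c].nunique() < 10]
low_reduced_object_all_data_predictors = (
    reduced_object_all_data_predictors[low_card_cols])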
Strategy 6 (version 10):
1 Read home_data and test_data
2 home_data.drop(['Id', 'SalePrice'], axis=1); test_data.drop(['Id'], axis=1)
3 all_data_predictors = home_data_predictors.append(test_data_predictors)
4 Select only the object-typed columns from all_data_predictors and store them in object_all_data_predictors
5 Select only the numeric columns from all_data_predictors and store them in numeric_all_data_predictors
6 Impute numeric_all_data_predictors (mean) and store the result in inputed_numeric_all_data_predictors
7 Impute object_all_data_predictors (most_frequent) and store the result in inputed_object_all_data_predictors (the difference from Strategy 5)
8 Select the object columns of inputed_object_all_data_predictors with fewer than 10 distinct values and store them in low_inputed_object_all_data_predictors
9 One-hot encode low_inputed_object_all_data_predictors and store the result in one_hot_object_predictors
10 Reset the index of one_hot_object_predictors; otherwise the concat fails because the indexes do not match
11 Concat one_hot_object_predictors and inputed_numeric_all_data_predictors into final_all_data
12 Split final_all_data back into final_train_data and final_test_data in the original proportions
13 Build the model on final_train_data
Results:
Validation MAE when not specifying max_leaf_nodes: 25,369
Validation MAE for best value of max_leaf_nodes: 22,942
Validation MAE for Random Forest Model: 18,181
Score: 17262.91165, rank 734
Strategy 6 imputes the nulls in the object columns instead of dropping the columns that contain them. Its validation MAE is actually worse than Strategy 5's column-dropping approach, but its submitted score is better.
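Strategy 6's change, sketched with the frames from the earlier steps: impute the object columns instead of dropping the ones that contain nulls.

inputed_object_all_data_predictors = pd.DataFrame(
    SimpleImputer(strategy='most_frequent').fit_transform(
        object_all_data_predictors),
    columns=object_all_data_predictors.columns)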
Strategy 7 (version 11):
1 Read home_data and test_data
2 home_data.drop(['Id', 'SalePrice'], axis=1); test_data.drop(['Id'], axis=1)
3 all_data_predictors = home_data_predictors.append(test_data_predictors)
4 Select only the object-typed columns from all_data_predictors and store them in object_all_data_predictors
5 Select only the numeric columns from all_data_predictors and store them in numeric_all_data_predictors
6 Impute numeric_all_data_predictors (mean) and store the result in inputed_numeric_all_data_predictors
7 Impute object_all_data_predictors (most_frequent) and store the result in inputed_object_all_data_predictors (the difference from Strategy 5)
8 Select the object columns of inputed_object_all_data_predictors with fewer than 10 distinct values and store them in low_inputed_object_all_data_predictors
9 One-hot encode low_inputed_object_all_data_predictors and store the result in one_hot_object_predictors
10 Reset the index of one_hot_object_predictors; otherwise the concat fails because the indexes do not match
11 Concat one_hot_object_predictors and inputed_numeric_all_data_predictors into final_all_data
12 Split final_all_data back into final_train_data and final_test_data in the original proportions
13 Build the model on final_train_data
Results:
Validation MAE when not specifying max_leaf_nodes: 25,369
Validation MAE for best value of max_leaf_nodes: 22,942
Validation MAE for Random Forest Model: 18,181
Validation MAE for XGBoost Model: 14,643 (the difference from Strategy 6: an XGBoost model is used)
Score: 14815.09764, rank 289
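A sketch of the XGBoost step, assuming train_X/val_X/train_y/val_y come from a train_test_split of final_train_data; the hyperparameters are assumptions, since the notes do not record them.

from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical settings; tune n_estimators and learning_rate in practice.
xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, random_state=1)
xgb_model.fit(train_X, train_y)
val_preds = xgb_model.predict(val_X)
print("Validation MAE for XGBoost Model: {:,.0f}".format(
    mean_absolute_error(val_y, val_preds)))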
Strategy 8 (version 14):
1 Read home_data and test_data
2 home_data_copy = home_data.copy()
3 Define attributes_drop, the list of columns to drop
4 X = home_data_copy.drop(attributes_drop, axis=1)
5 X = pd.get_dummies(X)
6 train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
7 Impute train_X_inputed and val_X_inputed with SimpleImputer
8 Train an XGBRegressor model and check its validation score
9 test_X = test_data.copy()
10 test_X = test_data.drop(attributes_drop, axis=1)
11 test_X = pd.get_dummies(test_X)
12 final_train, final_test = X.align(test_X, join='left', axis=1)
13 Impute final_train and final_test with SimpleImputer
14 Train an XGBRegressor model and submit the results
Validation MAE for XGBoost Model: 15,272
Score: 14200.99245, rank 161
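The align call in step 12 is what finally keeps the train and test dummy columns consistent:

# join='left' keeps exactly the training columns: dummies that appear only
# in the test set are discarded, and training dummies missing from the test
# set come back as NaN columns, which the SimpleImputer pass then fills.
final_train, final_test = X.align(test_X, join='left', axis=1)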
Strategy 9 (Version 21):
1 Read home_data and test_data
2 attributes_drop = ['SalePrice', 'MiscVal', 'MSSubClass', 'MoSold', 'YrSold', 'GarageArea', 'GarageYrBlt', 'TotRmsAbvGrd']
3 X = home_data_copy.drop(attributes_drop, axis=1)
4 X = pd.get_dummies(X, dummy_na=True)
5 test_X = test_data.drop(attributes_drop, axis=1)
6 test_X = pd.get_dummies(test_X, dummy_na=True)
7 final_train, final_test = X.align(test_X, join='left', axis=1)
8 final_train_imputed = my_imputer.fit_transform(final_train)
9 final_test_imputed = my_imputer.transform(final_test)
10 Evaluate the XGBoost score
11 Train three models: StackingCVRegressor, ridge, xgboost
12 Build a multi-model prediction function: blend_models_predict
13 test_preds = blend_models_predict(final_test_imputed)
Score: 13448.88653, rank 86
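A hedged sketch of the blending, assuming y is the SalePrice target and final_train_imputed/final_test_imputed come from steps 8 and 9; the estimators' hyperparameters and the blend weights are assumptions, since the notes only name StackingCVRegressor, ridge, and xgboost.

import numpy as np
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from mlxtend.regressor import StackingCVRegressor

ridge = Ridge(alpha=10.0)
xgb = XGBRegressor(n_estimators=1000, learning_rate=0.05)
stack = StackingCVRegressor(regressors=(ridge, xgb), meta_regressor=xgb,
                            use_features_in_secondary=True)

ridge.fit(final_train_imputed, y)
xgb.fit(final_train_imputed, y)
stack.fit(np.array(final_train_imputed), np.array(y))

def blend_models_predict(X):
    # Weighted average of the three fitted models (weights are illustrative).
    return (0.3 * ridge.predict(X)
            + 0.4 * xgb.predict(X)
            + 0.3 * stack.predict(np.array(X)))

test_preds = blend_models_predict(final_test_imputed)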
Strategy 10 (Version 24):
Score: 12267.88649, rank 17
Strategy 11 (version 3):
Score: 12145.563, rank 6
1 Read train and test
2 Remove extreme outliers: train = train[train.GrLivArea < 4500]
3 Take the log of the target column: train["SalePrice"] = np.log1p(train["SalePrice"])
4 Merge the training and test sets: features = pd.concat([train_features, test_features])
5 Some columns are numeric but really categorical (year, month, day, etc.); convert them, e.g. features['MSSubClass'] = features['MSSubClass'].apply(str)
6 Fill the missing values
7 Build three pipelines: one to transform the columns with skew > 0.5 (boxcox1p), one to construct features, and one to encode certain categorical columns (LabelEncoder)
8 get_dummies
9 Split features into X and X_sub
10 Drop outliers (how they were found is unclear)
11 Drop overfitting columns (how they were found is unclear)
12 Run cross-validation
13 Train the models
14 Blend the models according to their cross-validation scores
15 After getting the blended predictions, invert the transform: submission.iloc[:, 1] = np.floor(np.expm1(blend_models_predict(X_sub)))
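A sketch of the two transforms that drive this strategy, assuming train and features are the frames from steps 1 and 4; the boxcox1p lambda of 0.15 is an assumption borrowed from common House Prices kernels.

import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Step 3: move the target to the log scale; step 15 inverts it with expm1.
train["SalePrice"] = np.log1p(train["SalePrice"])

# Step 7: Box-Cox transform every numeric feature whose skew exceeds 0.5.
numeric_cols = features.select_dtypes(exclude=['object']).columns
skews = features[numeric_cols].apply(lambda s: skew(s.dropna()))
for col in skews[skews > 0.5].index:
    features[col] = boxcox1p(features[col], 0.15)  # assumed lambda of 0.15

# Step 15: predictions come back on the log scale, so undo the transform:
# submission.iloc[:, 1] = np.floor(np.expm1(blend_models_predict(X_sub)))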
Posted on 2019-03-01 15:59 by wangzhonghan