Kaggle - Machine Learning Competitions

Strategy 1:

1 X1 = home_data.drop(['Id', 'SalePrice'], axis=1)

2 Impute all object and int columns of X1 with strategy='most_frequent', storing the result in imputed_X1

3 Filter the columns of imputed_X1: keep the object columns with fewer than 10 distinct values plus all numeric columns, storing the selection in candidate_X

4 One-hot encode candidate_X, storing the result in one_hot_encoded_X

5 Use one_hot_encoded_X as X and evaluate model performance

Results

Series([], dtype: int64)
Validation MAE when not specifying max_leaf_nodes: 32,137
Validation MAE for best value of max_leaf_nodes: 30,219
Validation MAE for Random Forest Model: 24,967

This strategy raised an error when the test set was passed to the model; see E:\kaggle\Exercise_Machine_Learning_Competitions
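The Strategy 1 steps can be sketched as below. The toy home_data frame is an assumption standing in for the Kaggle train.csv, and per-column mode via fillna stands in for SimpleImputer(strategy='most_frequent'):

```python
import pandas as pd

# Toy stand-in for home_data (assumption; the real data is train.csv).
home_data = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'SalePrice': [200000, 150000, 180000, 220000],
    'LotArea': [8450, 9600, None, 11250],
    'Street': ['Pave', None, 'Pave', 'Grvl'],
})

# Step 1: drop the identifier and the target.
X1 = home_data.drop(['Id', 'SalePrice'], axis=1)

# Step 2: most_frequent imputation, sketched as per-column mode.
imputed_X1 = X1.apply(lambda col: col.fillna(col.mode()[0]))

# Step 3: keep numeric columns plus object columns with < 10 distinct values.
low_card = [c for c in imputed_X1.columns
            if imputed_X1[c].dtype == object and imputed_X1[c].nunique() < 10]
numeric = [c for c in imputed_X1.columns if imputed_X1[c].dtype != object]
candidate_X = imputed_X1[low_card + numeric]

# Step 4: one-hot encode the object columns.
one_hot_encoded_X = pd.get_dummies(candidate_X)
```

Note that applying get_dummies separately to a test set with different category values produces different columns, which is the kind of mismatch that caused the error mentioned above.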

 

 

 

Strategy 2 (version 4):

1 Count how many of the columns are object and how many are numeric

2 Drop all object columns and keep only the numeric ones, storing them in X (X also drops the 'Id' and 'SalePrice' columns)

3 Count the missing values in each column of X, then impute with most_frequent, storing the result in imputed_X

4 Use imputed_X as X and evaluate model performance

Results

Validation MAE when not specifying max_leaf_nodes: 27,857
Validation MAE for best value of max_leaf_nodes: 24,477
Validation MAE for Random Forest Model: 18,383

 

Score: 18010.152, rank 803
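The numeric-only pipeline of Strategy 2 can be sketched as follows; the toy home_data frame is an assumption:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for home_data (assumption).
home_data = pd.DataFrame({
    'Id': [1, 2, 3],
    'SalePrice': [200000, 150000, 180000],
    'LotArea': [8450, None, 11250],
    'Street': ['Pave', 'Pave', 'Grvl'],   # object column, dropped below
})

# Steps 1-2: keep only the numeric columns, minus 'Id' and 'SalePrice'.
X = (home_data.drop(['Id', 'SalePrice'], axis=1)
              .select_dtypes(exclude=['object']))

# Step 3: count missing values per column, then impute with most_frequent.
print(X.isnull().sum())
imputer = SimpleImputer(strategy='most_frequent')
imputed_X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
```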

 

Strategy 3 (version 6):

1 Count how many of the columns are object and how many are numeric

2 Drop all object columns and keep only the numeric ones, storing them in X (X also drops the 'Id' and 'SalePrice' columns)

3 Count the missing values in each column of X

4 Add NaN indicator columns to X

5 Impute with most_frequent, storing the result in imputed_X

6 Use imputed_X as X and evaluate model performance

7 Read the test data into test_data

8 Align the columns of test_X with those of X

9 Add X's NaN indicator columns to test_X, storing the result in test_X_plus

10 Impute test_X_plus, storing the result in imputed_test_X

11 Predict with imputed_test_X to get test_preds

12 Submit test_preds

Validation MAE when not specifying max_leaf_nodes: 27,763
Validation MAE for best value of max_leaf_nodes: 25,628
Validation MAE for Random Forest Model: 18,588

 

Score: 18004.81481, rank 801

Shortcoming: X's NaN indicator columns do not carry over to test_X, because the two datasets have missing values in different columns
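The shortcoming can be demonstrated with two toy frames (an assumption): the training set's NaNs sit in one column, the test set's in another, so indicator columns derived from the training set miss the test set's gaps.

```python
import pandas as pd

# Toy frames (assumption): NaNs live in different columns in each set.
X = pd.DataFrame({'LotArea': [8450, None], 'GarageYrBlt': [2003, 1976]})
test_X = pd.DataFrame({'LotArea': [9600, 10000], 'GarageYrBlt': [None, 1998]})

# Step 4: add an indicator column for every training column that has NaNs.
cols_with_nan = [c for c in X.columns if X[c].isnull().any()]
X_plus = X.copy()
for c in cols_with_nan:
    X_plus[c + '_was_missing'] = X[c].isnull()

# Step 9: reusing the *training* indicator list on test_X misses
# GarageYrBlt, which is where the test set's NaNs actually are.
test_X_plus = test_X.copy()
for c in cols_with_nan:
    test_X_plus[c + '_was_missing'] = test_X[c].isnull()
```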

 

Strategy 4 (version 8):

1 Read home_data and test_data

2 home_data.drop(['Id', 'SalePrice'], axis=1), test_data.drop(['Id'], axis=1)

3 all_data_predictors = home_data_predictors.append(test_data_predictors)

4 Select only the object columns from all_data_predictors, storing them in object_all_data_predictors

5 Select only the numeric columns from all_data_predictors, storing them in numeric_all_data_predictors

6 Impute numeric_all_data_predictors (mean), storing the result in inputed_numeric_all_data_predictors

7 Drop the object columns that contain missing values, storing the rest in reduced_object_all_data_predictors

8 One-hot encode reduced_object_all_data_predictors, storing the result in one_hot_object_predictors

9 Reset the index of one_hot_object_predictors; otherwise concat fails because the indexes do not match

10 Concat one_hot_object_predictors with inputed_numeric_all_data_predictors, storing the result in final_all_data

11 Split final_all_data back into final_train_data and final_test_data in the original proportions

12 Build the model on final_train_data

Results:

Validation MAE when not specifying max_leaf_nodes: 26,930
Validation MAE for best value of max_leaf_nodes: 25,834
Validation MAE for Random Forest Model: 18,049

Score: 17509.02235, rank 757
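The combined train+test pipeline of Strategy 4 can be sketched as below. The toy frames are assumptions, and pd.concat stands in for DataFrame.append (removed in pandas 2.0):

```python
import pandas as pd

# Toy predictor frames (assumption).
home_pred = pd.DataFrame({'LotArea': [8450, None], 'Street': ['Pave', 'Grvl'],
                          'Alley': [None, 'Pave']})
test_pred = pd.DataFrame({'LotArea': [9600, 11250], 'Street': ['Pave', 'Pave'],
                          'Alley': ['Grvl', None]})

# Step 3: stack train and test so both get identical encoded columns.
all_pred = pd.concat([home_pred, test_pred])

# Steps 4-5: split by dtype.
object_pred = all_pred.select_dtypes(include=['object'])
numeric_pred = all_pred.select_dtypes(exclude=['object'])

# Step 6: mean-impute the numeric columns.
imputed_numeric = numeric_pred.fillna(numeric_pred.mean())

# Step 7: drop the object columns that contain any NaN.
reduced_object = object_pred.dropna(axis=1)

# Steps 8-10: one-hot encode, reset indexes so concat aligns row by row.
one_hot = pd.get_dummies(reduced_object).reset_index(drop=True)
final_all = pd.concat([one_hot, imputed_numeric.reset_index(drop=True)], axis=1)

# Step 11: split back in the original proportions.
final_train = final_all.iloc[:len(home_pred)]
final_test = final_all.iloc[len(home_pred):]
```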

 

Strategy 5 (version 9):

1 Read home_data and test_data

2 home_data.drop(['Id', 'SalePrice'], axis=1), test_data.drop(['Id'], axis=1)

3 all_data_predictors = home_data_predictors.append(test_data_predictors)

4 Select only the object columns from all_data_predictors, storing them in object_all_data_predictors

5 Select only the numeric columns from all_data_predictors, storing them in numeric_all_data_predictors

6 Impute numeric_all_data_predictors (mean), storing the result in inputed_numeric_all_data_predictors

7 Drop the object columns that contain missing values, storing the rest in reduced_object_all_data_predictors

8 From reduced_object_all_data_predictors, select the object columns with fewer than 10 distinct values, storing them in low_reduced_object_all_data_predictors (the difference from Strategy 4)

9 One-hot encode low_reduced_object_all_data_predictors, storing the result in one_hot_object_predictors

10 Reset the index of one_hot_object_predictors; otherwise concat fails because the indexes do not match

11 Concat one_hot_object_predictors with inputed_numeric_all_data_predictors, storing the result in final_all_data

12 Split final_all_data back into final_train_data and final_test_data in the original proportions

13 Build the model on final_train_data

Results:

Validation MAE when not specifying max_leaf_nodes: 26,713
Validation MAE for best value of max_leaf_nodes: 24,391
Validation MAE for Random Forest Model: 17,688

Score: 17415.19931, rank 750
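The Strategy 5 delta (step 8) is the cardinality filter. A minimal sketch, with a toy frame as an assumption:

```python
import pandas as pd

# Toy NaN-free object frame (assumption): one low-cardinality column and
# one pretend high-cardinality column with 12 distinct values.
reduced_object = pd.DataFrame({
    'Street': ['Pave'] * 11 + ['Grvl'],
    'Neighborhood': ['N%02d' % i for i in range(12)],
})

# Step 8: keep only object columns with fewer than 10 distinct values.
low_cols = [c for c in reduced_object.columns
            if reduced_object[c].nunique() < 10]
low_reduced_object = reduced_object[low_cols]
```

This keeps the one-hot encoding from exploding into hundreds of dummy columns for high-cardinality features.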

 

Strategy 6 (version 10):

1 Read home_data and test_data

2 home_data.drop(['Id', 'SalePrice'], axis=1), test_data.drop(['Id'], axis=1)

3 all_data_predictors = home_data_predictors.append(test_data_predictors)

4 Select only the object columns from all_data_predictors, storing them in object_all_data_predictors

5 Select only the numeric columns from all_data_predictors, storing them in numeric_all_data_predictors

6 Impute numeric_all_data_predictors (mean), storing the result in inputed_numeric_all_data_predictors

7 Impute object_all_data_predictors (most_frequent), storing the result in inputed_object_all_data_predictors (the difference from Strategy 5)

8 From inputed_object_all_data_predictors, select the object columns with fewer than 10 distinct values, storing them in low_inputed_object_all_data_predictors

9 One-hot encode low_inputed_object_all_data_predictors, storing the result in one_hot_object_predictors

10 Reset the index of one_hot_object_predictors; otherwise concat fails because the indexes do not match

11 Concat one_hot_object_predictors with inputed_numeric_all_data_predictors, storing the result in final_all_data

12 Split final_all_data back into final_train_data and final_test_data in the original proportions

13 Build the model on final_train_data

Results:

Validation MAE when not specifying max_leaf_nodes: 25,369
Validation MAE for best value of max_leaf_nodes: 22,942
Validation MAE for Random Forest Model: 18,181

Score: 17262.91165, rank 734

Strategy 6 imputes the missing values in the object columns instead of dropping the columns that contain them. Its validation MAE is actually worse than dropping those columns (worse than Strategy 5), but its submitted score is higher than Strategy 5's.
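The Strategy 6 delta (step 7) is most_frequent imputation of the object columns rather than dropping them. A minimal sketch, with a toy frame as an assumption:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy object frame with missing values (assumption).
object_pred = pd.DataFrame({
    'Alley': ['Pave', np.nan, 'Pave', 'Grvl'],
    'Street': ['Pave', 'Pave', np.nan, 'Grvl'],
})

# Step 7: fill each object column with its most frequent value.
imputer = SimpleImputer(strategy='most_frequent')
imputed_object = pd.DataFrame(imputer.fit_transform(object_pred),
                              columns=object_pred.columns)
```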

 

Strategy 7 (version 11):

1 Read home_data and test_data

2 home_data.drop(['Id', 'SalePrice'], axis=1), test_data.drop(['Id'], axis=1)

3 all_data_predictors = home_data_predictors.append(test_data_predictors)

4 Select only the object columns from all_data_predictors, storing them in object_all_data_predictors

5 Select only the numeric columns from all_data_predictors, storing them in numeric_all_data_predictors

6 Impute numeric_all_data_predictors (mean), storing the result in inputed_numeric_all_data_predictors

7 Impute object_all_data_predictors (most_frequent), storing the result in inputed_object_all_data_predictors (the difference from Strategy 5)

8 From inputed_object_all_data_predictors, select the object columns with fewer than 10 distinct values, storing them in low_inputed_object_all_data_predictors

9 One-hot encode low_inputed_object_all_data_predictors, storing the result in one_hot_object_predictors

10 Reset the index of one_hot_object_predictors; otherwise concat fails because the indexes do not match

11 Concat one_hot_object_predictors with inputed_numeric_all_data_predictors, storing the result in final_all_data

12 Split final_all_data back into final_train_data and final_test_data in the original proportions

13 Build the model on final_train_data

Results:

Validation MAE when not specifying max_leaf_nodes: 25,369
Validation MAE for best value of max_leaf_nodes: 22,942
Validation MAE for Random Forest Model: 18,181
Validation MAE for XGBoost Model: 14,643 (the difference from Strategy 6: an XGBoost model is used)

Score: 14815.09764, rank 289

 

Strategy 8 (version 14):

1 Read home_data and test_data

2 home_data_copy = home_data.copy()

3 Define attributes_drop, the list of columns to drop

4 X = home_data_copy.drop(attributes_drop, axis=1)

5 X = pd.get_dummies(X)

6 train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

7 Impute train_X and val_X with SimpleImputer, storing the results in train_X_inputed and val_X_inputed

8 Train an XGBRegressor model and evaluate its score

9 test_X = test_data.copy()

10 test_X = test_X.drop(attributes_drop, axis=1)

11 test_X = pd.get_dummies(test_X)

12 final_train, final_test = X.align(test_X, join='left', axis=1)

13 Impute final_train and final_test with SimpleImputer

14 Train the XGBRegressor and submit the results

Validation MAE for XGBoost Model: 15,272

Score: 14200.99245, rank 161
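The key Strategy 8 step is align (step 12): separate get_dummies calls on train and test can disagree on columns, and align reconciles them. A minimal sketch with toy frames (an assumption):

```python
import pandas as pd

# Toy frames (assumption): the test set never contains 'Grvl', so its
# get_dummies output is missing the Street_Grvl column.
X = pd.get_dummies(pd.DataFrame({'Street': ['Pave', 'Grvl']}))
test_X = pd.get_dummies(pd.DataFrame({'Street': ['Pave', 'Pave']}))

# join='left' keeps exactly the training columns; dummies the test set
# lacks come back as NaN and are filled by the later imputation step.
final_train, final_test = X.align(test_X, join='left', axis=1)
```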

 

Strategy 9 (version 21):

1 Read home_data and test_data

2 attributes_drop = ['SalePrice', 'MiscVal', 'MSSubClass', 'MoSold', 'YrSold', 'GarageArea', 'GarageYrBlt', 'TotRmsAbvGrd']

3 X = home_data_copy.drop(attributes_drop, axis=1)

4 X = pd.get_dummies(X, dummy_na=True)

5 test_X = test_data.drop(attributes_drop, axis=1)

6 test_X = pd.get_dummies(test_X, dummy_na=True)

7 final_train, final_test = X.align(test_X, join='left', axis=1)

8 final_train_imputed = my_imputer.fit_transform(final_train)

9 final_test_imputed = my_imputer.transform(final_test)

10 Evaluate the xgboost score

11 Train three models: StackingCVRegressor, ridge, xgboost

12 Build a multi-model prediction function: blend_models_predict

13 test_preds = blend_models_predict(final_test_imputed)

Score: 13448.88653, rank 86
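The blend_models_predict idea can be sketched as a weighted average of per-model predictions. Two sklearn models below stand in for the original StackingCVRegressor / ridge / xgboost trio, and the data and equal weights are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

# Synthetic data (assumption) standing in for the imputed training matrix.
rng = np.random.RandomState(1)
X = rng.rand(100, 4)
y = X.sum(axis=1)

# Fit two stand-in models.
ridge = Ridge().fit(X, y)
forest = RandomForestRegressor(n_estimators=20, random_state=1).fit(X, y)

def blend_models_predict(data):
    # Weighted average of the individual model predictions.
    return 0.5 * ridge.predict(data) + 0.5 * forest.predict(data)

preds = blend_models_predict(X)
```

In the actual notebook the weights would be chosen from the cross-validation scores of each model rather than fixed at 0.5.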

 

Strategy 10 (version 24):

Score: 12267.88649, rank 17

 

Strategy 11 (version 3):

Score: 12145.563, rank 6

 

1 Read train and test

2 Remove extreme values: train = train[train.GrLivArea < 4500]

3 Log-transform the target column: train["SalePrice"] = np.log1p(train["SalePrice"])

4 Concatenate the training and test sets: features = pd.concat([train_features, test_features])

5 Some columns are numeric but are really categorical (year, month, day, etc.); convert them to categorical, e.g. features['MSSubClass'] = features['MSSubClass'].apply(str)

6 Fill the missing values

7 Build three pipelines: one to transform the columns with skew > 0.5 (boxcox1p), one to engineer features, and one to encode certain categorical columns (LabelEncoder)

8 get_dummies

9 Split features into X and X_sub

10 Drop outliers (how they were found is unclear)

11 Drop overfit columns (how they were found is unclear)

12 Run cross-validation

13 Train the models

14 Blend the models according to their cross-validation scores

15 After getting the blended model's predictions, transform them back: submission.iloc[:, 1] = np.floor(np.expm1(blend_models_predict(X_sub)))
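The target transform in steps 3 and 15 is a round trip: train on log1p(SalePrice), then undo with expm1 at submission time. A minimal sketch, with the prices below as assumptions:

```python
import numpy as np

# Toy SalePrice values (assumption).
sale_price = np.array([200000.0, 150000.0, 180000.0])

# Step 3: compress the skewed target with log1p before training.
log_target = np.log1p(sale_price)

# Step 15: map blended predictions back to price scale with expm1;
# the notebook floors the result to whole dollars.
restored = np.floor(np.expm1(log_target))
```

Training on the log target makes the error metric behave like a relative error, which suits house prices spanning several orders of magnitude.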

posted on 2019-03-01 15:59 wangzhonghan
