Kaggle Tutorial -- 9 -- Cross-Validation
Cross-Validation
Cross-Validation vs. Train-Test Split
When you have plenty of data, use a Train-Test Split: it is fast.
When data is limited, use Cross-Validation: the model's score is more reliable.
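The trade-off above can be seen side by side. A minimal sketch on a synthetic dataset (the data and the `LinearRegression` model here are illustrative, not part of the examples below):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

# synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = LinearRegression()

# Train-Test Split: one fit, one score -- fast, but the score depends
# on which rows happen to land in the hold-out set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
split_score = model.score(X_test, y_test)

# Cross-Validation: k fits, k scores -- slower, but every row is used
# for validation exactly once, so the mean score is steadier.
cv_scores = cross_val_score(model, X, y, cv=5)
print('split score:', split_score)
print('cv mean score:', cv_scores.mean())
```

With small datasets the single split score can swing noticeably with a different `random_state`, while the cross-validation mean barely moves.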
Example 1:
import pandas as pd
data = pd.read_csv('../input/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer  # the old Imputer was removed in scikit-learn 0.22
my_pipeline = make_pipeline(SimpleImputer(), RandomForestRegressor())
from sklearn.model_selection import cross_val_score
# 'neg_mean_absolute_error' returns the negative MAE (so that higher is always better)
scores = cross_val_score(my_pipeline, X, y, scoring='neg_mean_absolute_error')
print(scores)
print('Mean Absolute Error %.2f' % (-1 * scores.mean()))
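The number of folds can be controlled with the `cv` argument of `cross_val_score`. A runnable sketch of the same pattern; the data here is synthetic (with some injected missing values for the imputer) since `melb_data.csv` may not be on disk:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
X[::10, 0] = np.nan          # missing values for SimpleImputer to fill
y = rng.rand(100) * 1000

pipeline = make_pipeline(SimpleImputer(),
                         RandomForestRegressor(n_estimators=20, random_state=0))

# cv=5 -> five folds, five scores; negate to get the usual positive MAE
scores = cross_val_score(pipeline, X, y,
                         scoring='neg_mean_absolute_error', cv=5)
print('per-fold MAE:', -scores)
print('Mean Absolute Error %.2f' % (-scores.mean()))
```

More folds give a steadier estimate but cost proportionally more fitting time, which is why cross-validation is reserved for smaller datasets.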
Example 2: Cross-Validation + Pipeline
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# final_train_data is the preprocessed training DataFrame from earlier steps
my_pipeline = make_pipeline(SimpleImputer(), XGBRegressor(n_estimators=1000, learning_rate=0.02))
scores = cross_val_score(my_pipeline, final_train_data, y, scoring='neg_mean_absolute_error')
print("Cross-validation scores:")
print(scores)
print('Mean Absolute Error %.2f' % (-1 * scores.mean()))
Example 3: GridSearchCV
from xgboost import XGBRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
# pipeline
XGB_pipeline = make_pipeline(SimpleImputer(), XGBRegressor())
# parameters to tune
para_grid = {'xgbregressor__learning_rate': [0.01, 0.1, 0.5],
             'xgbregressor__n_estimators': [100, 500, 1000]}
grid = GridSearchCV(estimator=XGB_pipeline, param_grid=para_grid, cv=5)
grid.fit(final_train_data, y)
# print the best parameter combination
print(grid.best_params_)
Notes:
1. The parameter grid must be a dict, and each parameter name must be prefixed with the lowercase step name plus a double underscore, i.e. xgbregressor__.
2. Don't put too many parameters into one search, or it will run for a very long time. It helps to understand the parameters you are tuning first, even by reading the relevant published papers.
3. The cv argument of GridSearchCV stands for cross-validation. Cross-validation suits small datasets well because it makes full use of all the data for training; each grid point is scored the same way a cross_val_score call would score it.
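After the search finishes there is more to use than `best_params_`: `best_score_` gives the winning mean CV score, and `best_estimator_` is the pipeline refit on all the data with the winning parameters, ready for prediction. A minimal end-to-end sketch with synthetic data; `Ridge` stands in for `XGBRegressor` here so it runs without xgboost installed, and the tiny grid is illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = X @ np.array([1.0, 2.0, 3.0, 4.0]) + rng.rand(60) * 0.1

pipeline = make_pipeline(SimpleImputer(), Ridge())
# step name (lowercase class name) + double underscore + parameter name
para_grid = {'ridge__alpha': [0.1, 1.0, 10.0]}
grid = GridSearchCV(estimator=pipeline, param_grid=para_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # winning parameter combination
print(grid.best_score_)    # its mean cross-validation score
predictions = grid.best_estimator_.predict(X[:3])
```

By default `GridSearchCV` refits the best configuration on the full dataset (`refit=True`), so `best_estimator_` can be used directly on new data.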
posted on 2019-03-12 10:47 wangzhonghan