Kaggle Tutorial -- 2 -- Model Evaluation
1 Mean Absolute Error (MAE)
from sklearn.metrics import mean_absolute_error

# in-sample MAE: the model is scored on the same data it was trained on
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
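For reference (not in the original notes): MAE is just the average of the absolute differences between the true values and the predictions, MAE = mean(|y_true - y_pred|). A minimal sketch with made-up numbers, showing that the formula and sklearn agree:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([200000, 150000, 300000])   # made-up prices
y_pred = np.array([210000, 140000, 320000])   # made-up predictions
print(np.mean(np.abs(y_true - y_pred)))       # (10000 + 10000 + 20000) / 3 = 13333.33...
print(mean_absolute_error(y_true, y_pred))    # same value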
2 Validation set
The data used to evaluate a model must not be part of the training data; it should be a separate, held-out dataset, called the validation set.
3 Split the original data into training data and validation data (train_test_split(X, y, random_state=0))
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# split features and target into a training part and a validation part
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# train on the training part only
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)
# score on the held-out validation part
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
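A side note not in the original: when no split size is given, train_test_split holds out 25% of the rows for validation by default; the proportion can be set explicitly with test_size (the 0.2 below is just an example value):

# hold out 20% of the rows for validation; random_state keeps the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=0)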
4 Features and target: X holds the feature columns (the model's inputs), y holds the target column (the value to be predicted).
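The notes don't show how X and y are built; a minimal sketch assuming the Melbourne housing CSV used in the tutorial (the file name and the feature columns below are assumptions):

import pandas as pd

melbourne_data = pd.read_csv('melb_data.csv')    # assumed file name
y = melbourne_data.Price                         # target: the column to predict
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']   # assumed feature columns
X = melbourne_data[melbourne_features]           # features: the model's inputs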
5 The head() method exists only on pandas objects (DataFrame / Series); a numpy.ndarray does not have it
train_X, val_X, train_y, val_y = train_test_split(X,y,random_state = 1)
val_predictions = iowa_model.predict(val_X)
To display the first 5 predictions in val_predictions and the first 5 original labels in val_y (predict() returns a numpy.ndarray, so a slice is used; val_y is a pandas Series, so head() works):
print(val_predictions[:5])
print(val_y.head())
6 Overfitting and underfitting
Overfitting: the model reaches high accuracy on the training set but low accuracy on new, unseen data.
Underfitting: the model fails to capture the patterns in the training data, so accuracy is low on both the training set and the test set.
A decision tree with too many levels overfits, and one with too few levels underfits; we want to find the depth that strikes a balance between the two.
7 The max_leaf_nodes parameter of DecisionTreeRegressor is used to find the balance point between overfitting and underfitting
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
8 Two ways to write the loop that searches for the best model parameter:
The straightforward way:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
best_size = 5
min_mae = 99999
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    if mae < min_mae:
        min_mae = mae
        best_size = max_leaf_nodes
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = best_size
Dict comprehension (I haven't fully figured this one out yet):
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
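My own gloss, not from the original notes: the comprehension builds a dict that maps each candidate leaf size to its validation MAE, and min(scores, key=scores.get) returns the key whose value is smallest, i.e. the leaf size with the lowest MAE. Written as an ordinary loop it would look roughly like this:

scores = {}
for leaf_size in candidate_max_leaf_nodes:
    scores[leaf_size] = get_mae(leaf_size, train_X, val_X, train_y, val_y)
best_tree_size = min(scores, key=scores.get)   # key with the smallest MAE value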
9 After finding the best parameter, refit the model on all of the data (X and y) rather than only the training split train_X
# Fit the model with best_tree_size. Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
# fit the final model
final_model.fit(X, y)
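As a quick sanity check (hypothetical, not part of the original notes), the refitted model can predict on any rows that have the same feature columns, for example the first few rows of X:

print(final_model.predict(X.head()))   # predictions for the first 5 rows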
posted on 2019-02-26 15:10 by wangzhonghan