Kaggle Tutorial -- 2 -- Model Evaluation
1 Mean Absolute Error (MAE)
from sklearn.metrics import mean_absolute_error

# in-sample MAE: the model is scored on the same data it was trained on
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
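For reference (not in the original notes): MAE is just the average of the absolute differences between the true values and the predictions, MAE = mean(|y_true - y_pred|). A minimal sketch with made-up numbers, showing that the formula and sklearn agree:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([200000, 150000, 300000])   # made-up prices
y_pred = np.array([210000, 140000, 320000])   # made-up predictions
print(np.mean(np.abs(y_true - y_pred)))       # (10000 + 10000 + 20000) / 3 = 13333.33...
print(mean_absolute_error(y_true, y_pred))    # same value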
2 Validation set
The data used to evaluate a model must not be part of the training data; it should be a separate, held-out dataset, called the validation set.
3 Split the original data into training data and validation data (train_test_split(X, y, random_state=0))
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# split features and target into a training part and a validation part
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# train on the training part only
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)
# score on the held-out validation part
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
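A side note not in the original: when no split size is given, train_test_split holds out 25% of the rows for validation by default; the proportion can be set explicitly with test_size (the 0.2 below is just an example value):

# hold out 20% of the rows for validation; random_state keeps the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=0)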
4 Features and target: X holds the feature columns (the model's inputs), y holds the target column (the value to be predicted).
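The notes don't show how X and y are built; a minimal sketch assuming the Melbourne housing CSV used in the tutorial (the file name and the feature columns below are assumptions):

import pandas as pd

melbourne_data = pd.read_csv('melb_data.csv')    # assumed file name
y = melbourne_data.Price                         # target: the column to predict
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']   # assumed feature columns
X = melbourne_data[melbourne_features]           # features: the model's inputs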
5 The head() method exists only on pandas objects (DataFrame / Series); a numpy.ndarray does not have it
train_X, val_X, train_y, val_y = train_test_split(X,y,random_state = 1)
val_predictions = iowa_model.predict(val_X)
To display the first 5 predictions in val_predictions and the first 5 original labels in val_y (predict() returns a numpy.ndarray, so a slice is used; val_y is a pandas Series, so head() works):
print(val_predictions[:5])
print(val_y.head())
6 Overfitting and underfitting
Overfitting: the model reaches high accuracy on the training set but low accuracy on new, unseen data.
Underfitting: the model fails to capture the patterns in the training data, so accuracy is low on both the training set and the test set.
A decision tree with too many levels overfits, and one with too few levels underfits; we want to find the depth that strikes a balance between the two.
7 The max_leaf_nodes parameter of DecisionTreeRegressor is used to find the balance point between overfitting and underfitting
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" %(max_leaf_nodes, my_mae))
8 Two ways to write the loop that searches for the best model parameter:
The straightforward way:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
best_size = 5
min_mae = 99999
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    if mae < min_mae:
        min_mae = mae
        best_size = max_leaf_nodes
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = best_size
Dict comprehension (I haven't fully figured this one out yet):
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
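My own gloss, not from the original notes: the comprehension builds a dict that maps each candidate leaf size to its validation MAE, and min(scores, key=scores.get) returns the key whose value is smallest, i.e. the leaf size with the lowest MAE. Written as an ordinary loop it would look roughly like this:

scores = {}
for leaf_size in candidate_max_leaf_nodes:
    scores[leaf_size] = get_mae(leaf_size, train_X, val_X, train_y, val_y)
best_tree_size = min(scores, key=scores.get)   # key with the smallest MAE value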
9 After finding the best parameter, refit the model on all of the data (X and y) rather than only the training split train_X
# Fit the model with best_tree_size. Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
# fit the final model
final_model.fit(X, y)
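As a quick sanity check (hypothetical, not part of the original notes), the refitted model can predict on any rows that have the same feature columns, for example the first few rows of X:

print(final_model.predict(X.head()))   # predictions for the first 5 rows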
posted on 2019-02-26 15:10 by wangzhonghan