Kaggle Tutorial -- 2 -- Model Validation

1 Mean Absolute Error (MAE)

from sklearn.metrics import mean_absolute_error

# "In-sample" MAE: the model predicts on the same data it was trained on
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
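
MAE is simply the average size of the errors: MAE = mean(|actual − predicted|), i.e. "on average, the predictions are off by about this much". Note that the snippet above measures the error on the training data itself; the next two points show why that is misleading and how to avoid it.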

 

2 Validation set

The data used to measure the model must not be part of the training data; it should be a separate dataset, called the validation set.

 

3 Split the original data into training data and validation data (train_test_split(X, y, random_state=0))

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Split both the features and the target into a training part and a validation part
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Fit the model on the training data only
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

# Measure the error on the held-out validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

 

4 Features and target
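
As a quick reminder of how X and y are built (a minimal sketch; the file name and column names below follow the melb_data.csv used in the course and are assumptions here):

import pandas as pd

# Load the data (assumed file name from the course)
melbourne_data = pd.read_csv('melb_data.csv')

# Target: the single column we want to predict, conventionally called y
y = melbourne_data.Price

# Features: the columns the model uses to predict the target, conventionally called X
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]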

 

5 The head() method only exists on pandas objects; a numpy.ndarray does not have it

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

val_predictions = iowa_model.predict(val_X)

To show the first 5 predictions (val_predictions) alongside the first 5 actual values (val_y):

print(val_predictions[:5])   # numpy array: use slicing
print(val_y.head())          # pandas Series: head() is available
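
If you do want head() on the predictions themselves, one option (a small sketch) is to wrap the ndarray in a pandas Series first:

import pandas as pd

# A Series is a pandas object, so head() becomes available
print(pd.Series(val_predictions).head())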

 

6 Overfitting and underfitting

Overfitting: the model scores very well on the training data but poorly on new, unseen data.

Underfitting: the model fails to capture the patterns in the training data, so it scores poorly on both the training data and the test data.

A decision tree with too many levels overfits, and one with too few levels underfits; we want to find the depth that balances the two.
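
A quick way to see both effects in code (a minimal sketch, assuming the train_X/val_X/train_y/val_y split from point 3):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# 2 = a very shallow tree, None = let the tree grow as deep as it can
for max_depth in [2, None]:
    model = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    # Overfitting shows up as training MAE far below validation MAE (the deep tree);
    # underfitting shows up as both MAEs being high (the shallow tree)
    print(f"max_depth={max_depth}: train MAE={train_mae:.0f}, val MAE={val_mae:.0f}")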

 

7 The max_leaf_nodes parameter of DecisionTreeRegressor is used to find that balance between overfitting and underfitting

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to max_leaf_nodes and return its MAE on the validation set
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Compare MAE across different tree sizes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))

 

8 Two ways to write the loop that finds the best parameter value:

The plain-loop way:

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Loop to find the ideal tree size from candidate_max_leaf_nodes
best_size = candidate_max_leaf_nodes[0]
min_mae = float('inf')   # safer than a hard-coded 99999, which could be smaller than a real MAE
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    if mae < min_mae:
        min_mae = mae
        best_size = max_leaf_nodes

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = best_size

The dict-comprehension way (it builds a dict mapping each candidate leaf size to its MAE, then picks the key with the smallest value):

# scores maps each leaf size to the validation MAE it produces
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
# min() with key=scores.get compares the MAE values but returns the matching key (the leaf size)
best_tree_size = min(scores, key=scores.get)
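
For reference, the comprehension is just a compact form of this explicit loop:

# Build the same scores dict step by step
scores = {}
for leaf_size in candidate_max_leaf_nodes:
    scores[leaf_size] = get_mae(leaf_size, train_X, val_X, train_y, val_y)
# min() iterates over the keys and compares them by their MAE values
best_tree_size = min(scores, key=scores.get)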

 

9 After finding the best parameter, refit the model on all of the data (X, y) rather than only on the training split (train_X)

# Fit the model with best_tree_size. Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# fit the final model
final_model.fit(X, y)
