# Kaggle Competition Primer (3): Handling Overfitting and Underfitting in Python to Get the Best Model

## 1. What Are Overfitting and Underfitting?

In this figure the x-axis is tree depth, which you can read as how closely the model fits the training set. The blue curve is the training-set error (mean absolute error), and the red curve is the validation-set error. The validation error reaches its minimum at the grey vertical line in the middle. Since the goal of fitting a model is to perform well on new data, not merely on the training set, the point where the red (validation) curve bottoms out marks our best model. As the model fits the training set ever more closely, the training error keeps shrinking, while the validation error first falls and then rises: at first the model underfits, and beyond a certain point it overfits, which matches intuition. So how do we write code to pick the model at the validation minimum?
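The curves described above are easy to reproduce. Here is a minimal sketch (not from the original post, using synthetic data in place of a real dataset): on noisy data, training error falls monotonically as tree depth grows, while validation error typically falls and then rises again, tracing the U-shape in the figure.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic data: a smooth signal plus noise, so a deep enough tree
# can memorize the noise in the training split.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

for depth in [1, 3, 6, 12, None]:  # None = grow until leaves are pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"depth={depth}: train MAE={train_mae:.3f}, val MAE={val_mae:.3f}")
```

At the deepest setting the tree fits the training split almost perfectly (train MAE near zero), yet its validation MAE is no longer the best of the candidates: that is overfitting in miniature.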

## 2. Example

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# Data Loading Code Runs At This Point
import pandas as pd

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Filter rows with missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" % (max_leaf_nodes, my_mae))
```

```
Max leaf nodes: 5            Mean Absolute Error:  347380
Max leaf nodes: 50           Mean Absolute Error:  258171
Max leaf nodes: 500          Mean Absolute Error:  243495
Max leaf nodes: 5000         Mean Absolute Error:  254983
```

Of the four candidates, `max_leaf_nodes=500` gives the lowest validation MAE, so it is the best choice here: fewer leaves underfit, and more leaves overfit.
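Rather than reading the winner off the printout, the selection can be automated. Here is a minimal sketch (not from the original post; it uses stand-in synthetic data so it runs on its own — with the Melbourne data, reuse the `get_mae` function and the `train_X`/`val_X`/`train_y`/`val_y` split from the example above): collect the validation MAE for each candidate, keep the `max_leaf_nodes` with the smallest one, then refit the chosen model on all the data, as the original Kaggle tutorial does once the hyperparameter is fixed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Stand-in data so the sketch is self-contained.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 2))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=500)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Score every candidate on the validation split, keep the best.
candidate_sizes = [5, 50, 500, 5000]
scores = {size: get_mae(size, train_X, val_X, train_y, val_y)
          for size in candidate_sizes}
best_tree_size = min(scores, key=scores.get)  # smallest validation MAE wins
print("best max_leaf_nodes:", best_tree_size)

# Refit the chosen model on all the data now that the hyperparameter is fixed.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(np.concatenate([train_X, val_X]), np.concatenate([train_y, val_y]))
```

The refit-on-everything step matters: once `max_leaf_nodes` is chosen, the validation rows are no longer needed for model selection, so the final model can learn from them too.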

posted @ 2020-04-05 15:44 Geeksongs