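Cross-Validation

Why cross-validate

Once a model is built, the first task is to evaluate how good it is, and cross-validation plays a crucial role in that evaluation.
In its simplest form, the dataset is randomly split into a training set and a test set. Scoring the model on data it never saw during training gives an effective estimate of its generalization ability.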

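How to cross-validate

Import train_test_split from sklearn.model_selection, then split the data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Here X holds all the features of the dataset and y holds all the labels. test_size=0.3 holds out 30% of the samples for testing, and random_state=42 makes the shuffle reproducible.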
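To make the split concrete, here is a minimal sketch that applies the same call to the Iris dataset and checks the resulting sizes. The choice of Iris here is only for illustration; any feature matrix and label vector would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris has 150 samples with 4 features each.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 105 samples go to training and 45 to testing.
print(X_train.shape, X_test.shape)
```

Running the split again with the same random_state gives the same partition, which keeps scores comparable between runs.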

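Applying cross-validation

Cross-validation applied to a decision tree (classification)
A classification model is evaluated by how often its predictions match the actual labels, which I understand as accuracy. For example, if I predict labels for 10 samples and 8 of the predicted labels match the actual labels, the accuracy is 0.8.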
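The 8-out-of-10 example can be checked directly. In this small sketch the label values are made up for illustration; the manual count and sklearn's accuracy_score should agree.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, made up for illustration: 10 samples, 8 predicted correctly.
y_true = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 1, 0, 2, 2, 1, 0, 1, 2]  # the last two predictions are wrong

# Accuracy by hand: matching predictions divided by the number of samples.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The same quantity computed by scikit-learn.
acc = accuracy_score(y_true, y_pred)

print(manual, acc)
```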
Using the Iris dataset, we can construct a tree as follows:

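from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics

# Load the Iris features and labels
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 30% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train on the training set, then predict the held-out samples
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Accuracy on the test set, printed as a percentage
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred)))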

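Cross-validation applied to linear regression (regression)
A regression model is evaluated by the error between predicted and actual values, for example the mean squared error, together with how much of the variance in the target the model explains (the R² score).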
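As a small worked example of these two metrics, the sketch below computes the mean squared error and the R² score by hand and compares them with sklearn.metrics. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, made up for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 8.5])

# Mean squared error: the average of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))  # both 0.375
print(r2_manual, r2_score(y_true, y_pred))             # both 0.925
```

A lower mean squared error is better, while an R² close to 1 means the predictions account for most of the spread in the actual values.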

from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Hold out 30% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.3, random_state=42)

# Create linear regression object
reg = linear_model.LinearRegression()

# Train the model using the training set
reg.fit(X_train, y_train)

# Make predictions using the test set
pred = reg.predict(X_test)

print('Coefficients: \n', reg.coef_)
print('Mean squared error: %.2f' % mean_squared_error(y_test, pred))
# r2_score is the coefficient of determination; 1.0 is a perfect fit
print('R^2 score: %.2f' % r2_score(y_test, pred))

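k-fold CV

Suppose I randomly split the dataset into 10 equal parts. In the first round, the first part is the test set and the remaining parts form the training set, which yields a score.
In the second round, the second part is the test set and the rest form the training set, yielding another score, and so on for 10 rounds of training.
This way every sample is used for training, while each score still comes from data the model did not see in that round, so the generalization requirement is preserved. k-fold CV is built on this idea.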
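The 10-round procedure described above can be sketched with scikit-learn's KFold and cross_val_score. This is only one way to run it; the decision tree and the Iris dataset are reused from the earlier example as an assumption, not something the procedure requires.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Shuffle the 150 samples and cut them into 10 folds of 15 samples each.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Each round trains on 9 folds and scores (accuracy) on the remaining fold.
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X, y, cv=kfold)

print(scores)         # one score per round
print(scores.mean())  # the average is the usual summary
```

Averaging the 10 scores gives a steadier estimate than a single train/test split, because no one lucky or unlucky split decides the result.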

GridSearchCV

sklearn.model_selection.GridSearchCV
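The post only names this class, so the following is a hedged sketch of how it is typically used rather than the author's own setup: GridSearchCV tries every combination in a parameter grid and scores each one with k-fold cross-validation. The grid values and the choice of a decision tree on Iris are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hypothetical grid, chosen for illustration: 4 x 3 = 12 combinations.
param_grid = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_split": [2, 5, 10],
}

# cv=5 scores every combination with 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the combination with the best mean CV score
print(search.best_score_)   # that mean cross-validated accuracy
```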
$$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$

posted @ 2017-11-09 17:14 james.yj
