Linear regression models
Linear models: apply the ElasticNet model to the Boston housing data and compare it with Ridge regression and Lasso, tuning each method's performance as close to optimal as possible.
The Boston housing dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
import mglearn
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
import numpy as np
boston = load_boston()
print("Data shape: {}".format(boston.data.shape))
print("boston.keys():{}".format(boston.keys()))
print("feature_names:{}".format(boston.feature_names))
Data shape: (506, 13)
boston.keys():dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
feature_names:['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
This is a dataset with 506 data points and 13 features.
The Ridge regression model
Ridge regression is also a linear model for regression, so its prediction formula is the same as for ordinary least squares. In ridge regression, however, the coefficients (w) are chosen not only to predict well on the training data but also to satisfy an additional constraint: we want the coefficients to be as small as possible. In other words, all entries of w should be close to 0. Intuitively, this means each feature should have as little effect on the output as possible (i.e., a small slope) while still predicting well. This kind of constraint is an example of regularization, which means explicitly restricting a model to avoid overfitting. The particular kind used by ridge regression is known as L2 regularization.
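For reference, ridge regression minimizes the usual sum of squared errors plus an L2 penalty on the coefficients, with alpha controlling the strength of the penalty:

$$\min_{w,\,b}\;\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i - b\bigr)^2 \;+\; \alpha\,\lVert w\rVert_2^2$$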
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ridge=Ridge().fit(X_train,y_train)
ridge10=Ridge(alpha=10).fit(X_train,y_train)
ridge01=Ridge(alpha=0.1).fit(X_train,y_train)
print("alpha=1.0:")
print("Training set score: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set score: {:.2f}\n".format(ridge.score(X_test, y_test)))
print("alpha=10.0:")
print("Training set score: {:.2f}".format(ridge10.score(X_train, y_train)))
print("Test set score: {:.2f}\n".format(ridge10.score(X_test, y_test)))
print("alpha=0.10:")
print("Training set score: {:.2f}".format(ridge01.score(X_train, y_train)))
print("Test set score: {:.2f}\n".format(ridge01.score(X_test, y_test)))
plt.plot(ridge.coef_, 's', label="Ridge alpha=1")
plt.plot(ridge10.coef_, '^', label="Ridge alpha=10")
plt.plot(ridge01.coef_, 'v', label="Ridge alpha=0.1")
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.hlines(0, 0, len(ridge.coef_))
plt.ylim(-25, 25)
plt.legend()
alpha=1.0:
Training set score: 0.89
Test set score: 0.75
alpha=10.0:
Training set score: 0.79
Test set score: 0.64
alpha=0.10:
Training set score: 0.93
Test set score: 0.77
[Figure: Ridge coefficient magnitude vs. coefficient index for alpha = 1, 10, and 0.1]
- 1. Plain linear regression overfits this data. Ridge (L2 regularization) is a more restricted model, so it is less likely to overfit. A less complex model means worse performance on the training set but better generalization, and generalization is what we care about.
- 2. The Ridge model makes a trade-off between the simplicity of the model (coefficients close to 0) and its performance on the training set. How much weight is given to each is set by the user through the alpha parameter (default alpha=1.0).
- 3. The figure above shows that the smaller alpha is, the weaker the regularization and the less the coefficients are restricted: training-set performance is better, but the model overfits. The larger alpha is, the closer the coefficients are to 0; training-set performance is sacrificed, but generalization may improve.
Choosing a suitable alpha to get performance as close to optimal as possible
for i in range(1, 100, 1):
    alpha = i / 10
    ridge_test = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha=:" + str(alpha))
    print("Training set score: {:.2f}".format(ridge_test.score(X_train, y_train)))
    print("Test set score: {:.2f}\n".format(ridge_test.score(X_test, y_test)))
alpha=:0.1
Training set score: 0.93
Test set score: 0.77
alpha=:0.2
Training set score: 0.92
Test set score: 0.77
alpha=:0.3
Training set score: 0.91
Test set score: 0.77
alpha=:0.4
Training set score: 0.91
Test set score: 0.77
alpha=:0.5
Training set score: 0.90
Test set score: 0.77
alpha=:0.6
Training set score: 0.90
Test set score: 0.76
alpha=:0.7
Training set score: 0.90
Test set score: 0.76
alpha=:0.8
Training set score: 0.89
Test set score: 0.76
alpha=:0.9
Training set score: 0.89
Test set score: 0.76
alpha=:1.0
Training set score: 0.89
Test set score: 0.75
alpha=:1.1
Training set score: 0.88
Test set score: 0.75
alpha=:1.2
Training set score: 0.88
Test set score: 0.75
alpha=:1.3
Training set score: 0.88
Test set score: 0.74
alpha=:1.4
Training set score: 0.87
Test set score: 0.74
alpha=:1.5
Training set score: 0.87
Test set score: 0.74
alpha=:1.6
Training set score: 0.87
Test set score: 0.74
alpha=:1.7
Training set score: 0.87
Test set score: 0.74
.........
alpha=:9.1
Training set score: 0.79
Test set score: 0.64
alpha=:9.2
Training set score: 0.79
Test set score: 0.64
alpha=:9.3
Training set score: 0.79
Test set score: 0.64
alpha=:9.4
Training set score: 0.79
Test set score: 0.64
alpha=:9.5
Training set score: 0.79
Test set score: 0.64
alpha=:9.6
Training set score: 0.79
Test set score: 0.64
alpha=:9.7
Training set score: 0.79
Test set score: 0.64
alpha=:9.8
Training set score: 0.79
Test set score: 0.64
alpha=:9.9
Training set score: 0.79
Test set score: 0.64
So far alpha=0.1 gives the best training and generalization performance; try decreasing alpha further.
for i in range(1, 10, 1):
    alpha = i / 100
    ridge_test = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha=:" + str(alpha))
    print("Training set score: {:.2f}".format(ridge_test.score(X_train, y_train)))
    print("Test set score: {:.2f}\n".format(ridge_test.score(X_test, y_test)))
alpha=:0.01
Training set score: 0.77
Test set score: 0.64
alpha=:0.02
Training set score: 0.77
Test set score: 0.64
alpha=:0.03
Training set score: 0.77
Test set score: 0.64
alpha=:0.04
Training set score: 0.77
Test set score: 0.63
alpha=:0.05
Training set score: 0.77
Test set score: 0.63
alpha=:0.06
Training set score: 0.77
Test set score: 0.63
alpha=:0.07
Training set score: 0.77
Test set score: 0.63
alpha=:0.08
Training set score: 0.77
Test set score: 0.63
alpha=:0.09
Training set score: 0.77
Test set score: 0.63
The output shows that decreasing alpha further does not help: the test score drops, so alpha=0.1 is confirmed as a good choice.
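Rather than enumerating alpha by hand, the alpha value can also be selected by cross-validation on the training set using scikit-learn's RidgeCV. A minimal sketch; the candidate grid below is an arbitrary, illustrative choice:

from sklearn.linear_model import RidgeCV

# Candidate alphas on a log grid (illustrative choice).
alphas = np.logspace(-3, 1, 50)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
print("Best alpha: {}".format(ridge_cv.alpha_))
print("Training set score: {:.2f}".format(ridge_cv.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge_cv.score(X_test, y_test)))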
The Lasso model
Besides Ridge, there is another regularized linear regression model called Lasso. As with ridge regression, the lasso also restricts the coefficients to be close to 0, but in a different way, called L1 regularization. The consequence of L1 regularization is that some coefficients are exactly 0 when using the lasso, which means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly 0 makes the model easier to interpret and reveals its most important features.
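For comparison with ridge, the lasso minimizes the squared-error loss plus an L1 penalty on the coefficients (scikit-learn's Lasso additionally scales the error term by 1/(2·n_samples), but the role of alpha is the same):

$$\min_{w,\,b}\;\sum_{i=1}^{n}\bigl(y_i - w^{\top}x_i - b\bigr)^2 \;+\; \alpha\,\lVert w\rVert_1$$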
from sklearn.linear_model import Lasso
lasso = Lasso().fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))
Training set score: 0.72
Test set score: 0.55
Number of features used: 11
The model underfits somewhat and uses only some of the features, so we optimize it by decreasing alpha and increasing max_iter (the maximum number of iterations).
lasso0001 = Lasso(alpha=0.0001,max_iter=100000).fit(X_train, y_train)
print("Training set score: {:.2f}".format(lasso0001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso0001.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso0001.coef_ != 0)))
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
After this bit of parameter tuning, performance improves.
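As with Ridge, alpha for the lasso can also be chosen by cross-validation instead of by hand. A minimal sketch using LassoCV; the alpha grid, cv, and max_iter values below are illustrative choices:

from sklearn.linear_model import LassoCV

lasso_cv = LassoCV(alphas=np.logspace(-5, 0, 50), max_iter=100000, cv=5).fit(X_train, y_train)
print("Best alpha: {}".format(lasso_cv.alpha_))
print("Training set score: {:.2f}".format(lasso_cv.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso_cv.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(lasso_cv.coef_ != 0)))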
In practice, ridge regression is usually the first choice between these two models. However, if you have a large number of features and expect only a few of them to be important, Lasso might be the better choice. Likewise, if you want a model that is easy to interpret, Lasso will provide a more understandable model, since it selects only a subset of the input features. scikit-learn also provides the ElasticNet class, which combines the penalties of Lasso and Ridge. In practice this combination works best, at the price of having two parameters to adjust: one for the L1 regularization and one for the L2 regularization.
The ElasticNet model
ela=ElasticNet().fit(X_train,y_train)
print("Training set score: {:.2f}".format(ela.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ela.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(ela.coef_ != 0)))
Training set score: 0.72
Test set score: 0.56
Number of features used: 11
For the elastic net regression model, the alpha parameter defaults to 1.0 and the l1_ratio parameter defaults to 0.5.
Elastic net is most useful when there are many features and some of them are correlated with each other. l1_ratio lies between 0 and 1 and controls the mix between the L1 and L2 penalties; with l1_ratio=1 the penalty is pure lasso. It is an important knob for tuning model performance.
The results above show that training and generalization performance are still worse than the Ridge model and close to the lasso, with some underfitting. Next we tune the parameters.
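For reference, the objective minimized by scikit-learn's ElasticNet is the following, which shows how alpha scales the overall penalty while l1_ratio (written ρ here) mixes the L1 and L2 parts:

$$\min_{w}\;\frac{1}{2n}\lVert y - Xw\rVert_2^2 \;+\; \alpha\,\rho\,\lVert w\rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w\rVert_2^2,\qquad \rho=\text{l1\_ratio}$$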
Tuning the alpha parameter
ela10=ElasticNet(alpha=10,max_iter=100000,l1_ratio=0.5).fit(X_train,y_train)
ela010=ElasticNet(alpha=0.1,max_iter=100000,l1_ratio=0.5).fit(X_train,y_train)
ela001=ElasticNet(alpha=0.01,max_iter=100000,l1_ratio=0.5).fit(X_train,y_train)
print("alpha=10:")
print("Training set score: {:.2f}".format(ela10.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ela10.score(X_test, y_test)))
print("Number of features used: {}\n".format(np.sum(ela10.coef_ != 0)))
print("alpha=0.1:")
print("Training set score: {:.2f}".format(ela010.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ela010.score(X_test, y_test)))
print("Number of features used: {}\n".format(np.sum(ela010.coef_ != 0)))
print("alpha=0.01:")
print("Training set score: {:.2f}".format(ela001.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ela001.score(X_test, y_test)))
print("Number of features used: {}\n".format(np.sum(ela001.coef_ != 0)))
alpha=10:
Training set score: 0.58
Test set score: 0.42
Number of features used: 5
alpha=0.1:
Training set score: 0.76
Test set score: 0.61
Number of features used: 12
alpha=0.01:
Training set score: 0.77
Test set score: 0.62
Number of features used: 13
Clearly, decreasing alpha (i.e., weakening the regularization) improves performance here, just as with lasso and Ridge. Next, enumerate alpha values between 0.01 and 0.1 and inspect the results.
for i in range(1, 10, 1):
    alpha = i / 100
    ela_test = ElasticNet(alpha=alpha, max_iter=100000, l1_ratio=0.5).fit(X_train, y_train)
    print("alpha=:" + str(alpha))
    print("Training set score: {:.2f}".format(ela_test.score(X_train, y_train)))
    print("Test set score: {:.2f}".format(ela_test.score(X_test, y_test)))
    print("Number of features used: {}\n".format(np.sum(ela_test.coef_ != 0)))
alpha=:0.01
Training set score: 0.77
Test set score: 0.62
Number of features used: 13
alpha=:0.02
Training set score: 0.76
Test set score: 0.62
Number of features used: 13
alpha=:0.03
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.04
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.05
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.06
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.07
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.08
Training set score: 0.76
Test set score: 0.61
Number of features used: 13
alpha=:0.09
Training set score: 0.76
Test set score: 0.61
Number of features used: 12
for i in range(1, 10, 1):
    alpha = i / 10000
    ela_test = ElasticNet(alpha=alpha, max_iter=100000, l1_ratio=0.5).fit(X_train, y_train)
    print("alpha=:" + str(alpha))
    print("Training set score: {:.2f}".format(ela_test.score(X_train, y_train)))
    print("Test set score: {:.2f}".format(ela_test.score(X_test, y_test)))
    print("Number of features used: {}\n".format(np.sum(ela_test.coef_ != 0)))
alpha=:0.0001
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
alpha=:0.0002
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0003
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0004
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0005
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0006
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0007
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0008
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=:0.0009
Training set score: 0.77
Test set score: 0.63
Number of features used: 13
alpha=0.0001 gives relatively good performance.
Tuning the penalty mixing ratio (l1_ratio)
for i in range(1, 10):
    ratio = i / 10
    ela_test = ElasticNet(alpha=0.0001, max_iter=100000, l1_ratio=ratio).fit(X_train, y_train)
    print("l1_ratio=:" + str(ratio))
    print("Training set score: {:.2f}".format(ela_test.score(X_train, y_train)))
    print("Test set score: {:.2f}".format(ela_test.score(X_test, y_test)))
    print("Number of features used: {}\n".format(np.sum(ela_test.coef_ != 0)))
l1_ratio=:0.1
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.2
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.3
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.4
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.5
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.6
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.7
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.8
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
l1_ratio=:0.9
Training set score: 0.77
Test set score: 0.64
Number of features used: 13
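The sweep above suggests that, at alpha=0.0001, l1_ratio has little effect on this data. Instead of tuning alpha and l1_ratio one at a time, the two can also be searched jointly with cross-validation on the training set. A minimal sketch using ElasticNetCV; the grids below are illustrative choices:

from sklearn.linear_model import ElasticNetCV

ela_cv = ElasticNetCV(alphas=np.logspace(-5, 0, 30),
                      l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
                      max_iter=100000, cv=5).fit(X_train, y_train)
print("Best alpha: {}".format(ela_cv.alpha_))
print("Best l1_ratio: {}".format(ela_cv.l1_ratio_))
print("Training set score: {:.2f}".format(ela_cv.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ela_cv.score(X_test, y_test)))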
Applying random forests and gradient boosted regression trees to classification on the wine dataset
The wine dataset
from sklearn.datasets import load_wine
wine = load_wine()
print("Data shape: {}".format(wine.data.shape))
print("boston.keys():{}".format(wine.keys()))
print("feature_names:{}".format(wine.feature_names))
print("target_names:{}".format(wine.target_names))
Data shape: (178, 13)
wine.keys():dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
feature_names:['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
target_names:['class_0' 'class_1' 'class_2']
This is a dataset with 178 data points and 13 features; the labels are the classes 0, 1, and 2.
The random forest model
A main drawback of decision trees is that they tend to overfit the training data. Random forests are one way to address this problem. A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown rigorously.
To implement this strategy, we need to build many decision trees. Each tree should do an acceptable job of predicting the target and should also be different from the other trees. Random forests get their name from injecting randomness into the tree-building process to ensure each tree is different. There are two ways in which the trees in a random forest are randomized: by selecting the data points used to build each tree, and by selecting the features considered in each split test. Let's go into this process in more detail.
To build a random forest, you need to decide on the number of trees to build (the n_estimators parameter of RandomForestRegressor or RandomForestClassifier), say 10. These trees will be built completely independently from each other, and the algorithm makes different random choices for each tree to make sure they are distinct. To build a tree, we first take what is called a bootstrap sample of the data: from our n_samples data points, we repeatedly draw a sample at random with replacement (meaning the same sample can be picked multiple times), n_samples times. This creates a dataset that is as big as the original one, but some data points will be missing from it (roughly one third) and some will be repeated.
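To make the bootstrap idea concrete, here is a minimal sketch of drawing one bootstrap sample with numpy (just for illustration; RandomForestClassifier does this internally for every tree):

rng = np.random.RandomState(0)
n_samples = wine.data.shape[0]
# Draw n_samples indices with replacement: some points repeat, some are left out.
bootstrap_idx = rng.randint(0, n_samples, size=n_samples)
X_boot, y_boot = wine.data[bootstrap_idx], wine.target[bootstrap_idx]
out_of_bag = np.setdiff1d(np.arange(n_samples), bootstrap_idx)
print("Fraction of points left out of this bootstrap sample: {:.2f}".format(len(out_of_bag) / n_samples))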
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(forest.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(forest.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.956
The forest keeps the decision tree's excellent performance on the training set while also showing solid generalization, which is where the strength of this algorithm lies. Next we try tuning parameters to improve generalization further. Decision trees have max_depth (maximum depth); random forests additionally have n_estimators (the number of trees to build) and max_features (the number of features considered in each split test).
Notes (a cross-validated grid search over these parameters is sketched right after this list; below we also tune them one at a time):
- 1. max_depth (default None): limits the maximum depth of each tree and so reduces the complexity of the random forest model, which may improve generalization; reducing max_depth is a form of pre-pruning.
- 2. max_features (default "auto"): determines how random each tree is; a smaller max_features reduces overfitting and can improve generalization. In general, it is a good rule of thumb to use the defaults: max_features=sqrt(n_features) for classification and max_features=n_features for regression. Adjusting max_features or max_leaf_nodes can sometimes improve performance, and it can also drastically reduce the time and space needed for training and prediction.
- 3. n_estimators (default 100 in current scikit-learn, 10 in older versions): larger is always better. Averaging more trees reduces overfitting and yields a more robust ensemble, but the returns are diminishing, and more trees need more memory and longer training time. A common rule of thumb is to build "as many as you have time/memory for".
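As mentioned in the notes, the parameters can also be tuned jointly with a cross-validated grid search instead of one at a time. A minimal sketch using GridSearchCV; the parameter grid below is an illustrative choice, and only the training set is used for the search:

from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [5, 10, 50, 100],
              "max_depth": [3, 5, None],
              "max_features": [2, 4, "sqrt"]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {:.3f}".format(grid.best_score_))
print("Test set score: {:.3f}".format(grid.score(X_test, y_test)))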
max_depth
from matplotlib.pyplot import MultipleLocator
x=[]
y=[]
z=[]
for i in range(1, 11, 1):
    forest_depth = RandomForestClassifier(max_depth=i, random_state=0)
    forest_depth.fit(X_train, y_train)
    x.append(i)
    y.append(forest_depth.score(X_train, y_train))
    z.append(forest_depth.score(X_test, y_test))
fig = plt.figure(figsize=(15, 10))
x_major_locator = MultipleLocator(0.5)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(x, y, "r")
plt.plot(x, z, "g")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.legend(['train_score', 'test_score'], loc='best')
[Figure: training and test accuracy vs. max_depth]
For depths 1 through 4, both training and test accuracy keep improving; beyond 4 the scores barely change. To keep computational cost low, we pick max_depth=5 as a good value.
n_estimators
from matplotlib.pyplot import MultipleLocator
x=[]
y=[]
z=[]
for i in range(5, 15, 1):
    forest_est = RandomForestClassifier(max_depth=5, n_estimators=i, random_state=0)
    forest_est.fit(X_train, y_train)
    x.append(i)
    y.append(forest_est.score(X_train, y_train))
    z.append(forest_est.score(X_test, y_test))
fig = plt.figure(figsize=(15, 10))
x_major_locator = MultipleLocator(0.5)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(x, y, "r")
plt.plot(x, z, "g")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.legend(['train_score', 'test_score'], loc='best')
[Figure: training and test accuracy vs. n_estimators]
n_estimators=7 is a good value.
max_features
from matplotlib.pyplot import MultipleLocator
x=[]
y=[]
z=[]
for i in range(1, 14, 1):
    forest_feat = RandomForestClassifier(max_depth=5, n_estimators=7, max_features=i, random_state=0)
    forest_feat.fit(X_train, y_train)
    x.append(i)
    y.append(forest_feat.score(X_train, y_train))
    z.append(forest_feat.score(X_test, y_test))
fig = plt.figure(figsize=(15, 10))
x_major_locator = MultipleLocator(0.5)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(x, y, "r")
plt.plot(x, z, "g")
plt.xlabel("max_features")
plt.ylabel("Accuracy")
plt.legend(['train_score', 'test_score'], loc='best')
[Figure: training and test accuracy vs. max_features]
max_features=4 is a good value; test accuracy even reaches 100%.
Strengths, weaknesses, and parameters
Random forests for regression and classification are currently among the most widely used machine learning methods. They are very powerful, often work well without heavy parameter tuning, and do not require scaling of the data.
Essentially, random forests share all the benefits of decision trees while making up for some of their deficiencies. One reason to still use decision trees is if you need a compact representation of the decision-making process. It is basically impossible to interpret tens or hundreds of trees in detail, and trees in random forests tend to be deeper than single decision trees (because of the use of feature subsets). Therefore, if you need to summarize the prediction process in a visual way to non-experts, a single decision tree might be a better choice. While building random forests on large datasets can be somewhat time consuming, it is easy to parallelize over multiple CPU cores within one computer. If you are using a multi-core processor (as nearly all modern computers are), you can use the n_jobs parameter to set the number of cores to use. Using more CPU cores gives a roughly linear speedup (with two cores, training the forest is about twice as fast), but setting n_jobs larger than the number of cores will not help. You can set n_jobs=-1 to use all the cores in your computer.
Keep in mind that random forests are, by their nature, random, and setting different random states (or not setting random_state at all) can drastically change the model that is built. The more trees there are in the forest, the more robust it is to the choice of random state. If you want reproducible results, it is important to fix random_state.
Random forests tend not to perform well on very high dimensional, sparse data such as text; for such data, linear models may be more appropriate. Random forests usually work well even on very large datasets, and training can easily be parallelized over many CPU cores within a powerful computer. However, random forests require more memory and are slower to train and to predict than linear models. If time and memory are important in an application, it may make sense to use a linear model instead.
The important parameters to adjust are n_estimators and max_features, and possibly pre-pruning options like max_depth. For n_estimators, larger is always better: averaging more trees reduces overfitting and yields a more robust ensemble, but the returns are diminishing, and more trees need more memory and longer training time. A common rule of thumb is to build "as many as you have time/memory for".
As described earlier, max_features determines how random each tree is, and a smaller max_features reduces overfitting. In general, a good rule of thumb is to use the defaults: max_features=sqrt(n_features) for classification and max_features=n_features for regression. Adjusting max_features or max_leaf_nodes can sometimes improve performance, and it can also drastically reduce the space and time needed for training and prediction.
Gradient boosted regression trees
Gradient boosted regression trees are another ensemble method that combines multiple decision trees to create a more powerful model. Despite the "regression" in the name, these models can be used for both regression and classification. In contrast to random forests, gradient boosting builds trees in a serial manner, where each tree tries to correct the mistakes of the previous one. By default, there is no randomization in gradient boosted regression trees; instead, strong pre-pruning is used. Gradient boosted trees usually use very shallow trees (of depth 1 to 5), which makes the model smaller in memory and makes predictions faster.
The main idea behind gradient boosting is to combine many simple models (known in this context as weak learners), such as shallow trees. Each tree can provide good predictions only on part of the data, so more and more trees are added to iteratively improve performance. Gradient boosted trees are frequently the winning entries in machine learning competitions and are widely used in industry. Compared to random forests, they are usually more sensitive to parameter settings, but can give higher accuracy if the parameters are set correctly.
Apart from pre-pruning and the number of trees in the ensemble, another important parameter of gradient boosting is the learning_rate, which controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Adding more trees to the ensemble via n_estimators also increases the model complexity, as the model has more chances to correct mistakes on the training set.
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)
print("Accuracy on training set: {:.3f}".format(gbrt.score(X_train, y_train)))
print("Accuracy on test set: {:.3f}".format(gbrt.score(X_test, y_test)))
Accuracy on training set: 1.000
Accuracy on test set: 0.956
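To see the "each tree corrects the previous one" idea in action, staged_predict returns the ensemble's predictions after each successive tree is added; a minimal sketch of tracking test accuracy stage by stage:

from sklearn.metrics import accuracy_score

# Test accuracy of the default model after each boosting stage.
staged_test_acc = [accuracy_score(y_test, y_pred)
                   for y_pred in gbrt.staged_predict(X_test)]
print("Accuracy after 1 tree: {:.3f}".format(staged_test_acc[0]))
print("Accuracy after all trees: {:.3f}".format(staged_test_acc[-1]))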
n_estimators
x=[]
y=[]
z=[]
for i in range(1, 11, 1):
    gbrt_test = GradientBoostingClassifier(n_estimators=i, random_state=0)
    gbrt_test.fit(X_train, y_train)
    x.append(i)
    y.append(gbrt_test.score(X_train, y_train))
    z.append(gbrt_test.score(X_test, y_test))
fig = plt.figure(figsize=(10, 5))
x_major_locator = MultipleLocator(0.5)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(x, y, "r")
plt.plot(x, z, "g")
plt.xlabel("n_estimators")
plt.ylabel("Accuracy")
plt.legend(['train_score', 'test_score'], loc='best')
[Figure: training and test accuracy vs. n_estimators]
n_estimators=2 is a good value.
learning_rate (swept from 0.1 to 1.0)
x=[]
y=[]
z=[]
for i in range(1, 11, 1):
    gbrt_test = GradientBoostingClassifier(learning_rate=i / 10, n_estimators=2, random_state=0)
    gbrt_test.fit(X_train, y_train)
    x.append(i / 10)
    y.append(gbrt_test.score(X_train, y_train))
    z.append(gbrt_test.score(X_test, y_test))
fig = plt.figure(figsize=(10, 5))
x_major_locator = MultipleLocator(0.05)
ax = plt.gca()
ax.xaxis.set_major_locator(x_major_locator)
plt.plot(x, y, "r")
plt.plot(x, z, "g")
plt.xlabel("learning_rate")
plt.ylabel("Accuracy")
plt.legend(['train_score', 'test_score'], loc='best')
[Figure: training and test accuracy vs. learning_rate]
learning_rate=0.5 is a good value.
Strengths, weaknesses, and parameters
Gradient boosted decision trees are among the most powerful and widely used models for supervised learning. Their main drawback is that they require careful tuning of the parameters and may take a long time to train. Similarly to other tree-based models, the algorithm works well without scaling the data and on datasets with a mix of binary and continuous features. As with other tree-based models, it also usually does not work well on high-dimensional sparse data.
The main parameters of gradient boosted tree models are the number of trees, n_estimators, and the learning_rate, which controls how strongly each tree corrects the mistakes of the previous trees. These two parameters are highly interconnected, as a lower learning_rate means that more trees are needed to build a model of similar complexity. In contrast to random forests, where a higher n_estimators is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may cause overfitting. A common practice is to fit n_estimators depending on the time and memory budget and then to search over different values of learning_rate; a cross-validated sketch of this is given below. Another important parameter is max_depth (or max_leaf_nodes), which reduces the complexity of each tree. max_depth is usually set very low for gradient boosted models, often no deeper than 5.
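Following that rule of thumb (fix n_estimators to the available budget, then search over learning_rate and tree depth), a minimal cross-validated sketch; the grid values below are illustrative choices:

from sklearn.model_selection import GridSearchCV

param_grid = {"learning_rate": [0.01, 0.05, 0.1, 0.5, 1.0],
              "max_depth": [1, 2, 3]}
grid = GridSearchCV(GradientBoostingClassifier(n_estimators=100, random_state=0),
                    param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters: {}".format(grid.best_params_))
print("Best cross-validation score: {:.3f}".format(grid.best_score_))
print("Test set score: {:.3f}".format(grid.score(X_test, y_test)))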