Boosting Learners: AdaBoost and GBDT

This article introduces two fundamental boosting methods in machine learning: AdaBoost and GBDT.

AdaBoost

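Introduction

Boosting is an important ensemble learning technique. It turns weak learners, whose accuracy is only slightly better than random guessing, into a strong learner with high accuracy. When building a strong learner directly is very difficult, this offers an effective new approach to designing learning algorithms. Its most successful application is the AdaBoost algorithm, proposed by Yoav Freund and Robert Schapire in 1995.

AdaBoost is short for "Adaptive Boosting". It is adaptive in the following sense: the weights of the samples misclassified by the previous base classifier are increased, the weights of the correctly classified samples are decreased, and the reweighted samples are used to train the next base classifier. A new weak classifier is added in each iteration, until either a predefined, sufficiently small error rate or a preset maximum number of iterations is reached; only then is the final strong classifier determined.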
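
To make the reweighting concrete, here is a minimal sketch of a single boosting round on five samples. It assumes the classic binary AdaBoost update (classifier weight alpha = 0.5 * ln((1 - err) / err), sample weights multiplied by exp(+alpha) or exp(-alpha) and then renormalized). The number of samples and the misclassification mask are made up for illustration.

```python
import numpy as np

# Five training samples, all starting with the same weight 1/N.
weights = np.full(5, 1 / 5)

# Hypothetical outcome of the first weak classifier: only sample 3 is wrong.
misclassified = np.array([False, False, False, True, False])

# Weighted error rate of the weak classifier (here 0.2).
error = weights[misclassified].sum()

# Classifier weight in the classic binary AdaBoost formulation (assumed here).
alpha = 0.5 * np.log((1 - error) / error)

# Raise the weight of the misclassified sample, lower the others, renormalize.
new_weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
new_weights /= new_weights.sum()

print(new_weights)
```

After this round the single misclassified sample carries about half of the total weight, so the next weak classifier is pushed to get it right.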

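Algorithm steps

(1) First, initialize the weight distribution D_1 over the training data. With N training samples, every sample starts with the same weight: w_{1i} = 1/N.
(2) Then, train the weak classifiers h_i. During training, if a sample is classified correctly by the weak classifier h_i, its weight is decreased when building the next training set; if it is misclassified, its weight is increased. The reweighted samples are used to train the next classifier, and the whole process is repeated iteratively.
(3) Finally, combine the trained weak classifiers into one strong classifier. Once the weak classifiers are trained, those with a small classification error get a larger weight, so that they play a bigger role in the final decision function, while those with a large classification error get a smaller weight and play a smaller role.
In other words, a weak classifier with a low error rate has a large weight in the final classifier, and one with a high error rate has a small weight.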
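
The three steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes binary labels in {-1, +1}, one-dimensional decision stumps as weak classifiers, and the classic update alpha = 0.5 * ln((1 - err) / err). The helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are hypothetical.

```python
import numpy as np


def stump_predict(x, threshold, polarity):
    """Predict -polarity to the left of the threshold and +polarity elsewhere."""
    return np.where(x < threshold, -polarity, polarity)


def fit_stump(x, y, w):
    """Return (threshold, polarity, weighted_error) of the best 1-D stump."""
    best = (None, 1, np.inf)
    for threshold in np.unique(x):
        for polarity in (1, -1):
            err = w[stump_predict(x, threshold, polarity) != y].sum()
            if err < best[2]:
                best = (threshold, polarity, err)
    return best


def adaboost_fit(x, y, n_rounds=10):
    """Binary AdaBoost; returns a list of (alpha, threshold, polarity)."""
    w = np.full(len(x), 1 / len(x))            # step (1): uniform weights
    ensemble = []
    for _ in range(n_rounds):
        threshold, polarity, err = fit_stump(x, y, w)   # step (2)
        err = np.clip(err, 1e-10, 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # low error -> large weight
        pred = stump_predict(x, threshold, polarity)
        w = w * np.exp(-alpha * y * pred)      # mistakes gain weight
        w /= w.sum()
        ensemble.append((alpha, threshold, polarity))
    return ensemble


def adaboost_predict(x, ensemble):
    """Step (3): weighted vote of all weak classifiers."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return np.sign(score)


# Toy data that no single stump can separate.
x = np.arange(10)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble = adaboost_fit(x, y, n_rounds=3)
accuracy = np.mean(adaboost_predict(x, ensemble) == y)
print(f"Training accuracy after 3 rounds: {accuracy:.2f}")
```

Each stump alone misclassifies at least three of the ten points, while the weighted vote of three stumps classifies all of them correctly.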

Example

The following example from scikit-learn illustrates how AdaBoost works:

import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

data, target = penguins[culmen_columns], penguins[target_column]

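We deliberately train a shallow decision tree. Because it is shallow, it is unlikely to overfit, and some training samples will even be misclassified.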

import seaborn as sns
from sklearn.tree import DecisionTreeClassifier

palette = ["tab:red", "tab:blue", "black"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data, target)

Plot the decision boundary and look at the misclassified samples:

import numpy as np

target_predicted = tree.predict(data)
misclassified_samples_idx = np.flatnonzero(target != target_predicted)
data_misclassified = data.iloc[misclassified_samples_idx]

import matplotlib.pyplot as plt
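# The original imported this class from a local module (helpers.plotting);
# scikit-learn >= 1.1 ships it in sklearn.inspection.
from sklearn.inspection import DecisionBoundaryDisplay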

DecisionBoundaryDisplay.from_estimator(
    tree, data, response_method="predict", cmap="RdBu", alpha=0.5
)

# plot the original dataset
sns.scatterplot(data=penguins, x=culmen_columns[0], y=culmen_columns[1],
                hue=target_column, palette=palette)
# plot the misclassified samples
sns.scatterplot(data=data_misclassified, x=culmen_columns[0],
                y=culmen_columns[1], label="Misclassified samples",
                marker="+", s=150, color="k")

plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left")
_ = plt.title("Decision tree predictions \nwith misclassified samples "
              "highlighted")

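We observe that some samples are misclassified by the classifier.
In scikit-learn, the sample_weight parameter can be set when calling classifier.fit(X, y, sample_weight=weights), which makes the classifier pay more attention during training to samples with higher weights. We will use this trick to create a new classifier by "discarding" all correctly classified samples and considering only the misclassified ones. Misclassified samples are therefore assigned a weight of 1 and well-classified samples a weight of 0.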

sample_weight = np.zeros_like(target, dtype=int)
sample_weight[misclassified_samples_idx] = 1

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data, target, sample_weight=sample_weight)

DecisionBoundaryDisplay.from_estimator(
    tree, data, response_method="predict", cmap="RdBu", alpha=0.5
)
sns.scatterplot(data=penguins, x=culmen_columns[0], y=culmen_columns[1],
                hue=target_column, palette=palette)
sns.scatterplot(data=data_misclassified, x=culmen_columns[0],
                y=culmen_columns[1],
                label="Previously misclassified samples",
                marker="+", s=150, color="k")

plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left")
_ = plt.title("Decision tree by changing sample weights")

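We see that the decision function has changed drastically. Qualitatively, the previously misclassified samples are now classified correctly.
The predictions of the two classifiers should therefore be weighted differently, for example according to the number of mistakes each classifier makes. Here we use the classification error to combine the two trees.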

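# Samples misclassified by the second tree (trained with the 0/1 sample weights)
newly_misclassified_samples_idx = np.flatnonzero(target != tree.predict(data))

# Weight each tree by its accuracy on the full training set
ensemble_weight = [
    (target.shape[0] - len(misclassified_samples_idx)) / target.shape[0],
    (target.shape[0] - len(newly_misclassified_samples_idx)) / target.shape[0],
]
ensemble_weight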

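Output: [0.935672514619883, 0.6929824561403509]
The first classifier has an accuracy of about 94% and the second about 69%. When predicting a class, we should therefore trust the first classifier somewhat more than the second. These accuracy values can be used to weight the predictions of each learner.
To sum up, boosting learns several classifiers, each of which focuses more or less on specific samples of the dataset. Boosting therefore differs from bagging: here we never resample the dataset, we only assign different weights to the original dataset.
Boosting needs a strategy for combining the learners:

a way to compute the weights assigned to the samples has to be defined;
a weight has to be assigned to each learner when making predictions.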
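
The second point, weighting each learner at prediction time, can be sketched as a weighted vote. The example below is only an illustration: the predictions, the third classifier and its weight of 0.60 are made up, and the first two weights are the rounded accuracies from this section.

```python
import numpy as np

classes = np.array(["Adelie", "Chinstrap", "Gentoo"])

# Hypothetical predictions of three classifiers (rows) for four samples (columns).
predictions = np.array([
    ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    ["Gentoo", "Gentoo", "Adelie", "Chinstrap"],
])
classifier_weights = np.array([0.94, 0.69, 0.60])

# votes[c, i] = total weight of the classifiers that predict class c for sample i
votes = np.array([
    ((predictions == c) * classifier_weights[:, None]).sum(axis=0)
    for c in classes
])

# The ensemble predicts the class with the largest weighted vote.
ensemble_prediction = classes[votes.argmax(axis=0)]
print(ensemble_prediction)
```

For the last two samples, the two weaker classifiers agree and together outvote the strongest one (0.69 + 0.60 > 0.94). With only two classifiers, the one with the larger weight would always win, which is one reason a boosted ensemble uses many learners.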

We now use the AdaBoost classifier implemented in scikit-learn and look at the underlying decision tree classifiers it trains.

from sklearn.ensemble import AdaBoostClassifier

base_estimator = DecisionTreeClassifier(max_depth=3, random_state=0)
# Note: scikit-learn 1.2 renamed the `base_estimator` argument to `estimator`
adaboost = AdaBoostClassifier(base_estimator=base_estimator,
                              n_estimators=3, algorithm="SAMME",
                              random_state=0)
adaboost.fit(data, target)

for boosting_round, tree in enumerate(adaboost.estimators_):
    plt.figure()
    # we convert `data` into a NumPy array to avoid a warning raised in scikit-learn
    DecisionBoundaryDisplay.from_estimator(
        tree, data.to_numpy(), response_method="predict", cmap="RdBu", alpha=0.5
    )
    sns.scatterplot(x=culmen_columns[0], y=culmen_columns[1],
                    hue=target_column, data=penguins,
                    palette=palette)
    plt.legend(bbox_to_anchor=(1.04, 0.5), loc="center left")
    _ = plt.title(f"Decision tree trained at round {boosting_round}")

print(f"Weight of each classifier: {adaboost.estimator_weights_}")
print(f"Error of each classifier: {adaboost.estimator_errors_}")

Output:

Weight of each classifier: [3.58351894 3.46901998 3.03303773]
Error of each classifier: [0.05263158 0.05864198 0.08787269]

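We see that AdaBoost has learned three different classifiers, each focusing on different samples. Looking at the weight of each learner, we see that the ensemble gives the highest weight to the first classifier. This makes sense in light of the errors: the first classifier also has the lowest classification error.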
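
The printed classifier weights can be reproduced from the printed errors. The sketch below assumes the SAMME rule for multi-class AdaBoost, weight = learning_rate * (ln((1 - err) / err) + ln(K - 1)) for K classes. The error values are copied from the output above; K = 3 (the three species in the target) and learning_rate = 1.0 (the default) are taken from the setup of the example.

```python
import numpy as np

# Errors of the three classifiers, copied from the output above.
errors = np.array([0.05263158, 0.05864198, 0.08787269])
n_classes = 3        # three species in the target
learning_rate = 1.0  # default value, not changed in the example

# SAMME classifier weight (assumed formula, see the lead-in).
weights = learning_rate * (
    np.log((1 - errors) / errors) + np.log(n_classes - 1)
)
print(weights)
```

The values agree with the printed weights to about four decimal places, which confirms the relationship described in the text: the smaller the error of a classifier, the larger its weight in the ensemble.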

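Algorithm parameters

Several important hyperparameters of the algorithm:

base_estimator: object, default=None
The base learner, which must support the sample_weight parameter. If not specified, a decision tree classifier with max_depth=1 is used. In scikit-learn 1.2 and later this parameter is named estimator.
n_estimators: int, default=50
The number of learners.
learning_rate: float, default=1.0
The learning rate. The higher it is, the larger the contribution of each tree; learning_rate has to be balanced against the number of estimators.
algorithm: {'SAMME', 'SAMME.R'}, default='SAMME.R'
AdaBoost offers two algorithms, SAMME and SAMME.R; see this blog post: https://blog.csdn.net/weixin_43298886/article/details/110927084
random_state: int, RandomState instance or None, default=None
The random seed, a parameter shared by every learner; not discussed further here.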
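
The balance between learning_rate and n_estimators can be explored with a small experiment. This is only a sketch: the synthetic dataset from make_classification and the three parameter combinations are arbitrary choices for illustration, and the resulting scores depend on the data and on the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for learning_rate, n_estimators in [(1.0, 50), (0.1, 50), (0.1, 500)]:
    # The default base learner is a decision tree with max_depth=1.
    model = AdaBoostClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=0,
    )
    model.fit(X_train, y_train)
    scores[(learning_rate, n_estimators)] = model.score(X_test, y_test)

for (learning_rate, n_estimators), score in scores.items():
    print(f"learning_rate={learning_rate}, n_estimators={n_estimators}: "
          f"test accuracy = {score:.3f}")
```

A smaller learning rate shrinks the contribution of every learner, so it usually has to be paired with more estimators; in practice both are tuned together, for example with cross-validation.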

GBDT

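Introduction

The main idea of GBDT is to fit residuals repeatedly. GBDT (Gradient Boosting Decision Tree) is a kind of boosting tree: each new tree is fitted to what the current model still gets wrong, and that target is obtained in a gradient-descent fashion.
The word "Gradient" refers to the following: the negative gradient of the loss function, evaluated at the current model, is used as an approximation of the residual in the boosting tree algorithm for regression. The negative gradient is:

$$
r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}
$$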
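
As a worked case, assume the squared-error loss written with a factor 1/2 (a common convention, not stated in the text above), and let f_{m-1} denote the model after m-1 rounds. The negative gradient is then exactly the ordinary residual:

$$
L(y_i, f(x_i)) = \frac{1}{2}\bigl(y_i - f(x_i)\bigr)^2
\quad\Longrightarrow\quad
-\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)} = y_i - f_{m-1}(x_i)
$$

For other losses the two differ. With the absolute error L = |y_i - f(x_i)|, for instance, the negative gradient is sign(y_i - f_{m-1}(x_i)), which is why the negative gradient is described as an approximation of the residual in general.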

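Algorithm steps

(1) Initialize the weak learner with the constant that minimizes the loss:

$$
f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)
$$

(2) For m = 1, 2, ..., M:
a) For each sample i = 1, 2, ..., N, compute the negative gradient, i.e. the residual:

$$
r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}
$$

b) Use the residuals from the previous step as the new target values, and use the data (x_i, r_{im}), i = 1, 2, ..., N, as the training data for the next tree. This gives the m-th regression tree, whose leaf regions are R_{jm}, j = 1, 2, ..., J, where J is the number of leaf nodes of the tree.
c) For each leaf region j = 1, 2, ..., J, compute the best fitted value:

$$
c_{jm} = \arg\min_{c} \sum_{x_i \in R_{jm}} L\bigl(y_i, f_{m-1}(x_i) + c\bigr)
$$

d) Update the strong learner:

$$
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{jm} \, I(x \in R_{jm})
$$

(3) Obtain the final learner:

$$
f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{jm} \, I(x \in R_{jm})
$$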
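
These steps can be sketched for the squared-error loss, where the negative gradient is the plain residual and the best value of a leaf is the mean residual in it, which a regression tree already computes. The sketch makes several assumptions that are not part of the steps above: a learning rate (shrinkage) of 0.1 applied to every tree, scikit-learn's DecisionTreeRegressor as the tree learner, and hypothetical helper names `gbdt_fit` and `gbdt_predict`.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbdt_fit(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for the squared-error loss (illustrative sketch)."""
    f0 = y.mean()                          # (1) best constant for squared error
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):               # (2) for m = 1, ..., M
        residuals = y - prediction         # (a) negative gradient
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)             # (b), (c) leaf value = mean residual
        prediction += learning_rate * tree.predict(X)   # (d) update the model
        trees.append(tree)
    return f0, trees


def gbdt_predict(X, f0, trees, learning_rate=0.1):
    """(3) Final model: initial constant plus the shrunken sum of all trees."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)


# Synthetic data in the spirit of the example in this article.
rng = np.random.RandomState(0)
X = rng.uniform(-1.4, 1.4, size=(100, 1))
y = X[:, 0] ** 3 - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=100)

f0, trees = gbdt_fit(X, y)
mse_constant = np.mean((y - f0) ** 2)
mse_boosted = np.mean((y - gbdt_predict(X, f0, trees)) ** 2)
print(f"Training MSE, constant model: {mse_constant:.3f}")
print(f"Training MSE, {len(trees)} trees:      {mse_boosted:.3f}")
```

The learning rate has to be the same in `gbdt_fit` and `gbdt_predict`. For a loss other than the squared error, step (c) would require recomputing the leaf values instead of reusing the means stored in the tree.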

Example

This example uses GBDT for regression.
First, generate a dataset for regression:

import pandas as pd
import numpy as np

# Create a random number generator that will be used to set the randomness
rng = np.random.RandomState(0)


def generate_data(n_samples=50):
    """Generate synthetic dataset. Returns `data_train`, `data_test`,
    `target_train`."""
    x_max, x_min = 1.4, -1.4
    len_x = x_max - x_min
    x = rng.rand(n_samples) * len_x - len_x / 2
    noise = rng.randn(n_samples) * 0.3
    y = x ** 3 - 0.5 * x ** 2 + noise

    data_train = pd.DataFrame(x, columns=["Feature"])
    data_test = pd.DataFrame(np.linspace(x_max, x_min, num=300),
                             columns=["Feature"])
    target_train = pd.Series(y, name="Target")

    return data_train, data_test, target_train


data_train, data_test, target_train = generate_data()

import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x=data_train["Feature"], y=target_train, color="black",
                alpha=0.5)
_ = plt.title("Synthetic regression dataset")

We first create a decision tree regressor. We set the depth of the tree so that the resulting learner underfits the data.

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(data_train, target_train)

target_train_predicted = tree.predict(data_train)
target_test_predicted = tree.predict(data_test)

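The term "test" here refers to data that was not used for training. It should not be confused with data from a train-test split: it was generated at evenly spaced points and is only used to visualize the predictions.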

# plot the data
sns.scatterplot(x=data_train["Feature"], y=target_train, color="black",
                alpha=0.5)
# plot the predictions
line_predictions = plt.plot(data_test["Feature"], target_test_predicted, "--")

# plot the residuals
for value, true, predicted in zip(data_train["Feature"],
                                  target_train,
                                  target_train_predicted):
    lines_residuals = plt.plot([value, value], [true, predicted], color="red")

plt.legend([line_predictions[0], lines_residuals[0]],
           ["Fitted tree", "Residuals"])
_ = plt.title("Prediction function together \nwith errors on the training set")

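Because the tree underfits the data, its accuracy on the training data is far from perfect. We can see this in the plot by looking at the difference between the predictions and the ground-truth data. These errors, called "residuals", are drawn as solid red lines.
Indeed, our initial tree is not expressive enough to handle the complexity of the data, as the residuals show. In gradient boosting, the idea is to create a second tree which, given the same data data_train, tries to predict the residuals instead of the vector target_train. We then have a tree that can predict the errors made by the initial tree.
Let us train such a tree.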

residuals = target_train - target_train_predicted

tree_residuals = DecisionTreeRegressor(max_depth=5, random_state=0)
tree_residuals.fit(data_train, residuals)

target_train_predicted_residuals = tree_residuals.predict(data_train)
target_test_predicted_residuals = tree_residuals.predict(data_test)

sns.scatterplot(x=data_train["Feature"], y=residuals, color="black", alpha=0.5)
line_predictions = plt.plot(
    data_test["Feature"], target_test_predicted_residuals, "--")

# plot the residuals of the predicted residuals
for value, true, predicted in zip(data_train["Feature"],
                                  residuals,
                                  target_train_predicted_residuals):
    lines_residuals = plt.plot([value, value], [true, predicted], color="red")

plt.legend([line_predictions[0], lines_residuals[0]],
           ["Fitted tree", "Residuals"], bbox_to_anchor=(1.05, 0.8),
           loc="upper left")
_ = plt.title("Prediction of the previous residuals")

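We see that this new tree only manages to fit some of the residuals. We will focus on a specific sample from the training set (namely, one that we know is predicted well by two successive trees) and use it to explain how the predictions of the two trees are combined. Let us first select this sample in data_train.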

sample = data_train.iloc[[-2]]
x_sample = sample['Feature'].iloc[0]
target_true = target_train.iloc[-2]
target_true_residual = residuals.iloc[-2]

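Let us plot the previous information and highlight the sample of interest. We start with the original data and the predictions of the first decision tree.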

# Plot the previous information:
#   * the dataset
#   * the predictions
#   * the residuals

sns.scatterplot(x=data_train["Feature"], y=target_train, color="black",
                alpha=0.5)
plt.plot(data_test["Feature"], target_test_predicted, "--")
for value, true, predicted in zip(data_train["Feature"],
                                  target_train,
                                  target_train_predicted):
    lines_residuals = plt.plot([value, value], [true, predicted], color="red")

# Highlight the sample of interest
plt.scatter(sample, target_true, label="Sample of interest",
            color="tab:orange", s=200)
plt.xlim([-1, 0])
plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
_ = plt.title("Tree predictions")

Now let us plot the residual information. We plot the residuals computed from the first decision tree and show the residual predictions.

# Plot the previous information:
#   * the residuals committed by the first tree
#   * the residual predictions
#   * the residuals of the residual predictions

sns.scatterplot(x=data_train["Feature"], y=residuals,
                color="black", alpha=0.5)
plt.plot(data_test["Feature"], target_test_predicted_residuals, "--")
for value, true, predicted in zip(data_train["Feature"],
                                  residuals,
                                  target_train_predicted_residuals):
    lines_residuals = plt.plot([value, value], [true, predicted], color="red")

# Highlight the sample of interest
plt.scatter(sample, target_true_residual, label="Sample of interest",
            color="tab:orange", s=200)
plt.xlim([-1, 0])
plt.legend()
_ = plt.title("Prediction of the residuals")

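For the sample of interest, our initial tree makes an error (a small residual). When the second tree is fitted, the residual is, in this case, fitted and predicted perfectly. We will check this prediction quantitatively using the fitted trees. First, let us check the prediction of the initial tree and compare it with the true value.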

print(f"True value to predict for "
      f"f(x={x_sample:.3f}) = {target_true:.3f}")

y_pred_first_tree = tree.predict(sample)[0]
print(f"Prediction of the first decision tree for x={x_sample:.3f}: "
      f"y={y_pred_first_tree:.3f}")
print(f"Error of the tree: {target_true - y_pred_first_tree:.3f}")

Output:
True value to predict for f(x=-0.517) = -0.393
Prediction of the first decision tree for x=-0.517: y=-0.145
Error of the tree: -0.248
Now we can use the second tree to try to predict this residual.

print(f"Prediction of the residual for x={x_sample:.3f}: "
      f"{tree_residuals.predict(sample)[0]:.3f}")

Output:
Prediction of the residual for x=-0.517: -0.248
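We see that our second tree predicts the exact residual (error) of the first tree. We can therefore predict the value for x by summing the predictions of all the trees in the ensemble.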

y_pred_first_and_second_tree = (
    y_pred_first_tree + tree_residuals.predict(sample)[0]
)
print(f"Prediction of the first and second decision trees combined for "
      f"x={x_sample:.3f}: y={y_pred_first_and_second_tree:.3f}")
print(f"Error of the tree: {target_true - y_pred_first_and_second_tree:.3f}")

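We chose a sample for which two trees are enough to make a perfect prediction. However, the earlier plot shows that two trees are not enough to correct the residuals of all samples. More trees therefore have to be added to the ensemble to correct the errors successfully (i.e. the second tree corrects the error of the first tree, the third tree corrects the error of the second tree, and so on).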
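
The effect of adding more trees can be observed with scikit-learn's GradientBoostingRegressor, whose staged_predict method yields the prediction of the ensemble after each additional tree. The sketch below regenerates a similar synthetic dataset so that it runs on its own; the hyperparameter values (100 trees, max_depth=3, learning_rate=0.1) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data similar to the dataset used in this example.
rng = np.random.RandomState(0)
x = rng.uniform(-1.4, 1.4, size=50)
y = x ** 3 - 0.5 * x ** 2 + rng.normal(scale=0.3, size=50)
X = x.reshape(-1, 1)

gbdt = GradientBoostingRegressor(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=0
)
gbdt.fit(X, y)

# Training error of the ensemble after 1, 2, ..., 100 trees.
train_mse = [mean_squared_error(y, pred) for pred in gbdt.staged_predict(X)]

for n_trees in (1, 2, 10, 100):
    print(f"{n_trees:3d} trees: training MSE = {train_mse[n_trees - 1]:.4f}")
```

The training error keeps shrinking as trees are added. That says nothing about generalization, so the number of trees should be chosen on held-out data.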

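Algorithm parameters

GBDT has quite a few hyperparameters; here are some that are both important and specific to it.

loss: {'squared_error', 'absolute_error', 'huber', 'quantile'}, default='squared_error'
'squared_error' is the squared error for regression. 'absolute_error' is the absolute error for regression, a robust loss function. 'huber' is a combination of the two. 'quantile' allows quantile regression (use alpha to specify the quantile).
subsample: float, default=1.0
The fraction of samples used to fit each individual base learner. A value smaller than 1.0 results in stochastic gradient boosting. Choosing subsample < 1.0 reduces the variance and increases the bias.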
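
Both parameters can be tried together in a short experiment. The sketch below fits two GradientBoostingRegressor models with loss="quantile", one for the 10% quantile and one for the 90% quantile, each with subsample=0.8. The synthetic data and all parameter values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (illustrative only).
rng = np.random.RandomState(0)
X = rng.uniform(-1.4, 1.4, size=(200, 1))
y = X[:, 0] ** 3 - 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

coverage = {}
for alpha in (0.1, 0.9):
    model = GradientBoostingRegressor(
        loss="quantile",   # quantile regression
        alpha=alpha,       # which quantile to estimate
        subsample=0.8,     # each tree sees 80% of the samples
        n_estimators=100,
        random_state=0,
    )
    model.fit(X, y)
    # Fraction of training targets lying at or below the predicted quantile.
    coverage[alpha] = np.mean(y <= model.predict(X))

for alpha, fraction in coverage.items():
    print(f"alpha={alpha}: {fraction:.0%} of the targets are below the prediction")
```

The two models together give a rough prediction interval: on the training data, most targets lie below the 0.9-quantile prediction and only a few lie below the 0.1-quantile prediction.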

Posted on 2022-03-28 21:18 by Asp1rant
