# Regularization on GBDT

### Early Stopping

Early stopping is a common technique for preventing overfitting in iteratively trained machine learning models. Wikipedia describes it as follows:

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent.

The XGBoost Python package documents its early stopping parameter very clearly. The API looks like this:

    # code snippets from xgboost python-package training.py
    def train(..., evals=(), early_stopping_rounds=None):
        """Train a booster with given parameters.
        Parameters
        ----------
        early_stopping_rounds: int
            Activates early stopping. Validation error needs to decrease at least
            every <early_stopping_rounds> round(s) to continue training.
        """
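
The rule in that docstring can be traced by hand on a list of per-round validation errors. The following is a minimal pure-Python sketch of the rule only, not XGBoost's implementation; the function name `stopping_round` and the error values are made up for illustration.

```python
def stopping_round(validation_errors, early_stopping_rounds):
    """Return (last_round_trained, best_round) for a sequence of validation errors.

    Training stops as soon as the error has failed to improve on its best
    value for `early_stopping_rounds` consecutive rounds.
    """
    best_error = float("inf")
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            # New best: remember it and keep training.
            best_error, best_round = error, round_index
        elif round_index - best_round >= early_stopping_rounds:
            # Too many rounds without improvement: stop here.
            return round_index, best_round
    # The error list ran out before the stopping rule fired.
    return len(validation_errors) - 1, best_round


# Hypothetical validation errors, one per boosting round.
errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
last_round, best_round = stopping_round(errors, early_stopping_rounds=3)
print(f"stopped after round {last_round}, best round was {best_round}")
```

With a patience of 3 the loop stops before it ever sees the final value 0.30, which shows the usual trade-off: a small `early_stopping_rounds` saves time but can stop on a temporary plateau.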


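Sklearn's GBDT implementation can also do early stopping, but it is more involved. There is no official documentation or code sample for it, so you have to read the source. You must supply a monitor callback and understand the locals of the internal _fit_stages function, which makes it unfriendly to newcomers: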

    # code snippets from sklearn.ensemble.gradient_boosting
    _LearntSelectorMixin)):
        """Abstract base class for Gradient Boosting. """
        ...
        def fit(self, X, y, sample_weight=None, monitor=None):
            """Fit the gradient boosting model.
            Parameters
            ----------
            monitor : callable, optional
                The monitor is called after each iteration with the current
                iteration, a reference to the estimator and the local variables of
                _fit_stages as keyword arguments callable(i, self,
                locals()). If the callable returns True the fitting procedure
                is stopped. The monitor can be used for various things such as
                computing held-out estimates, early stopping, model introspect, and
                snapshoting.
            """
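
A monitor that implements early stopping could look roughly like the sketch below. It relies only on the contract stated in the docstring (called as `callable(i, self, locals())`, returning `True` stops fitting). The class name `EarlyStoppingMonitor` and the `validation_loss` argument are hypothetical: `validation_loss` is assumed to be a function you write yourself that returns the held-out loss after iteration `i`.

```python
class EarlyStoppingMonitor:
    """Stop fitting once the validation loss has not improved for `patience` iterations."""

    def __init__(self, validation_loss, patience):
        # Hypothetical user-supplied callable: validation_loss(i, estimator) -> float.
        self.validation_loss = validation_loss
        self.patience = patience
        self.best_loss = float("inf")
        self.best_iteration = -1

    def __call__(self, i, estimator, local_vars):
        # Same three positional arguments the docstring describes.
        loss = self.validation_loss(i, estimator)
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_iteration = i
        # Returning True asks the fitting loop to stop.
        return i - self.best_iteration >= self.patience


# Stand-in for the fitting loop, driven by made-up held-out losses.
fake_losses = [1.00, 0.80, 0.90, 0.85, 0.95, 0.70]
monitor = EarlyStoppingMonitor(lambda i, est: fake_losses[i], patience=3)
for i in range(len(fake_losses)):
    if monitor(i, None, {}):
        print(f"stop at iteration {i}, best iteration {monitor.best_iteration}")
        break
```

With sklearn the object would be passed as `est.fit(X_train, y_train, monitor=monitor)`. How `validation_loss` obtains the held-out predictions is left open here, because that is exactly the part that requires reading the source.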


### Shrinkage

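Shrinkage multiplies the output of each tree by a factor $\nu$ ($0 < \nu < 1$). Here $\sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$ is the output of the $m$-th tree, and $f_m(x)$ is the ensemble of the first $m$ trees:

$$f_m(x) = f_{m-1}(x) + \nu \cdot \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$$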

The ESL book (The Elements of Statistical Learning) puts it this way:

The parameter $\nu$ can be regarded as controlling the learning rate of the boosting procedure.

    # code snippets from sklearn.ensemble.gradient_boosting
    """Gradient Boosting for classification."""

    def __init__(self, ..., learning_rate=0.1, n_estimators=100, ...):
        """
        Parameters
        ----------
        learning_rate : float, optional (default=0.1)
            learning rate shrinks the contribution of each tree by learning_rate.
            There is a trade-off between learning_rate and n_estimators.
        n_estimators : int (default=100)
            The number of boosting stages to perform. Gradient boosting
            is fairly robust to over-fitting so a large number usually
            results in better performance
        """
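
To make the trade-off between the learning rate and the number of trees concrete, here is a small pure-Python sketch of the shrinkage update for squared loss. It is a toy under stated assumptions, not sklearn's code: the base learner is a one-feature regression stump, the data are made up, the inputs must contain at least two distinct values, and the names `fit_stump`, `boost` and `mse` are illustrative.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: a threshold and two leaf values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # splits that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        gamma_left = sum(left) / len(left)
        gamma_right = sum(right) / len(right)
        sse = (sum((r - gamma_left) ** 2 for r in left)
               + sum((r - gamma_right) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, gamma_left, gamma_right)
    _, threshold, gamma_left, gamma_right = best
    return lambda x: gamma_left if x <= threshold else gamma_right


def boost(xs, ys, n_estimators, learning_rate):
    """Squared-loss boosting: f_m(x) = f_{m-1}(x) + learning_rate * tree_m(x)."""
    predictions = [0.0] * len(xs)  # f_0(x) = 0
    for _ in range(n_estimators):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        # Shrinkage: only a fraction of the tree's output is added.
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return predictions


def mse(ys, predictions):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
for learning_rate, n_estimators in [(1.0, 1), (0.1, 1), (0.1, 50)]:
    error = mse(ys, boost(xs, ys, n_estimators, learning_rate))
    print(f"nu={learning_rate}, trees={n_estimators}: training MSE = {error:.4f}")
```

On this toy data one unshrunk stump already fits the training set, while with a learning rate of 0.1 a single tree moves the prediction only a tenth of the way, so many more trees are needed to reach a similar training error. Training error alone says nothing about the regularizing effect; that would have to be measured on held-out data.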


### Subsampling

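Subsampling derives from the idea of bootstrap averaging (bagging). In GBDT, each round builds its tree on a fraction $\eta$ of the training set, drawn at random without replacement; a typical value of $\eta$ is 0.5. This both regularizes the model and reduces computation time.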

    # code snippets from sklearn.ensemble.gradient_boosting
    """Gradient Boosting for classification."""

    def __init__(self, ..., subsample=1.0, max_features=None,...):
        """
        Parameters
        ----------
        subsample : float, optional (default=1.0)
            The fraction of samples to be used for fitting the individual base
            learners. If smaller than 1.0 this results in Stochastic Gradient
            Boosting. subsample interacts with the parameter n_estimators.
            Choosing subsample < 1.0 leads to a reduction of variance
            and an increase in bias.
        max_features : int, float, string or None, optional (default=None)
            The number of features to consider when looking for the best split:
        """
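
The per-round sampling step can be sketched in a few lines of standard-library Python. This only illustrates drawing a fraction of the rows without replacement; the helper name `draw_subsample` and the rounding rule (truncate, but keep at least one row) are assumptions of the sketch, not a statement about any library's internals.

```python
import random


def draw_subsample(n_samples, subsample, rng):
    """Indices of a random fraction of the training set, drawn without replacement."""
    n_inbag = max(1, int(subsample * n_samples))
    return rng.sample(range(n_samples), n_inbag)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_samples, eta = 10, 0.5
for boosting_round in range(3):
    in_bag = sorted(draw_subsample(n_samples, eta, rng))
    # Only these rows would be used to fit the tree of this round.
    print(f"round {boosting_round}: fit tree on rows {in_bag}")
```

Each round sees a different half of the data, which is where the regularizing effect comes from, and each tree is fit on half as many rows, which is where the time saving comes from.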


### Regularized Learning Objective

$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$
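
As a worked example, the objective can be evaluated directly for one candidate tree. The sketch below assumes the usual reading of the symbols, in which $T$ is the number of leaves of the tree and $w$ is the vector of its leaf weights, and it uses squared error as the loss $l$. The function names and the numbers are illustrative only.

```python
def omega(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * ||w||^2, with T the number of leaves."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def squared_error(y, prediction):
    """Per-sample loss l(y, prediction)."""
    return (y - prediction) ** 2


def regularized_objective(ys, previous_predictions, tree_outputs,
                          leaf_weights, gamma, lam, loss=squared_error):
    """L^(t) = sum_i l(y_i, yhat_i^(t-1) + f_t(x_i)) + Omega(f_t)."""
    data_term = sum(loss(y, prev + out)
                    for y, prev, out in zip(ys, previous_predictions, tree_outputs))
    return data_term + omega(leaf_weights, gamma, lam)


# Two samples; the candidate tree has two leaves and sends one sample to each.
ys = [1.0, 3.0]
previous_predictions = [0.5, 0.5]  # yhat^(t-1)
leaf_weights = [0.5, 2.0]          # w
tree_outputs = [0.5, 2.0]          # f_t(x_i): the weight of the leaf each sample lands in
value = regularized_objective(ys, previous_predictions, tree_outputs,
                              leaf_weights, gamma=1.0, lam=2.0)
print(f"data term + penalty = {value}")
```

Here the data term is 0.25 and the penalty is 6.25, so a tree that fits well can still score poorly if it has many leaves or large leaf weights.
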

### Dropout

Dropout is a widely used regularization technique in deep learning, so it is natural to ask whether it can be applied to GBDT models. An AISTATS 2015 paper, "DART: Dropouts meet Multiple Additive Regression Trees", explores this idea.

Trees added at later iterations tend to impact the prediction of only a few instances, and they make negligible contribution towards the prediction of all the remaining instances. We call this issue of subsequent trees affecting the prediction of only a small fraction of the training instances over-specialization.

DART diverges from MART at two places. First, when computing the gradient that the next tree will fit, only a random subset of the existing ensemble is considered. The second place at which DART diverges from MART is when adding the new tree to the ensemble where DART performs a normalization step.
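
The two differences can be sketched as a single boosting round in pure Python. Several things here are assumptions rather than facts taken from the quotes above: the loss is squared error, each existing tree is dropped independently with probability `drop_rate`, and the normalization scales the new tree by 1/(k+1) and the k dropped trees by k/(k+1), which is my reading of the paper's normalization step. `fit_mean` is a toy stand-in for a real tree learner, and all names are illustrative.

```python
import random


def fit_mean(xs, residuals):
    """Toy stand-in for a regression tree: predicts the mean residual everywhere."""
    mean = sum(residuals) / len(residuals)
    return lambda x: mean


def dart_round(trees, weights, xs, ys, fit_tree, drop_rate, rng):
    """Run one DART-style round for squared loss, updating trees and weights in place."""
    # 1. Choose the random subset of existing trees to drop for this round.
    dropped = [i for i in range(len(trees)) if rng.random() < drop_rate]
    kept = [i for i in range(len(trees)) if i not in dropped]

    # 2. Compute the residuals (negative gradient) from the kept trees only.
    def kept_prediction(x):
        return sum(weights[i] * trees[i](x) for i in kept)

    residuals = [y - kept_prediction(x) for x, y in zip(xs, ys)]

    # 3. Fit the new tree, then normalize it together with the dropped trees.
    new_tree = fit_tree(xs, residuals)
    k = len(dropped)
    for i in dropped:
        weights[i] *= k / (k + 1)
    trees.append(new_tree)
    weights.append(1.0 / (k + 1))
    return dropped


rng = random.Random(0)  # fixed seed so the sketch is reproducible
xs, ys = [0.0, 1.0], [2.0, 4.0]
trees, weights = [], []
for _ in range(5):
    dropped = dart_round(trees, weights, xs, ys, fit_mean, drop_rate=0.5, rng=rng)
    print(f"dropped {dropped}, weights now {[round(w, 3) for w in weights]}")
```

When nothing is dropped (k = 0) the new tree gets weight 1 and the round reduces to an ordinary MART step without shrinkage. When trees are dropped, the new tree fits roughly what the dropped trees were already explaining, so the rescaling keeps their combined contribution from being counted twice.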

