bootstrap && bagging && decision trees && random forests

I read an article introducing these concepts and am keeping a few notes here. Original article:

https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/

 

1.Bootstrap Method

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

In other words, the bootstrap is a statistical method for estimating a property of a dataset, such as its mean or standard deviation; because the estimate is averaged over many resamples, it tends to be more reliable than a single calculation when the data contain noise or errors.

Concretely: create many sub-samples of the dataset (drawn with replacement), compute the statistic of interest (e.g. the mean) on each sub-sample, and finally average the results.
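A minimal sketch of this procedure in Python (the data, sample size, and number of resamples here are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)  # hypothetical sample

n_boot = 100
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # draw a sub-sample of the same size, with replacement
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means[i] = resample.mean()

# the average of the per-resample means is the bootstrap estimate;
# their spread estimates the uncertainty of that estimate
print("bootstrap mean:", boot_means.mean())
print("std. error:   ", boot_means.std())
```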

 

 

2.Bootstrap Aggregation (Bagging)

Bagging is an ensemble method. An ensemble method is a technique that combines the predictions from multiple machine learning models, producing a result that is better than any single model's prediction.

 

Bootstrap Aggregation is a general procedure that can be used to reduce the variance of algorithms that have high variance. Algorithms with high variance include decision trees, like classification and regression trees (CART).

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

In other words, bagging is the application of the bootstrap procedure to high-variance algorithms in order to reduce their variance; this is where terms like "5-bagged decision trees" come from.

The procedure is simple and mirrors the bootstrap: draw sub-samples of the dataset with replacement, train a decision tree on each sub-sample, and finally combine the trees' predictions; a code sketch follows the steps below. For example:

Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

1.Create many (e.g. 100) random sub-samples of our dataset with replacement.
2.Train a CART model on each sample.
3.Given a new dataset, calculate the average prediction from each model.
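A minimal sketch of these three steps in Python, using scikit-learn's DecisionTreeRegressor as the CART model (the dataset is synthetic, purely for illustration; in practice sklearn.ensemble.BaggingRegressor wraps the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# hypothetical dataset: 1000 instances, 5 features
X = rng.normal(size=(1000, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=1000)

# 1. create 100 random sub-samples with replacement
# 2. train a CART model on each sub-sample
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap indices
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# 3. given new data, average the predictions from each model
X_new = rng.normal(size=(10, 5))
y_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
print(y_pred)
```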

 

3.Random Forest

Random forests are motivated by a weakness of bagged trees: even when trained on different sub-samples, decision trees are greedy and all search for the same optimal splits, so the resulting trees end up highly correlated with one another, which limits how much combining them can help.

Random forests address this by limiting the number of features the tree may choose from at each split point, injecting additional randomness that decorrelates the trees. A good default for the number of features m is (see the sketch below):

  • For classification a good default is: m = sqrt(p)
  • For regression a good default is: m = p/3

where p is the number of input variables, i.e. the number of features.
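As a sketch of how this looks in practice (assuming scikit-learn; the synthetic dataset is for illustration only), RandomForestClassifier exposes the per-split feature limit as max_features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic classification problem with p = 16 features (illustrative)
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# max_features caps how many features each split point may consider;
# "sqrt" gives m = sqrt(p), the classification default above
# (for regression, a float like max_features=1/3 gives m = p/3)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X, y)
print(forest.score(X, y))
```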

posted on 2018-10-27 12:36  洛珈山下
