Supervised Learning Ensemble Models: Random Forest

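Boosting and Bagging

Boosting and Bagging are both ensemble learning frameworks in machine learning. Ensemble learning combines multiple weak classifiers into one strong classifier that draws on the strengths of each weak classifier and so performs better than any of them alone.

The general Boosting procedure is as follows. Take classification as an example: given a training set, training a weak classifier is much easier than training a strong one. Starting from the first weak classifier, Boosting trains a sequence of weak classifiers and keeps changing the probability distribution over the training samples, so that each round pays more attention to the samples the previous weak classifier got wrong. Combining these weak classifiers yields a much stronger classifier.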
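
To make the idea of "changing the probability distribution over the training samples" concrete, here is a minimal sketch of one reweighting round. It assumes an AdaBoost-style update with labels in {-1, +1}; the text above does not name a specific Boosting algorithm, so the function name reweight and the update rule are illustrative choices, not the only way to do it.

```python
import numpy as np

def reweight(weights, y_true, y_pred):
    """One AdaBoost-style update of the sample distribution (labels in {-1, +1}).

    Assumes the weighted error is strictly between 0 and 1.
    """
    miss = (y_true != y_pred)
    # Weighted error rate of the current weak classifier
    err = np.sum(weights[miss]) / np.sum(weights)
    # Weight of this classifier in the final vote: larger when the error is smaller
    alpha = 0.5 * np.log((1 - err) / err)
    # Misclassified samples (y_true * y_pred == -1) are scaled up, the rest scaled down
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    # Normalize so the weights again form a probability distribution
    return new_weights / new_weights.sum(), alpha

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified
w0 = np.full(5, 0.2)                    # start from a uniform distribution
w1, alpha = reweight(w0, y_true, y_pred)
print(w1)
```

After the update the misclassified sample carries more weight than before, so the next weak classifier is pushed to get it right.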

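Bagging is a different ensemble learning framework from Boosting: it draws different subsets from the dataset itself by sampling, trains a base classifier on each subset, and then combines the models. Bagging is a parallel ensemble method. Its core concept is bootstrap sampling: given a dataset of m samples, randomly draw one sample with replacement and put it into the sampled set; after m draws we have a sampled set the same size as the original dataset. We can draw T such sampled sets of m samples each, train one base learner on each set, and finally combine these base learners. This is the main idea of Bagging.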
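
A short sketch of bootstrap sampling as just described, using NumPy. The dataset here is a stand-in array of indices, and the size m = 10000 is an arbitrary choice for illustration. Because sampling is with replacement, some samples appear several times and others not at all; for large m the fraction of distinct samples approaches 1 - 1/e, roughly 63%.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10000
data = np.arange(m)                    # stand-in dataset with m samples

# Draw m indices with replacement: one bootstrap sample
idx = rng.integers(0, m, size=m)
sample = data[idx]

# Fraction of the original samples that appear at least once
unique_fraction = np.unique(sample).size / m
# Samples never drawn in this round
not_drawn = np.setdiff1d(data, sample)

print(sample.shape, unique_fraction, not_drawn.size)
```

Repeating the draw T times gives the T sampled sets on which the base learners are trained.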

As we can see, Bagging is a parallel framework, while Boosting is a sequential one (although it can also be implemented with some parallelism).

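How Random Forest Works

A random forest is a forest built from many decision trees; it is called random because of the randomness in how it is built. The random forest algorithm is a typical representative of the Bagging framework. Put simply, the procedure comes down to two sources of randomness:

1. Given M samples, randomly draw M samples with replacement (draw one, put it back, and draw again).
2. Given N features per sample, whenever a node of the decision tree needs to split, randomly pick n of the N features, with n << N, and choose the splitting feature from those n.
3. Build a decision tree from the M sampled rows, splitting each node on the n selected features as described.
4. Repeat steps 1 to 3 to build a large number of decision trees that form the random forest, then combine the results of all trees (majority vote for classification, averaging for regression).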
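
The combination rule in step 4 can be sketched as follows. The prediction arrays are made up for illustration (rows are trees, columns are samples), and the helper names majority_vote and average are hypothetical.

```python
import numpy as np

def majority_vote(preds):
    # For each sample (column), return the most frequent non-negative integer label
    return np.array([np.bincount(col).argmax() for col in preds.T])

def average(preds):
    # For each sample (column), return the mean prediction over all trees
    return preds.mean(axis=0)

# Hypothetical class predictions from 3 trees for 4 samples
class_preds = np.array([[0, 1, 1, 2],
                        [0, 1, 0, 2],
                        [1, 1, 0, 2]])
votes = majority_vote(class_preds)

# Hypothetical regression predictions from 3 trees for 2 samples
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 3.0],
                      [3.0, 6.0]])
means = average(reg_preds)

print(votes, means)
```

Note that np.bincount breaks ties in favor of the smallest label, which is one of several reasonable tie-breaking choices.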

So once we are familiar with the basic idea of Bagging and with how a decision tree is built, random forest is easy to understand.

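Implementing Random Forest with NumPy

Defining the bootstrap sampling function

Given the input and output data and the number of trees, build multiple sampled subsets by random sampling.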

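import numpy as np

# Select training subsets by bootstrap sampling
# Note: n_estimators (the number of trees) is read from module scope,
# so it must be defined before this function is called
def bootstrap_sampling(X, y):
    # Merge inputs and labels so rows stay aligned
    X_y = np.concatenate([X, y.reshape(-1,1)], axis=1)
    # Shuffle the rows
    np.random.shuffle(X_y)
    # Number of samples
    n_samples = X.shape[0]
    # Initialize the list of sampled subsets
    sampling_subsets = []

    for _ in range(n_estimators):
        # First source of randomness: row sampling with replacement
        idx1 = np.random.choice(n_samples, n_samples, replace=True)
        bootstrap_Xy = X_y[idx1, :]
        bootstrap_X = bootstrap_Xy[:, :-1]
        bootstrap_y = bootstrap_Xy[:, -1]
        sampling_subsets.append([bootstrap_X, bootstrap_y])
    return sampling_subsets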

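Here we build the random forest from classification trees. We define a list named trees to hold the forest's decision trees and fill it by constructing one tree per iteration.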

# BinaryDecisionTree and calculate_gini are not defined in this section;
# they are assumed to be available from a separate decision tree implementation
class ClassificationTree(BinaryDecisionTree):
    ### Gini impurity of a split: weighted Gini of the two child nodes
    def _calculate_gini_impurity(self, y, y1, y2):
        p = len(y1) / len(y)
        gini_impurity = p * calculate_gini(y1) + (1-p) * calculate_gini(y2)
        return gini_impurity
    
    ### Majority vote
    def _majority_vote(self, y):
        most_common = None
        max_count = 0
        for label in np.unique(y):
            # Count how often each label occurs and keep the most frequent one
            count = len(y[y == label])
            if count > max_count:
                most_common = label
                max_count = count
        return most_common
    
    # Fit the classification tree
    def fit(self, X, y):
        self.impurity_calculation = self._calculate_gini_impurity
        self._leaf_value_calculation = self._majority_vote
        super().fit(X, y)

# Number of trees
n_estimators = 10
trees = []
# Build the forest from decision trees
for _ in range(n_estimators):
    tree = ClassificationTree(min_samples_split=2, min_gini_impurity=999,
                              max_depth=3)
    trees.append(tree)
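
ClassificationTree calls calculate_gini, which is not shown in this section. As a minimal sketch, assuming it computes the standard Gini impurity 1 - sum(p_k^2) over the class proportions p_k, it might look like the following. The standalone function weighted_gini is a hypothetical name that mirrors what _calculate_gini_impurity does.

```python
import numpy as np

def calculate_gini(y):
    # Gini impurity of a label array: 1 - sum of squared class proportions
    y = np.asarray(y)
    if y.size == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / y.size
    return 1.0 - np.sum(p ** 2)

def weighted_gini(y, y1, y2):
    # Impurity of a split: child impurities weighted by child size
    p = len(y1) / len(y)
    return p * calculate_gini(y1) + (1 - p) * calculate_gini(y2)

y = np.array([0, 0, 1, 1])
print(calculate_gini(y))                 # balanced two-class node
print(weighted_gini(y, y[:2], y[2:]))    # split that separates the classes
print(weighted_gini(y, y[::2], y[1::2])) # split that leaves both children mixed
```

A lower weighted value means purer child nodes, which is why the tree prefers the split with the smallest weighted Gini impurity.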

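Using the trees list, we define the training method for the forest. Training draws one bootstrap subset per tree and fits each tree in the trees list on its own subset; the result is a random forest model made up of the trained decision trees.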

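# Train the random forest
def fit(X, y):
    # Each tree is trained on its own doubly randomized subset (rows and columns)
    n_features = X.shape[1]
    # max_features was undefined here; assume sqrt(n_features),
    # the same default the RandomForest class uses
    max_features = int(np.sqrt(n_features))
    sub_sets = bootstrap_sampling(X, y)
    # Fit every tree in turn
    for i in range(n_estimators):
        sub_X, sub_y = sub_sets[i]
        # Second source of randomness: column sampling.
        # Drawn without replacement so no feature is duplicated. Features are
        # sampled once per tree here, a simplification of per-node sampling.
        idx2 = np.random.choice(n_features, max_features, replace=False)
        sub_X = sub_X[:, idx2]
        trees[i].fit(sub_X, sub_y)
        # Remember the columns this tree was trained on
        trees[i].feature_indices = idx2
        print('Tree {} has been trained.'.format(i+1))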

Wrapping the steps above into a class, we define the bootstrap sampling, training, and prediction methods. The complete code:

class RandomForest():
    def __init__(self, n_estimators=100, min_samples_split=2, min_gain=0,
                 max_depth=float("inf"), max_features=None):
        # Number of trees
        self.n_estimators = n_estimators
        # Minimum number of samples required to split a node
        self.min_samples_split = min_samples_split
        # Minimum gain
        self.min_gain = min_gain
        # Maximum tree depth
        self.max_depth = max_depth
        # Maximum number of features used per tree
        self.max_features = max_features

        self.trees = []
        # Build the forest from decision trees
        for _ in range(self.n_estimators):
            # The keyword min_gini_impurity matches the one used for the standalone
            # trees list; adjust it if BinaryDecisionTree names the parameter differently
            tree = ClassificationTree(min_samples_split=self.min_samples_split,
                                      min_gini_impurity=self.min_gain,
                                      max_depth=self.max_depth)
            self.trees.append(tree)
            
    # Bootstrap sampling
    def bootstrap_sampling(self, X, y):
        X_y = np.concatenate([X, y.reshape(-1,1)], axis=1)
        np.random.shuffle(X_y)
        n_samples = X.shape[0]
        sampling_subsets = []

        for _ in range(self.n_estimators):
            # First source of randomness: row sampling with replacement
            idx1 = np.random.choice(n_samples, n_samples, replace=True)
            bootstrap_Xy = X_y[idx1, :]
            bootstrap_X = bootstrap_Xy[:, :-1]
            bootstrap_y = bootstrap_Xy[:, -1]
            sampling_subsets.append([bootstrap_X, bootstrap_y])
        return sampling_subsets
            
    # Train the random forest
    def fit(self, X, y):
        # Each tree is trained on its own doubly randomized subset (rows and columns)
        sub_sets = self.bootstrap_sampling(X, y)
        n_features = X.shape[1]
        # Default max_features to sqrt(n_features)
        if self.max_features is None:
            self.max_features = int(np.sqrt(n_features))
        
        for i in range(self.n_estimators):
            # Second source of randomness: column sampling, without replacement
            # so that no feature is duplicated within a tree
            sub_X, sub_y = sub_sets[i]
            idx2 = np.random.choice(n_features, self.max_features, replace=False)
            sub_X = sub_X[:, idx2]
            self.trees[i].fit(sub_X, sub_y)
            # Save the sampled column indices so each tree can reuse them at prediction time
            self.trees[i].feature_indices = idx2
            print('Tree {} has been trained.'.format(i+1))
    
    # Predict with the random forest
    def predict(self, X):
        # Initialize the list of per-tree predictions
        y_preds = []
        # Predict with every tree on the columns it was trained on
        for i in range(self.n_estimators):
            idx = self.trees[i].feature_indices
            sub_X = X[:, idx]
            y_pred = self.trees[i].predict(sub_X)
            y_preds.append(y_pred)
        # Combine the classification results: one row per sample after transposing
        y_preds = np.array(y_preds).T
        res = []
        # Take the majority class as the predicted class
        for j in y_preds:
            res.append(np.bincount(j.astype('int')).argmax())
        return res

Testing on data

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Generate a simulated binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Create a random forest model instance
rf = RandomForest(n_estimators=10, max_features=15)
# Train the model
rf.fit(X_train, y_train)
# Predict
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))

Implementing Random Forest with sklearn

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
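
For comparison with the hand-written class, the sketch below spells out the RandomForestClassifier parameters that correspond to the two sources of randomness. It is self-contained and regenerates the data, so the fixed random_state on the split and the specific parameter values are choices made for this example, not settings taken from the code above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_redundant=0,
                           n_informative=2, random_state=1,
                           n_clusters_per_class=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(
    n_estimators=10,       # number of trees in the forest
    max_depth=3,           # maximum depth of each tree
    max_features="sqrt",   # features considered at each split (column randomness)
    bootstrap=True,        # sample rows with replacement (row randomness)
    random_state=0,
)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(len(clf.estimators_), acc)
```

Unlike the NumPy version, which samples columns once per tree, sklearn's max_features is applied at every node split, which matches step 2 of the algorithm description.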

Posted 2022-08-08 19:46 by 王陸
