[Machine Learning in Action] -- Titanic Dataset (7) -- Bagging Ensembles (Random Forest)
1. Preface:
This article belongs to the hands-on part of the series and focuses on applying the algorithm in a real project. For a deeper understanding of the Bagging / random forest algorithm itself, the following references are recommended; they helped me a great deal while learning:
[1] Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
[2] The official sklearn Ensemble methods documentation: https://scikit-learn.org/stable/modules/ensemble.html#
[3] A summary of Bagging and random forests (in Chinese): https://www.cnblogs.com/pinard/p/6156009.html
2. Dataset:
Dataset: https://www.kaggle.com/c/titanic
The Titanic dataset is one of the most popular competitions on Kaggle. The data is small and simple, which makes it a good playground for beginners to get hands-on experience and to compare different machine learning algorithms in depth.
The dataset contains 11 variables: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked, which are used to predict whether a passenger survived the Titanic disaster.
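Before building the pipeline in section 4, it can help to take a quick look at the raw file. A minimal sketch, assuming train.csv has been downloaded from the competition page into the working directory:

```python
import pandas as pd

data = pd.read_csv('train.csv')       # training file from the Kaggle competition
print(data.shape)                     # (891, 12): 11 feature columns plus the Survived label
print(data.isnull().sum())            # Age, Cabin and Embarked contain missing values
```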
3. Algorithm overview:
Bagging is a family of ensemble learning methods distinct from boosting. As introduced earlier, boosting trains its learners sequentially, each predictor building on and correcting its predecessor. Bagging is different: its weak learners have no dependence on one another and can be trained in parallel.
3.1 The basic idea of Bagging
A diagram in [2] illustrates the idea of Bagging nicely. Each weak learner is trained on a different sample of the data, obtained by random sampling, and in most cases sampling with replacement, so a given training instance can in principle be drawn several times. This sampling scheme is known as bootstrap sampling.
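As an illustration (not from the original code), a minimal NumPy sketch of bootstrap sampling; the array X and the sample count m are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10
X = np.arange(m)                      # placeholder "dataset" of m samples

# Bootstrap: draw m indices with replacement; some repeat, some never appear
boot_idx = rng.integers(0, m, size=m)
print(X[boot_idx])                    # the bootstrap sample for one weak learner

# The instances never drawn in this round are the out-of-bag samples
oob_mask = ~np.isin(np.arange(m), boot_idx)
print(X[oob_mask])
```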

Precisely because of this random sampling, Bagging comes with out-of-bag (OOB) data for free. Suppose there are $m$ samples and we draw $m$ times with replacement; the probability that a particular sample is never drawn is $(1-\frac{1}{m})^{m}$, which tends to $1/e \approx 37\%$ as $m$ grows. So roughly 63% of the samples appear in each bootstrap sample, and the remaining ~37% are the out-of-bag data, which can serve as a natural validation set for assessing the model's generalization performance.
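The limit above is easy to verify numerically, and scikit-learn exposes the OOB estimate directly through oob_score=True. A minimal sketch on synthetic data from make_classification, not on the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Numerical check of the OOB fraction: (1 - 1/m)^m -> 1/e ≈ 0.368 for large m
m = 1000
print((1 - 1 / m) ** m)

# oob_score=True evaluates each tree on the samples it never saw during training
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # accuracy estimated from the out-of-bag samples
```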
3.2 Random forest overview
A random forest is a Bagging algorithm whose weak learners are CART decision trees.
Random forests tend to generalize well, mainly because:
(1) bootstrap sampling is performed by default;
(2) when splitting a node in each CART weak learner, only the best feature among a randomly chosen subset of n features is considered (n ≤ total number of features).
There is also a variant of the random forest that pushes the randomization even further in pursuit of better generalization, called Extra-Trees (extremely randomized trees). Its characteristics are (a minimal comparison of the two classifiers follows this list):
(1) bootstrap sampling is not used by default;
(2) when splitting a node in each CART weak learner, n features are chosen at random, a split threshold is drawn at random for each of them, and the best of these random splits (feature plus threshold) is used as the splitting rule.
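A minimal side-by-side sketch of the two classifiers in scikit-learn; the parameters shown here are illustrative, not the tuned values used later in this post:

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Random forest: bootstrap sampling on, best split among a random subset of features
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', bootstrap=True, random_state=0)

# Extra-Trees: no bootstrap by default, split thresholds drawn at random per candidate feature
et = ExtraTreesClassifier(n_estimators=100, max_features='sqrt', bootstrap=False, random_state=0)
```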
4. Hands-on:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, cross_validate
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# Load data
data = pd.read_csv('train.csv')

data_x = data.drop('Survived', axis=1)
data_y = data['Survived']

# Data cleaning: group the features by type
cat_attribs = ['Pclass', 'Sex', 'Embarked']
dis_attribs = ['SibSp', 'Parch']
con_attribs = ['Age', 'Fare']

# encoder options: OneHotEncoder() or OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

data_x_cleaned = full_pipeline.fit_transform(data_x)

X_train, X_test, y_train, y_test = train_test_split(data_x_cleaned, data_y, stratify=data_y, test_size=0.25, random_state=1992)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Random forest: fix the tree-shape parameters and grid-search the number of trees
clf_rf = RandomForestClassifier(max_depth=3, min_samples_leaf=8, max_features='sqrt', min_samples_split=16, bootstrap=True, random_state=1992, class_weight=None)

param_grid = [{
    'n_estimators': [10, 20, 30, 40, 50],
}]

grid_search = GridSearchCV(clf_rf, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)
grid_search.fit(X_train, y_train)
cv_results_grid_search = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')

predicted_train_rf = grid_search.predict(X_train)
predicted_test_rf = grid_search.predict(X_test)

print('------------RF grid search: results on the training split------------')
print(accuracy_score(y_train, predicted_train_rf))
print(precision_score(y_train, predicted_train_rf))
print(recall_score(y_train, predicted_train_rf))

print('------------RF grid search: results on the test split------------')
print(accuracy_score(y_test, predicted_test_rf))
print(precision_score(y_test, predicted_test_rf))
print(recall_score(y_test, predicted_test_rf))

# Load the Kaggle test data, predict, and write the submission csv
data_test = pd.read_csv('test.csv')
submission = pd.DataFrame(columns=['PassengerId', 'Survived'])
submission['PassengerId'] = data_test['PassengerId']

# Reuse the pipeline fitted on the training data (transform, not fit_transform),
# so the test data is encoded and scaled consistently with the training data
test_x_cleaned = full_pipeline.transform(data_test)

submission_RF = pd.DataFrame(submission, copy=True)

# Refit the best estimator on the full training set before predicting the submission
grid_search.best_estimator_.fit(data_x_cleaned, data_y)
submission_RF['Survived'] = pd.Series(grid_search.best_estimator_.predict(test_x_cleaned))

# submission_RF['Survived'] = pd.Series(grid_search.predict(test_x_cleaned))

submission_RF.to_csv('submission_RF_n10_test_with_valid.csv', index=False)
```
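As a follow-up (not part of the original post), one might inspect which inputs drive the forest's predictions via feature_importances_. A hedged sketch: the feature-name reconstruction assumes the column order produced by full_pipeline (continuous, discrete, then one-hot categorical columns), and get_feature_names_out requires scikit-learn >= 1.0:

```python
# Map pipeline output columns back to readable names (assumed column order: con, dis, cat)
encoder = dict(full_pipeline.transformer_list)['cat_pipeline'].named_steps['encoder']
feature_names = con_attribs + dis_attribs + list(encoder.get_feature_names_out(cat_attribs))

importances = grid_search.best_estimator_.feature_importances_
for name, score in sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True):
    print(f'{name}: {score:.3f}')
```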
4.1 Result analysis:
As in the previous posts, the RandomForestClassifier with the best parameters was used to predict the Kaggle test set, and the result was uploaded to Kaggle. The results are shown below (note: the "training set" columns refer to the full Kaggle training data, i.e. both the train and test splits used in the code above).
On the training set the random forest does not reach the highest accuracy, yet it performs best on the Kaggle test set, which again reflects its good generalization performance.
| Model | Training accuracy | Training precision | Training recall | Test accuracy (from Kaggle submission) |
| --- | --- | --- | --- | --- |
| Naive Bayes (best) | 0.790 | 0.731 | 0.716 | 0.756 |
| Perceptron | 0.771 | 0.694 | 0.722 | 0.722 |
| Logistic regression | 0.807 | 0.781 | 0.690 | 0.768 |
| Linear SVM | 0.801 | 0.772 | 0.684 | 0.773 |
| RBF-kernel SVM | 0.834 | 0.817 | 0.731 | 0.785 |
| AdaBoost | 0.844 | 0.814 | 0.769 | 0.789 |
| GBDT | 0.843 | 0.877 | 0.687 | 0.778 |
| RF | 0.820 | 0.917 | 0.585 | 0.792 |