[Machine Learning in Practice] -- Titanic Dataset (8) -- Boosting (XGBoost)

1. Preface:

This post belongs to the practice series, so it focuses on applying the algorithm in an actual project. For a deeper understanding of the XGBoost algorithm itself, the following links were a great help during my own study:

[1] XGBoost算法原理小结 (a summary of XGBoost principles, in Chinese): https://www.cnblogs.com/pinard/p/10979808.html

[2] XGBoost Parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

[3] XGBoost Python API Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

 

2. Dataset:

Dataset: https://www.kaggle.com/c/titanic

The Titanic dataset is one of the most popular competitions on Kaggle. The data itself is small and simple, which makes it a good entry point for beginners and a convenient benchmark for comparing machine learning algorithms in depth.

The dataset contains 11 variables: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked. The task is to use them to predict whether a passenger survived the Titanic disaster.

 

3. Algorithm overview:

XGBoost is essentially an optimized GBDT. GBDT uses a first-order Taylor expansion of the loss and fits the negative gradient, whereas XGBoost expands the loss function to second order and, in one step, solves for all $J$ optimal leaf regions $R_{tj}$ of the tree together with the optimal value $c_{tj}$ of each leaf region.
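For completeness, here is a sketch of that second-order step (standard notation, following reference [1]; $J$, $R_{tj}$, $c_{tj}$ as above). At round $t$ the objective is approximated by

$$L^{(t)} \approx \sum_{i=1}^{n}\left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \gamma J + \frac{\lambda}{2}\sum_{j=1}^{J} w_j^2,\qquad g_i = \frac{\partial l(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}},\quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}.$$

For a fixed tree structure this is a quadratic in each leaf weight, so the optimal value of leaf $j$ (the $c_{tj}$ above) has the closed form

$$c_{tj} = -\frac{\sum_{x_i \in R_{tj}} g_i}{\sum_{x_i \in R_{tj}} h_i + \lambda},$$

and substituting it back gives the structure score that XGBoost uses to grow the tree greedily.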

 

4. Practice:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, cross_validate
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, auc
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import class_weight

import xgboost as xgb


# Select a subset of DataFrame columns and return them as a NumPy array,
# so each pipeline below only sees the attributes it is responsible for.
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# Load data
data = pd.read_csv('train.csv')

data_x = data.drop('Survived', axis=1)
data_y = data['Survived']

# Data cleaning: categorical, discrete and continuous attributes get separate pipelines
cat_attribs = ['Pclass', 'Sex', 'Embarked']
dis_attribs = ['SibSp', 'Parch']
con_attribs = ['Age', 'Fare']

# encoder: OneHotEncoder() or OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

data_x_cleaned = full_pipeline.fit_transform(data_x)

X_train, X_test, y_train, y_test = train_test_split(data_x_cleaned, data_y, stratify=data_y, test_size=0.25, random_state=1992)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# XGBoost
xgb_cla = xgb.XGBClassifier(use_label_encoder=False, verbosity=1, objective='binary:logistic', random_state=1992)
# Balanced class weights; the ratio cls_wt[1] / cls_wt[0] equals n_negative / n_positive
# and could be used as scale_pos_weight (see the sketch after this block).
cls_wt = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

param_grid = [{
    'learning_rate': [0.5],
    'n_estimators': [5],
    'max_depth': [5],
    'min_child_weight': [5],
    'gamma': [5],
    'scale_pos_weight': [1],
    'subsample': [0.8],
}]

grid_search = GridSearchCV(xgb_cla, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)
grid_search.fit(X_train, y_train)
cv_results_grid_search = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')

predicted_train_xgb = grid_search.predict(X_train)
predicted_test_xgb = grid_search.predict(X_test)

print('------------XGBoost grid_search: Results of train------------')
print(accuracy_score(y_train, predicted_train_xgb))
print(precision_score(y_train, predicted_train_xgb))
print(recall_score(y_train, predicted_train_xgb))

print('------------XGBoost grid search: Results of test------------')
print(accuracy_score(y_test, predicted_test_xgb))
print(precision_score(y_test, predicted_test_xgb))
print(recall_score(y_test, predicted_test_xgb))
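A side note on the cls_wt computed above: compute_class_weight('balanced', ...) returns weights inversely proportional to class frequency, so cls_wt[1] / cls_wt[0] equals n_negative / n_positive (about 1.6 for Titanic). The grid above keeps scale_pos_weight at 1; a hedged sketch of driving it from cls_wt instead (reusing the names defined above):

# Sketch: derive scale_pos_weight from the balanced class weights computed above.
param_grid_weighted = [{
    'learning_rate': [0.5],
    'n_estimators': [5],
    'max_depth': [5],
    'min_child_weight': [5],
    'gamma': [5],
    'scale_pos_weight': [cls_wt[1] / cls_wt[0]],  # = n_negative / n_positive
    'subsample': [0.8],
}]
grid_search_weighted = GridSearchCV(xgb_cla, param_grid=param_grid_weighted, cv=cv,
                                    scoring='accuracy', n_jobs=-1, return_train_score=True)
grid_search_weighted.fit(X_train, y_train)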

 

4.1 Custom eval_metric in xgboost:

As with AdaBoost and GBDT, I also wanted to see how the staged strong learner performs on the training and validation sets at each boosting round. Since xgboost has no direct staged_predict method, this requires the eval_metric parameter to specify an evaluation metric. A built-in option is "error"; here I also wrote my own "accuracy" eval metric, which can be compared directly against "error". In doing so I ran into a pitfall: we set the objective to "binary:logistic", and according to the official docs, "binary:logistic: logistic regression for binary classification, output probability". In fact, however, it only outputs a probability when the predict method is called; the so-called "y_predicted" fed into a custom eval_metric is the raw $f(x)$, i.e. the untransformed margin, so the eval_metric definition needs to handle this carefully.
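A minimal sketch to verify this behaviour (reusing the fitted xgb_cla2 defined in the block below; output_margin=True asks the sklearn-API predict for the raw scores): the probability is just the sigmoid of the raw margin, so thresholding the margin at 0 is equivalent to thresholding the probability at 0.5.

# Sketch: check that predict_proba is the sigmoid of the raw margin f(x).
margins = xgb_cla2.predict(X_train, output_margin=True)  # raw, untransformed f(x)
proba = xgb_cla2.predict_proba(X_train)[:, 1]            # sigmoid(f(x))
print(np.allclose(proba, 1.0 / (1.0 + np.exp(-margins))))  # expected: True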

 

# Custom eval metrics. Following the xgboost feval convention, the first argument
# holds the predicted values (here: raw margins, see above) and the second is the
# DMatrix, from which the true labels are read via get_label().
def metric_precision(y_predicted, y_true):
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]  # margin > 0 <=> proba > 0.5

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    precision = precision_score(label, predicted_binary)
    return 'Precision', precision


def metric_accuracy(y_predicted, y_true):
    label = y_true.get_label()
    # predicted_binary = np.round(y_predicted)
    # y_predicted = 1 / (1 + np.exp(-y_predicted))
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    accuracy = accuracy_score(label, predicted_binary)
    return 'Accuracy', accuracy


def metric_recall(y_predicted, y_true):
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    recall = recall_score(label, predicted_binary)
    return 'Recall', recall


xgb_cla2 = grid_search.best_estimator_
temp2 = xgb_cla2.predict(X_train)            # class labels
temp2_pro = xgb_cla2.predict_proba(X_train)  # probabilities
# Built-in metric first, then the custom one, for comparison
xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=['error'])

xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=metric_accuracy)

# validation_0 is the first entry of eval_set (train), validation_1 the second (test)
plt.figure()
plt.plot(xgb_cla2.evals_result()['validation_1']['Accuracy'], label='test')
plt.plot(xgb_cla2.evals_result()['validation_0']['Accuracy'], label='train')

plt.legend()
plt.grid(axis='both')

On the training set, grid search found n_estimators (sklearn API) / num_boost_round (learning API) of 5 to be optimal. The plot shows the test set also peaks at 5: beyond that point, test accuracy stops improving while train accuracy keeps climbing, i.e. the model gradually overfits.
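A possible alternative to fixing n_estimators by hand (a sketch under the same split, not part of the original workflow; n_estimators=200 and early_stopping_rounds=10 are my assumptions, and in a real run a separate validation split would be cleaner than reusing X_test) is to let XGBoost stop automatically:

# Sketch: early stopping instead of a fixed n_estimators; other parameters
# reuse the grid values above.
xgb_es = xgb.XGBClassifier(use_label_encoder=False, objective='binary:logistic',
                           learning_rate=0.5, n_estimators=200, max_depth=5,
                           min_child_weight=5, gamma=5, subsample=0.8, random_state=1992)
xgb_es.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric='error',
           early_stopping_rounds=10, verbose=False)
print(xgb_es.best_iteration)  # best boosting round found on the eval set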

4.2 Result analysis:

As in the previous posts, the best XGBoost parameters are applied to the prediction set and the result uploaded to Kaggle; the results are below (note: "train set" here means the entire training data, i.e. both the train and test splits from the code above):

Its performance is on par with AdaBoost and GBDT, and slightly worse than RF (random forest).

 

Model               | Train accuracy | Train precision | Train recall | Prediction accuracy (via Kaggle upload)
--------------------|----------------|-----------------|--------------|----------------------------------------
Naive Bayes (best)  | 0.790          | 0.731           | 0.716        | 0.756
Perceptron          | 0.771          | 0.694           | 0.722        | 0.722
Logistic regression | 0.807          | 0.781           | 0.690        | 0.768
Linear SVM          | 0.801          | 0.772           | 0.684        | 0.773
RBF-kernel SVM      | 0.834          | 0.817           | 0.731        | 0.785
AdaBoost            | 0.844          | 0.814           | 0.769        | 0.789
GBDT                | 0.843          | 0.877           | 0.687        | 0.778
RF                  | 0.820          | 0.917           | 0.585        | 0.792
XGBoost             | 0.831          | 0.847           | 0.681        | 0.780