Extreme Gradient Boosting: XGBoost

XGBoost is, in my opinion, the most versatile machine learning algorithm. Its core idea is a greedy strategy plus optimization (a second-order optimization of the loss), much like a knapsack problem: at every step it greedily picks whatever best improves the loss of the prediction function.
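To unpack "greedy strategy + second-order optimization" a little (a sketch of the standard derivation from the XGBoost paper, added to these notes): at boosting round t the objective is approximated by a second-order Taylor expansion of the loss,

\text{Obj}^{(t)} \approx \sum_i \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

which yields the optimal leaf weight w_j^* = -G_j / (H_j + \lambda), where G_j and H_j are the sums of g_i and h_i in leaf j, and the split gain

\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma

The greedy part is that each node simply takes the candidate split with the largest Gain; splits whose gain falls below \gamma are not made (see item 3a in section II).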

I. Tree tasks: classification vs. regression

1. Classification: node splits are decided by information gain, information gain ratio, or the Gini index; the result is a class label.

2. Regression: node splits are decided by prediction error, commonly mean squared error or log error; the result is a numeric value.
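For reference, the standard forms of these split criteria (textbook definitions added here, not spelled out in the original notes): for a node D with class proportions p_k,

\text{Gini}(D) = 1 - \sum_k p_k^2, \qquad H(D) = -\sum_k p_k \log_2 p_k, \qquad \text{InfoGain}(D, a) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)

and the gain ratio divides InfoGain by the intrinsic information of the split. For regression, the squared-error criterion picks the split minimizing \sum_i (y_i - \bar{y})^2 within each child.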

 

II. Bagging vs. boosting

1. Both are ensemble learning methods.

2. In bagging the individual learners are trained independently of one another; in boosting each learner depends on the ones built before it.

3. Stopping criteria for growing a tree: a. the gain brought by a candidate split is below a threshold; b. the tree has reached the maximum depth; c. the sum of sample weights in a node falls below a set threshold. (These correspond to the gamma, max_depth and min_child_weight parameters; see the sketch after this list.)

4. Key parameters:

  1. objective [default: reg:linear]
     The loss function to be minimized. binary:logistic is logistic regression for binary classification and returns the predicted probability, not the class.
     multi:softmax is a softmax multi-class classifier and returns the predicted class; in that case you also need to set num_class, the number of classes.

  2. eval_metric [default depends on the value of objective]
     The metric used on the validation data. For regression the default is rmse; for classification it is error.
     Typical values: rmse, root mean squared error; mae, mean absolute error; logloss, negative log-likelihood; error, binary classification error rate; merror, multi-class error rate; mlogloss, multi-class log loss; auc, area under the curve.

  3. seed [default: 0]
     The random seed; setting it makes the randomness reproducible, and it is also useful when tuning parameters.
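A minimal sketch of how these parameters, together with the tree-stopping ones from item 3 (gamma, max_depth, min_child_weight), are typically passed to the native xgb.train API. This snippet is an illustration added to these notes; the toy data is invented purely so it runs on its own:

import numpy as np
import xgboost as xgb

# tiny made-up dataset, only so the example is self-contained
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',  # loss to be minimized; predictions are probabilities
    'eval_metric': 'auc',            # metric reported on the watchlist below
    'seed': 0,                       # fixes the randomness so runs are reproducible
    'gamma': 0.1,                    # minimum split gain (stopping criterion a above)
    'max_depth': 6,                  # maximum tree depth (stopping criterion b)
    'min_child_weight': 1,           # minimum sum of instance weights in a child (criterion c)
}
booster = xgb.train(params, dtrain, num_boost_round=50, evals=[(dtrain, 'train')])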

III. Examples

1. Regression task

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=False, objective='reg:gamma')
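The one-liner above only constructs the model. A minimal end-to-end sketch (added here, not from the original post) using a synthetic dataset from sklearn's make_regression; reg:squarederror is used instead of reg:gamma because make_regression produces negative targets, which gamma regression cannot handle:

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=1000, n_features=10, noise=0.1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160,
                         objective='reg:squarederror')
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(mean_squared_error(y_test, pred))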

2. Classification task (the example below is binary classification)

a. Using the native interface

# coding: utf-8
"""
The helper below relies on Python's and/or short-circuiting:
`a and b` returns b when a is truthy; `a or b` returns a when a is truthy.
"""
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification


def e(a):
    # maps a predicted probability to a 0/1 label with a 0.5 threshold
    c = a > 0.5 and 1 or 0
    return c


# x holds the features, y the class labels: 10000 samples with 20 features each,
# 2 classes, no redundant features, one cluster per class
x, y = make_classification(n_samples=10000, n_features=20, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, flip_y=0.1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
xgbTrain = xgb.DMatrix(x_train, y_train)
xgbTest = xgb.DMatrix(x_test, y_test)
evallist = [(xgbTrain, 'train'), (xgbTest, 'eval')]
param = {'max_depth': 5, 'eta': 0.5, 'verbosity': 1, 'objective': 'binary:logistic'}
raw_model = xgb.train(param, xgbTrain, num_boost_round=20, evals=evallist)
pred = raw_model.predict(xgbTest)
print(pred)  # [0.13218588 0.98173666 0.85143584 ... 0.2700139 0.04027747 0.10013721]
res = [e(p) for p in pred]
print(accuracy_score(xgbTest.get_label(), res))
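Since binary:logistic makes predict return probabilities, the helper e above can also be replaced by a vectorized numpy one-liner (an equivalent alternative, not part of the original snippet):

import numpy as np
res = np.where(pred > 0.5, 1, 0)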

b. Using the sklearn interface
# coding: utf-8
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, KFold
import xgboost as xgb
from xgboost import plot_importance

kf = KFold(n_splits=5, shuffle=True, random_state=10)
x, y = make_classification(n_samples=10000, n_features=20, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, flip_y=0.1)

# KFold is only shown here for illustration; the model below is trained on a
# plain train_test_split rather than on the folds
for train_index, test_index in kf.split(x):
    print('train indices: {}, test indices: {}'.format(train_index, test_index))

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = xgb.XGBClassifier(max_depth=5, learning_rate=0.5, random_state=0)
model.fit(x_train, y_train, early_stopping_rounds=10, eval_metric="error",
          eval_set=[(x_test, y_test)])
pred = model.predict(x_test)
importances = model.feature_importances_
# keep only the features whose importance is at least 0.8 * the mean importance
selector = SelectFromModel(model, threshold='0.8*mean', prefit=True)
# print(importances)
# print(accuracy_score(y_test, pred))
# print((pred == y_test).mean())
# plot_importance(model)
# pyplot.show()
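The SelectFromModel object built at the end is never applied in the snippet; a minimal follow-up (assuming the selector and x_train variables from above) reduces the feature matrix to the selected columns:

x_train_selected = selector.transform(x_train)
print(x_train.shape, '->', x_train_selected.shape)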

3. Multi-label task

# coding: utf-8
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_multilabel_classification
from xgboost import XGBClassifier

# 10000 samples, 40 possible classes, about 10 labels per sample
x, y = make_multilabel_classification(n_samples=10000, n_labels=10, n_classes=40)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# one binary XGBoost classifier per label, trained in parallel
clf_multilabel = OneVsRestClassifier(XGBClassifier(max_depth=7), n_jobs=-1)
clf_multilabel.fit(x_train, y_train)

pred = clf_multilabel.predict(x_test)
proba = clf_multilabel.predict_proba(x_test)
print(y_test[:5])
print(pred[:5])
print(proba[:5])
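A note on the design choice: XGBClassifier handles multi-class but not multi-label targets directly (at least at the time of writing), so the multi-label data is handled by wrapping it in OneVsRestClassifier, which fits one independent binary XGBoost model per label, 40 models here, trained in parallel with n_jobs=-1. That is why predict returns a 0/1 indicator matrix with one column per class.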

4. Ranking task

XGBoost supports ranking natively; just set the objective in the model parameters to objective="rank:pairwise".

# coding: utf-8
# Learning to rank with XGBoost
"""
Input: several groups of items, each group with relevance labels
Output: a ranking model
"""
import numpy as np
from matplotlib import pyplot
from xgboost import DMatrix, train, plot_importance

xgb_rank_params1 = {
    'booster': 'gbtree',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'objective': 'rank:pairwise',
    'eval_metric': 'merror',
    'max_depth': 6,
    'num_boost_round': 10,
    'save_period': 0
}

xgb_rank_params2 = {
    'bst:max_depth': 2,
    'bst:eta': 1,
    'silent': 1,
    'objective': 'rank:pairwise',
    'nthread': 4,
    'eval_metric': 'ndcg'
}

# generate the training dataset
# 2 groups * 3 items per group = 6 samples, each with 2 features
n_group = 2
n_choice = 3
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
# print(dtrain)
"""
[[78.63843728 29.82138273]
 [46.52707336 46.01836555]
 [39.06578247 34.48339766]
 [ 4.44780208 25.41001231]
 [12.8137292  82.19043481]
 [92.84283164 99.34520591]]
"""
# numpy.random.choice(a, size=None, replace=True, p=None)
# each group gets the relevance labels 0, 1, 2 in a random order
dtarget = np.array([np.random.choice([0, 1, 2], 3, False) for i in range(n_group)]).flatten()
# print(dtarget)  # [1 2 0 1 2 0]
# the group array gives the number of samples in each consecutive group, assuming the
# groups are contiguous: [3, 3] means the first 3 of the 6 samples form group one and
# the last 3 form group two
dgroup = np.array([n_choice for i in range(n_group)]).flatten()  # e.g. [[1,2], [3,4]] => [1, 2, 3, 4]

# attach the group information to the training DMatrix; this is essential for ranking
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(dgroup)

# generate evaluation data the same way
dtrain_eval = np.random.uniform(0, 100, [n_group * n_choice, 2])
xgbTrain_eval = DMatrix(dtrain_eval, label=dtarget)
xgbTrain_eval.set_group(dgroup)
evallist = [(xgbTrain, 'train'), (xgbTrain_eval, 'eval')]

# train the model
# passing evals together with xgb_rank_params1 raises an error; the cause has not been found yet
# rankModel = train(xgb_rank_params1, xgbTrain, num_boost_round=10)
rankModel = train(xgb_rank_params2, xgbTrain, num_boost_round=20, evals=evallist)

# test dataset

dtest = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtestgroup = np.array([n_choice for i in range(n_group)]).flatten()
xgbTest = DMatrix(dtest)
xgbTest.set_group(dtestgroup)

# predict: one relevance score per row
res = rankModel.predict(xgbTest)
print(np.argsort(res))
plot_importance(rankModel)
pyplot.show()
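Note that np.argsort over the whole prediction array above mixes the two groups together. A minimal per-group ranking sketch, reusing res, n_group and n_choice from the snippet (added here, not in the original):

for g in range(n_group):
    scores = res[g * n_choice:(g + 1) * n_choice]
    # argsort of the negated scores orders this group's items from best to worst
    print('group', g, 'order:', np.argsort(-scores))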
