Extreme Gradient Boosting: XGBoost

XGBoost is, in my opinion, the most versatile machine learning algorithm. Its core idea is a greedy strategy plus optimization (a second-order optimization of the loss), much like a knapsack problem: at every step it greedily picks whatever best improves the loss of the prediction function.
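To unpack "greedy strategy + second-order optimization" a little (a sketch of the standard derivation from the XGBoost paper, added to these notes): at boosting round t the objective is approximated by a second-order Taylor expansion of the loss,

\text{Obj}^{(t)} \approx \sum_i \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t), \qquad g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \quad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})

which yields the optimal leaf weight w_j^* = -G_j / (H_j + \lambda), where G_j and H_j are the sums of g_i and h_i in leaf j, and the split gain

\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma

The greedy part is that each node simply takes the candidate split with the largest Gain; splits whose gain falls below \gamma are not made (see item 3a in section II).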

I. Tree tasks: classification vs. regression

1. Classification: node splits are decided by information gain, information gain ratio, or the Gini index; the result is a class label.

2. Regression: node splits are decided by prediction error, commonly mean squared error or log error; the result is a numeric value.
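For reference, the standard forms of these split criteria (textbook definitions added here, not spelled out in the original notes): for a node D with class proportions p_k,

\text{Gini}(D) = 1 - \sum_k p_k^2, \qquad H(D) = -\sum_k p_k \log_2 p_k, \qquad \text{InfoGain}(D, a) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)

and the gain ratio divides InfoGain by the intrinsic information of the split. For regression, the squared-error criterion picks the split minimizing \sum_i (y_i - \bar{y})^2 within each child.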

 

II. Bagging vs. boosting

1. Both are ensemble learning methods.

2. In bagging the individual learners are trained independently of one another; in boosting each learner depends on the ones built before it.

3. Stopping criteria for growing a tree: a. the gain brought by a candidate split is below a threshold; b. the tree has reached the maximum depth; c. the sum of sample weights in a node falls below a set threshold. (These correspond to the gamma, max_depth and min_child_weight parameters; see the sketch after this list.)

4. Key parameters:

  1. objective [default: reg:linear]
     The loss function to be minimized. binary:logistic is logistic regression for binary classification and returns the predicted probability, not the class.
     multi:softmax is a softmax multi-class classifier and returns the predicted class; in that case you also need to set num_class, the number of classes.

  2. eval_metric [default depends on the value of objective]
     The metric used on the validation data. For regression the default is rmse; for classification it is error.
     Typical values: rmse, root mean squared error; mae, mean absolute error; logloss, negative log-likelihood; error, binary classification error rate; merror, multi-class error rate; mlogloss, multi-class log loss; auc, area under the curve.

  3. seed [default: 0]
     The random seed; setting it makes the randomness reproducible, and it is also useful when tuning parameters.
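A minimal sketch of how these parameters, together with the tree-stopping ones from item 3 (gamma, max_depth, min_child_weight), are typically passed to the native xgb.train API. This snippet is an illustration added to these notes; the toy data is invented purely so it runs on its own:

import numpy as np
import xgboost as xgb

# tiny made-up dataset, only so the example is self-contained
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',  # loss to be minimized; predictions are probabilities
    'eval_metric': 'auc',            # metric reported on the watchlist below
    'seed': 0,                       # fixes the randomness so runs are reproducible
    'gamma': 0.1,                    # minimum split gain (stopping criterion a above)
    'max_depth': 6,                  # maximum tree depth (stopping criterion b)
    'min_child_weight': 1,           # minimum sum of instance weights in a child (criterion c)
}
booster = xgb.train(params, dtrain, num_boost_round=50, evals=[(dtrain, 'train')])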

III. Examples

1. Regression task

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=False, objective='reg:gamma')
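The one-liner above only constructs the model. A minimal end-to-end sketch (added here, not from the original post) using a synthetic dataset from sklearn's make_regression; reg:squarederror is used instead of reg:gamma because make_regression produces negative targets, which gamma regression cannot handle:

import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=1000, n_features=10, noise=0.1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160,
                         objective='reg:squarederror')
model.fit(x_train, y_train)
pred = model.predict(x_test)
print(mean_squared_error(y_test, pred))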

2. Classification task (the example below is binary classification)

a. Using the native interface

# coding: utf-8
"""
The helper below relies on Python's and/or short-circuiting:
`a and b` returns b when a is truthy; `a or b` returns a when a is truthy.
"""
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification


def e(a):
    # maps a predicted probability to a 0/1 label with a 0.5 threshold
    c = a > 0.5 and 1 or 0
    return c


# x holds the features, y the class labels: 10000 samples with 20 features each,
# 2 classes, no redundant features, one cluster per class
x, y = make_classification(n_samples=10000, n_features=20, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, flip_y=0.1)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
xgbTrain = xgb.DMatrix(x_train, y_train)
xgbTest = xgb.DMatrix(x_test, y_test)
evallist = [(xgbTrain, 'train'), (xgbTest, 'eval')]
param = {'max_depth': 5, 'eta': 0.5, 'verbosity': 1, 'objective': 'binary:logistic'}
raw_model = xgb.train(param, xgbTrain, num_boost_round=20, evals=evallist)
pred = raw_model.predict(xgbTest)
print(pred)  # [0.13218588 0.98173666 0.85143584 ... 0.2700139 0.04027747 0.10013721]
res = [e(p) for p in pred]
print(accuracy_score(xgbTest.get_label(), res))
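Since binary:logistic makes predict return probabilities, the helper e above can also be replaced by a vectorized numpy one-liner (an equivalent alternative, not part of the original snippet):

import numpy as np
res = np.where(pred > 0.5, 1, 0)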

b. Using the sklearn interface
# coding: utf-8
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, KFold
import xgboost as xgb
from xgboost import plot_importance

kf = KFold(n_splits=5, shuffle=True, random_state=10)
x, y = make_classification(n_samples=10000, n_features=20, n_redundant=0,
                           n_clusters_per_class=1, n_classes=2, flip_y=0.1)

# KFold is only shown here for illustration; the model below is trained on a
# plain train_test_split rather than on the folds
for train_index, test_index in kf.split(x):
    print('train indices: {}, test indices: {}'.format(train_index, test_index))

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

model = xgb.XGBClassifier(max_depth=5, learning_rate=0.5, random_state=0)
model.fit(x_train, y_train, early_stopping_rounds=10, eval_metric="error",
          eval_set=[(x_test, y_test)])
pred = model.predict(x_test)
importances = model.feature_importances_
# keep only the features whose importance is at least 0.8 * the mean importance
selector = SelectFromModel(model, threshold='0.8*mean', prefit=True)
# print(importances)
# print(accuracy_score(y_test, pred))
# print((pred == y_test).mean())
# plot_importance(model)
# pyplot.show()
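The SelectFromModel object built at the end is never applied in the snippet; a minimal follow-up (assuming the selector and x_train variables from above) reduces the feature matrix to the selected columns:

x_train_selected = selector.transform(x_train)
print(x_train.shape, '->', x_train_selected.shape)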

3. Multi-label task

# coding: utf-8
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_multilabel_classification
from xgboost import XGBClassifier

# 10000 samples, 40 possible classes, about 10 labels per sample
x, y = make_multilabel_classification(n_samples=10000, n_labels=10, n_classes=40)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
# one binary XGBoost classifier per label, trained in parallel
clf_multilabel = OneVsRestClassifier(XGBClassifier(max_depth=7), n_jobs=-1)
clf_multilabel.fit(x_train, y_train)

pred = clf_multilabel.predict(x_test)
proba = clf_multilabel.predict_proba(x_test)
print(y_test[:5])
print(pred[:5])
print(proba[:5])
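A note on the design choice: XGBClassifier handles multi-class but not multi-label targets directly (at least at the time of writing), so the multi-label data is handled by wrapping it in OneVsRestClassifier, which fits one independent binary XGBoost model per label, 40 models here, trained in parallel with n_jobs=-1. That is why predict returns a 0/1 indicator matrix with one column per class.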

4. Ranking task

XGBoost supports ranking natively; just set the objective in the model parameters to objective="rank:pairwise".

# coding: utf-8
# Learning to rank with XGBoost
"""
Input: several groups of items, each group with relevance labels
Output: a ranking model
"""
import numpy as np
from matplotlib import pyplot
from xgboost import DMatrix, train, plot_importance

xgb_rank_params1 = {
    'booster': 'gbtree',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'objective': 'rank:pairwise',
    'eval_metric': 'merror',
    'max_depth': 6,
    'num_boost_round': 10,
    'save_period': 0
}

xgb_rank_params2 = {
    'bst:max_depth': 2,
    'bst:eta': 1,
    'silent': 1,
    'objective': 'rank:pairwise',
    'nthread': 4,
    'eval_metric': 'ndcg'
}

# generate the training dataset
# 2 groups * 3 items per group = 6 samples, each with 2 features
n_group = 2
n_choice = 3
dtrain = np.random.uniform(0, 100, [n_group * n_choice, 2])
# print(dtrain)
"""
[[78.63843728 29.82138273]
 [46.52707336 46.01836555]
 [39.06578247 34.48339766]
 [ 4.44780208 25.41001231]
 [12.8137292  82.19043481]
 [92.84283164 99.34520591]]
"""
# numpy.random.choice(a, size=None, replace=True, p=None)
# each group gets the relevance labels 0, 1, 2 in a random order
dtarget = np.array([np.random.choice([0, 1, 2], 3, False) for i in range(n_group)]).flatten()
# print(dtarget)  # [1 2 0 1 2 0]
# the group array gives the number of samples in each consecutive group, assuming the
# groups are contiguous: [3, 3] means the first 3 of the 6 samples form group one and
# the last 3 form group two
dgroup = np.array([n_choice for i in range(n_group)]).flatten()  # e.g. [[1,2], [3,4]] => [1, 2, 3, 4]

# attach the group information to the training DMatrix; this is essential for ranking
xgbTrain = DMatrix(dtrain, label=dtarget)
xgbTrain.set_group(dgroup)

# generate evaluation data the same way
dtrain_eval = np.random.uniform(0, 100, [n_group * n_choice, 2])
xgbTrain_eval = DMatrix(dtrain_eval, label=dtarget)
xgbTrain_eval.set_group(dgroup)
evallist = [(xgbTrain, 'train'), (xgbTrain_eval, 'eval')]

# train the model
# passing evals together with xgb_rank_params1 raises an error; the cause has not been found yet
# rankModel = train(xgb_rank_params1, xgbTrain, num_boost_round=10)
rankModel = train(xgb_rank_params2, xgbTrain, num_boost_round=20, evals=evallist)

# test dataset

dtest = np.random.uniform(0, 100, [n_group * n_choice, 2])
dtestgroup = np.array([n_choice for i in range(n_group)]).flatten()
xgbTest = DMatrix(dtest)
xgbTest.set_group(dtestgroup)

# predict: one relevance score per row
res = rankModel.predict(xgbTest)
print(np.argsort(res))
plot_importance(rankModel)
pyplot.show()
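Note that np.argsort over the whole prediction array above mixes the two groups together. A minimal per-group ranking sketch, reusing res, n_group and n_choice from the snippet (added here, not in the original):

for g in range(n_group):
    scores = res[g * n_choice:(g + 1) * n_choice]
    # argsort of the negated scores orders this group's items from best to worst
    print('group', g, 'order:', np.argsort(-scores))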
