随机森林RandomForest&梯度提升决策树GBDT

模型亮点

随机森林，初始测试集上评分为0.53，调参后测试集上评分为0.85
梯度提升决策树，初始测试集上评分为0.56，调参后测试集上评分为0.88

-----------------------------------------以下为模型具体实现-----------------------------------------

Step1.数据读取

import pandas as pd
df=pd.read_csv('bankpep.csv',index_col=0)
df.head()

Step2.数据清洗

why 数据清洗？how 数据清洗？

随机森林不要求归一化
分类数据，用0-1表示
顺序数据（这里严格来说不算），用1、2、3、4表示

# 1）性别：female->1，male->0
df.loc[df['sex']=='FEMALE','sex']=1
df.loc[df['sex']=='MALE','sex']=0

# 2）婚否、车、储蓄行为、目前行为、是否按揭、是否接受提议：yes->1，no->0
col=['married','car','save_act','current_act','mortgage','pep']
for i in col:
    df.loc[df[i]=='YES',i]=1
    df.loc[df[i]=='NO',i]=0

# 3）地区：rural->1，suburan->2，town->3，inner_city->4
df.loc[df['region']=='RURAL','region']=1
df.loc[df['region']=='SUBURBAN','region']=2
df.loc[df['region']=='TOWN','region']=3
df.loc[df['region']=='INNER_CITY','region']=4

Step3.划分训练集和测试集

why 划分训练集和测试集？

把所有样本当作训练集，做过的题都是旧题，都会~
把部分样本当作训练集，在新题上做测试，起到检测效果~

from sklearn.model_selection import train_test_split
x=(df.drop('pep',axis=1)).astype(float)
y=df['pep'].astype(float)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)

Step4.启动随机森林分类器

how 启动分类器？

设定初始化参数
拟合训练集数据

from sklearn.ensemble import RandomForestClassifier
model1=RandomForestClassifier(n_estimators=1,max_depth=1) #初始参数设定
model1.fit(x_train,y_train)

Step5.模型评价-随机森林

what 模型评估？

分类问题，model.score->accuracy_score
回归问题，model.score->r2_score

def train_score(model,x_train,y_train):
    print("训练集上评分：",round(model.score(x_train,y_train),2))
def test_score(model,x_test,y_test):
    print("测试集上评分：",round(model.score(x_test,y_test),2))
print("-----随机森林-----")
train_score(model1,x_train,y_train)
test_score(model1,x_test,y_test)

Step6.优化参数-随机森林

why 优化参数？how 优化参数？

以提高测试集上评分
参数n_estimators、max_depth，对分类效果影响较大

from sklearn.model_selection import RandomizedSearchCV
def randomizedsearch(model,params,x_train,y_train):
    model=RandomizedSearchCV(model,params,n_iter=100,cv=10)
    model.fit(x_train,y_train) #再次用新参数拟合
    print("最优评分：",round(model.best_score_,2))
    print("最优参数：",model.best_params_)
    return model
print("-----随机森林-----")
params1={'n_estimators':range(1,21),'max_depth':range(1,21)} #注意，参数要加引号
model1=randomizedsearch(model1,params1,x_train,y_train)

Step7.模型保存-随机森林

why 保存模型？how 保存模型？

for job-lib（工作自由）~
dump转存模型，以pkl格式
load加载模型

from sklearn.externals import joblib
joblib.dump(model1,'d:\gbdt.pkl')
new_model1=joblib.load('d:\gbdt.pkl')
print("-----随机森林-----")
print("测试集上预测结果：\n",new_model1.predict(x_test))

Step8.启动梯度提升决策树

how 启动决策树？

设定初始化参数
拟合训练集数据

from sklearn.ensemble import GradientBoostingClassifier
model2=GradientBoostingClassifier(learning_rate=0.01,n_estimators=1,max_depth=1) #初始参数设定
model2.fit(x_train,y_train)

Step9.模型评价-梯度提升决策树

what 模型评估？

分类问题，model.score->accuracy_score
回归问题，model.score->r2_score

print("-----梯度提升决策树-----")
train_score(model2,x_train,y_train)
test_score(model2,x_test,y_test)

Step10.优化参数-梯度提升决策树

why 优化参数？how 优化参数？

以提高测试集上评分
参数learning_rate、n_estimators、max_depth，对分类效果影响较大

import numpy as np
params2={'learning_rate':np.arange(0.01,0.3,0.1),'n_estimators':range(1,21),'max_depth':range(1,21)}
model2=randomizedsearch(model2,params2,x_train,y_train)

Step11.模型保存-梯度提升决策树

why 保存模型？how 保存模型？

for job-lib（工作自由）~
dump转存模型，以pkl格式
load加载模型

from sklearn.externals import joblib
joblib.dump(model2,'d:\gbdt.pkl')
new_model2=joblib.load('d:\gbdt.pkl')
print("-----梯度提升决策树-----")
print("测试集上预测结果：\n",new_model2.predict(x_test))

-END

posted @ 2023-06-12 20:50 找回那所有、阅读(213) 评论(0) 收藏举报

刷新页面返回顶部

我的学习记录

Up to now：牛客155、Leetcode121 pass，DS学习中...

随机森林RandomForest&梯度提升决策树GBDT

公告