Machine Learning: Regression and Classification 4-2 (Random Forest Algorithm)

Predicting German credit risk with a random forest

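Main steps:

  • 1. Import packages
  • 2. Load the dataset
  • 3. Data preprocessing
    • 3.1 Detect and handle missing values
    • 3.2 Encode categorical variables
    • 3.3 Extract the features and the target
    • 3.4 Split into training and test sets
    • 3.5 Feature scaling
  • 4. Build random forest models with different parameters
    • 4.1 Model 1: build a random forest model
      • 4.1.1 Build the model
      • 4.1.2 Predict on the test set
      • 4.1.3 Evaluate model performance
      • 4.1.4 Feature importance ranking
    • 4.2 Model 2: build a random forest model
    • 4.3 Model 3: build a random forest model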

1. Import packages

In [1]:
# Import packages
import numpy as np
import pandas as pd
 

2. Load the dataset

In [2]:
# Load the dataset
data = pd.read_csv("german_credit_data.csv")
data
Out[2]:
    | NO. | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose | Risk
0 | 0 | 67 | male | 2 | own | NaN | little | 1169 | 6 | radio/TV | good
1 | 1 | 22 | female | 2 | own | little | moderate | 5951 | 48 | radio/TV | bad
2 | 2 | 49 | male | 1 | own | little | NaN | 2096 | 12 | education | good
3 | 3 | 45 | male | 2 | free | little | little | 7882 | 42 | furniture/equipment | good
4 | 4 | 53 | male | 2 | free | little | little | 4870 | 24 | car | bad
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 995 | 31 | female | 1 | own | little | NaN | 1736 | 12 | furniture/equipment | good
996 | 996 | 40 | male | 3 | own | little | little | 3857 | 30 | car | good
997 | 997 | 38 | male | 2 | own | little | NaN | 804 | 12 | radio/TV | good
998 | 998 | 23 | male | 2 | free | little | little | 1845 | 45 | radio/TV | bad
999 | 999 | 27 | male | 2 | own | moderate | moderate | 4576 | 45 | car | good

1000 rows × 11 columns

 

3. Data preprocessing

3.1 Detect and handle missing values

In [3]:
# Count the missing values in each column
null_df = data.isnull().sum()
null_df
Out[3]:
NO.                   0
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64
In [4]:
# Handle the two fields Saving accounts and Checking account
for col in ['Saving accounts', 'Checking account']:  # fill in the missing values
    data[col] = data[col].fillna('none')  # 'none' indicates that the person has no such bank account
In [5]:
# Re-check for missing values
null_df = data.isnull().sum()
null_df
Out[5]:
NO.                 0
Age                 0
Sex                 0
Job                 0
Housing             0
Saving accounts     0
Checking account    0
Credit amount       0
Duration            0
Purpose             0
Risk                0
dtype: int64
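
As a small illustration of the fill step above, here is a minimal sketch on a made-up three-row frame (the values are invented, not taken from german_credit_data.csv). It assigns the result of fillna back to the column, which avoids depending on in-place modification of a column selection.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two account columns (values are made up).
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich"],
    "Checking account": [np.nan, "moderate", np.nan],
})

# Treat "missing" as its own category by filling NaN with the string 'none'.
for col in ["Saving accounts", "Checking account"]:
    toy[col] = toy[col].fillna("none")

# Total number of missing cells left in the frame.
remaining_missing = int(toy.isnull().sum().sum())
print(toy)
print(remaining_missing)
```

Keeping 'none' as a separate category lets the later one-hot encoding step produce a dedicated dummy column for it.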

3.2 Encode categorical variables

In [6]:
# Inspect the column dtypes: Job is stored as int64 but is really a category code
print(data.dtypes)
NO.                  int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object
In [7]:
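# Cast Job to object so that get_dummies treats it as a categorical variable
data['Job'] = data['Job'].astype('object')
In [8]:
# One-hot encode the categorical variables; drop_first removes one redundant dummy column per variable
data = pd.get_dummies(data, drop_first=True, dtype=int)  # dtype=int gives 0/1 columns on recent pandas versions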
data
Out[8]:
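    | NO. | Age | Credit amount | Duration | Sex_male | Job_1 | Job_2 | Job_3 | Housing_own | Housing_rent | ... | Checking account_none | Checking account_rich | Purpose_car | Purpose_domestic appliances | Purpose_education | Purpose_furniture/equipment | Purpose_radio/TV | Purpose_repairs | Purpose_vacation/others | Risk_good
0 | 0 | 67 | 1169 | 6 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1
1 | 1 | 22 | 5951 | 48 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
2 | 2 | 49 | 2096 | 12 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1
3 | 3 | 45 | 7882 | 42 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
4 | 4 | 53 | 4870 | 24 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
995 | 995 | 31 | 1736 | 12 | 0 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
996 | 996 | 40 | 3857 | 30 | 1 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1
997 | 997 | 38 | 804 | 12 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1
998 | 998 | 23 | 1845 | 45 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
999 | 999 | 27 | 4576 | 45 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1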

1000 rows × 25 columns
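
To make the effect of drop_first concrete, the following sketch encodes a made-up Housing column with three categories. The toy values are assumptions for illustration only; the point is that the alphabetically first category ('free') gets no column of its own and is represented by all of the remaining dummies being 0.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up).
toy = pd.DataFrame({
    "Housing": ["own", "free", "rent", "own"],
    "Age": [67, 22, 49, 45],
})

# drop_first=True drops the first category ('free'); dtype=int yields 0/1 columns.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
dummy_columns = list(encoded.columns)

print(dummy_columns)
print(encoded)
```

Numeric columns are passed through unchanged and appear before the generated dummy columns, which matches the column order of the encoded credit data above.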

3.3 Extract the features and the target

In [9]:
# Extract the target (y) and the feature matrix (x)
y = data['Risk_good'].values
data = data.drop(['Risk_good'], axis = 1)
x = data.values

3.4 Split into training and test sets

In [10]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(800, 24)
(200, 24)
(800,)
(200,)
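
The split above is purely random, so the class ratio in the test set can drift slightly from the ratio in the full data. As an optional variation (not what this tutorial uses), train_test_split accepts a stratify argument that preserves the ratio. The sketch below uses made-up labels, assumed to be 70% positive, rather than the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 700 ones and 300 zeros, with a dummy single-column feature matrix.
y_demo = np.array([1] * 700 + [0] * 300)
x_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the 70/30 class ratio in both parts of the split.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

test_positive_share = y_te.mean()
print(x_tr.shape, x_te.shape)
print(test_positive_share)
```

Switching the tutorial to a stratified split would change the test set and therefore every metric reported below, so the numbers in this post apply only to the unstratified split.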

3.5 Feature scaling

In [11]:
# Feature scaling (tree-based models do not require it, but it does no harm here)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

 

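4. Build random forest models with different parameters

4.1 Model 1: build a random forest model

4.1.1 Build the model

In [12]:
# Build random forest models with different parameters
# Model 1: random forest (max_depth=9, max_features='sqrt', min_samples_leaf=5, n_estimators=50)
# max_features='sqrt' is what the older 'auto' setting meant for classifiers; 'auto' is no longer accepted by recent scikit-learn releases
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=9, max_features='sqrt', min_samples_leaf=5, n_estimators=50, random_state = 1)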
classifier.fit(x_train, y_train)
Out[12]:
RandomForestClassifier(max_depth=9, min_samples_leaf=5, n_estimators=50,
                       random_state=1)

4.1.2 Predict on the test set

In [13]:
# Predict on the test set
y_pred = classifier.predict(x_test)

4.1.3 Evaluate model performance

In [14]:
# Evaluate model performance
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))
0.75
In [15]:
print(confusion_matrix(y_test, y_pred))
[[ 18  41]
 [  9 132]]
In [16]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.67      0.31      0.42        59
           1       0.76      0.94      0.84       141

    accuracy                           0.75       200
   macro avg       0.71      0.62      0.63       200
weighted avg       0.73      0.75      0.72       200
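
The figures in the classification report can be re-derived from the confusion matrix printed above, which helps when reading the report. The sketch below only does the arithmetic on that matrix; rows are the true classes and columns are the predicted classes, following the scikit-learn convention.

```python
import numpy as np

# Confusion matrix of Model 1 as printed above (rows: true class, columns: predicted class).
cm = np.array([[18, 41],
               [9, 132]])

accuracy = np.trace(cm) / cm.sum()            # (18 + 132) / 200
precision = np.diag(cm) / cm.sum(axis=0)      # correct predictions / all predictions of that class
recall = np.diag(cm) / cm.sum(axis=1)         # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

The low recall for class 0 (18 of 59) shows that most bad-risk customers are still classified as good, even though overall accuracy is 0.75.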

4.1.4 Feature importance ranking

In [17]:
# Get the feature importances and plot them by feature index
importance = classifier.feature_importances_
import matplotlib.pyplot as plt
plt.bar(range(len(importance)), importance)
plt.show()
In [20]:
data.columns[2]
Out[20]:
'Credit amount'

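The bar chart of feature importances shows that the features at indices 0, 1, 2, 3, and 15 have the largest influence on the target. By the column order of data, indices 0 to 3 are NO., Age, Credit amount, and Duration (data.columns[2] above confirms index 2). Note that NO. is only a row identifier, so its apparent importance likely reflects noise rather than real signal, and it is worth dropping before modeling. Feature selection may further improve model performance; it is covered in a later chapter.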
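
Reading feature indices off a bar chart is error-prone, so it can help to pair each importance with its column name and sort. The following is a minimal sketch on synthetic data from make_classification; the feature names feature_0 ... feature_5 are hypothetical placeholders, not columns of the credit dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names are hypothetical placeholders.
x_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(x_demo.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(x_demo, y_demo)

# Pair each importance with its name and sort from most to least important.
ranked = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In this tutorial the same idea would presumably be pd.Series(classifier.feature_importances_, index=data.columns), since data holds the 24 feature columns after Risk_good is dropped.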

4.2 Model 2: build a random forest model

In [21]:
# Model 2: random forest (max_depth=3, max_features='sqrt', min_samples_leaf=50, n_estimators=100)
# max_features='sqrt' is what the older 'auto' setting meant for classifiers
classifier = RandomForestClassifier(max_depth=3, max_features='sqrt', min_samples_leaf=50, n_estimators=100, random_state = 1)
classifier.fit(x_train, y_train)
Out[21]:
RandomForestClassifier(max_depth=3, min_samples_leaf=50, random_state=1)
In [22]:
# Predict on the test set
y_pred = classifier.predict(x_test)
In [23]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.705
In [24]:
print(confusion_matrix(y_test, y_pred))
[[  0  59]
 [  0 141]]
In [25]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        59
           1       0.70      1.00      0.83       141

    accuracy                           0.70       200
   macro avg       0.35      0.50      0.41       200
weighted avg       0.50      0.70      0.58       200
D:\ProgramFiles\anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
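
The confusion matrix for Model 2 has an empty first column, meaning every test sample was predicted as class 1. Its accuracy of 0.705 is therefore just the share of class 1 in the test set (141 of 200). The sketch below reproduces that with a hand-built prediction vector and shows how the zero_division parameter named in the warning sets the undefined precision explicitly and silences the warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Test labels with the same class counts as above: 59 zeros and 141 ones.
y_true = np.array([0] * 59 + [1] * 141)

# A "model" that always predicts class 1, like Model 2 on this test set.
y_all_good = np.ones_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_good)  # 141 / 200

# Precision for class 0 is 0/0 here; zero_division=0 defines it as 0 without a warning.
precision_bad = precision_score(y_true, y_all_good, pos_label=0, zero_division=0)

print(baseline_accuracy, precision_bad)
```

Any model that scores about 0.705 on this split should therefore be checked against this always-good baseline before its accuracy is taken at face value.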
In [26]:
# Get the feature importances and plot them by feature index
importance = classifier.feature_importances_
import matplotlib.pyplot as plt
plt.bar(range(len(importance)), importance)
plt.show()

4.3 Model 3: build a random forest model

In [27]:
# Model 3: random forest (max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25)
classifier = RandomForestClassifier(max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25, random_state = 1)
classifier.fit(x_train, y_train)
Out[27]:
RandomForestClassifier(max_depth=9, max_features=15, min_samples_leaf=5,
                       n_estimators=25, random_state=1)
In [28]:
# Predict on the test set
y_pred = classifier.predict(x_test)
In [29]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.755
In [30]:
print(confusion_matrix(y_test, y_pred))
[[ 25  34]
 [ 15 126]]
In [31]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.62      0.42      0.51        59
           1       0.79      0.89      0.84       141

    accuracy                           0.76       200
   macro avg       0.71      0.66      0.67       200
weighted avg       0.74      0.76      0.74       200
In [32]:
# Get the feature importances and plot them by feature index
importance = classifier.feature_importances_
import matplotlib.pyplot as plt
plt.bar(range(len(importance)), importance)
plt.show()
 

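Conclusion: the three models show how strongly the hyperparameters affect performance. Model 1 (max_depth=9, min_samples_leaf=5, n_estimators=50) reaches a test accuracy of 0.75. Model 2 (max_depth=3, min_samples_leaf=50, n_estimators=100) drops to 0.705 because it predicts every test sample as class 1, so its recall on class 0 is zero. Model 3 (max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25) reaches 0.755 and the best class 0 recall of the three (0.42).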
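
Trying parameter combinations by hand, as in the three models here, can be automated with a cross-validated grid search. The following is only a sketch under stated assumptions: it runs on synthetic data from make_classification, and the grid values are borrowed loosely from the three models rather than being a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the credit dataset).
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

# Illustrative grid, loosely based on the values tried in Models 1 to 3.
param_grid = {
    "max_depth": [3, 9],
    "min_samples_leaf": [5, 50],
    "n_estimators": [25, 50],
}

# 3-fold cross-validation over all 8 combinations, scored by accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=1), param_grid, cv=3, scoring="accuracy"
)
search.fit(x_demo, y_demo)

best_params = search.best_params_
print(best_params, search.best_score_)
```

Because the classes in the credit data are imbalanced, a scoring choice other than accuracy (for example "f1_macro" or "recall_macro") may be a better fit when applying this to x_train and y_train.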

 

posted @ 2022-03-15 17:33  Theext