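1. Import packages
2. Import the dataset
3. Data preprocessing
- 3.1 Detect and handle missing values
- 3.2 Encode categorical variables
- 3.3 Extract the independent and dependent variables
- 3.4 Split into training and test sets
- 3.5 Feature scaling
4. Build random forest models with different parameters
- 4.1 Model 1: build a random forest model
  - 4.1.1 Build the model
  - 4.1.2 Predict on the test set
  - 4.1.3 Evaluate model performance
  - 4.1.4 Rank variable importance
- 4.2 Model 2: build a random forest model
- 4.3 Model 3: build a random forest model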

Dataset link: https://www.cnblogs.com/ojbtospark/p/16009324.html

1. Import packages

In [1]:

# Import packages
import numpy as np
import pandas as pd

2. Import the dataset

In [2]:

# Import the dataset
data = pd.read_csv("german_credit_data.csv")
data

Out[2]:

	NO.	Age	Sex	Job	Housing	Saving accounts	Checking account	Credit amount	Duration	Purpose	Risk
0	0	67	male	2	own	NaN	little	1169	6	radio/TV	good
1	1	22	female	2	own	little	moderate	5951	48	radio/TV	bad
2	2	49	male	1	own	little	NaN	2096	12	education	good
3	3	45	male	2	free	little	little	7882	42	furniture/equipment	good
4	4	53	male	2	free	little	little	4870	24	car	bad
...	...	...	...	...	...	...	...	...	...	...	...
995	995	31	female	1	own	little	NaN	1736	12	furniture/equipment	good
996	996	40	male	3	own	little	little	3857	30	car	good
997	997	38	male	2	own	little	NaN	804	12	radio/TV	good
998	998	23	male	2	free	little	little	1845	45	radio/TV	bad
999	999	27	male	2	own	moderate	moderate	4576	45	car	good

1000 rows × 11 columns

3. Data preprocessing

3.1 Detect and handle missing values

In [3]:

# Detect missing values
null_df = data.isnull().sum()  # number of missing values per column
null_df

Out[3]:

NO.                   0
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64

In [4]:

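# Handle the Saving accounts and Checking account fields
for col in ['Saving accounts', 'Checking account']:  # fill missing values
    # 'none' indicates that these customers have no bank account;
    # assign the result back instead of using inplace=True on a column selection
    data[col] = data[col].fillna('none')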
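
The two columns can also be filled in a single assignment rather than a loop. A minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking the two account columns
toy = pd.DataFrame({
    "Saving accounts": ["little", np.nan, "rich"],
    "Checking account": [np.nan, "moderate", np.nan],
})

account_cols = ["Saving accounts", "Checking account"]
# Fill both columns at once and assign the result back
toy[account_cols] = toy[account_cols].fillna("none")

missing_after = int(toy.isnull().sum().sum())
print(toy)
print(missing_after)
```

Assigning the result back avoids relying on in-place modification of a column selection, which may not propagate to the parent frame in newer pandas versions.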

In [5]:

# Check for missing values again
null_df = data.isnull().sum() 
null_df

Out[5]:

NO.                 0
Age                 0
Sex                 0
Job                 0
Housing             0
Saving accounts     0
Checking account    0
Credit amount       0
Duration            0
Purpose             0
Risk                0
dtype: int64

3.2 Encode categorical variables

In [6]:

# Inspect the column dtypes before handling the Job field
print(data.dtypes)

NO.                  int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object

In [7]:

# Job holds category codes, so cast it to object to have get_dummies encode it
data['Job'] = data['Job'].astype('object')

In [8]:

# Encode categorical variables as dummy columns (the first level of each is dropped)
data = pd.get_dummies(data, drop_first=True)
data

Out[8]:

	NO.	Age	Credit amount	Duration	Sex_male	Job_1	Job_2	Job_3	Housing_own	Housing_rent	...	Checking account_none	Checking account_rich	Purpose_car	Purpose_domestic appliances	Purpose_education	Purpose_furniture/equipment	Purpose_radio/TV	Purpose_repairs	Purpose_vacation/others	Risk_good
0	0	67	1169	6	1	0	1	0	1	0	...	0	0	0	0	0	0	1	0	0	1
1	1	22	5951	48	0	0	1	0	1	0	...	0	0	0	0	0	0	1	0	0	0
2	2	49	2096	12	1	1	0	0	1	0	...	1	0	0	0	1	0	0	0	0	1
3	3	45	7882	42	1	0	1	0	0	0	...	0	0	0	0	0	1	0	0	0	1
4	4	53	4870	24	1	0	1	0	0	0	...	0	0	1	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
995	995	31	1736	12	0	1	0	0	1	0	...	1	0	0	0	0	1	0	0	0	1
996	996	40	3857	30	1	0	0	1	1	0	...	0	0	1	0	0	0	0	0	0	1
997	997	38	804	12	1	0	1	0	1	0	...	1	0	0	0	0	0	1	0	0	1
998	998	23	1845	45	1	0	1	0	0	0	...	0	0	0	0	0	0	1	0	0	0
999	999	27	4576	45	1	0	1	0	1	0	...	0	0	1	0	0	0	0	0	0	1

1000 rows × 25 columns
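
To see why the encoded frame has `Housing_own` and `Housing_rent` but no `Housing_free`, here is a small sketch of `drop_first=True` on a hypothetical toy frame. The `dtype=int` argument is an addition not used in the original cell; it forces 0/1 output on pandas versions that would otherwise produce booleans.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Housing": ["own", "free", "rent", "own"],
    "Age": [67, 22, 49, 45],
})

# drop_first=True drops the first level in sorted order ("free"),
# so a row with zeros in every Housing_* column means "free"
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per variable avoids perfectly collinear dummy columns; the dropped level becomes the implicit baseline.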

3.3 Extract the independent and dependent variables

In [9]:

# Extract the independent variables (x) and the dependent variable (y)
y = data['Risk_good'].values
data = data.drop(['Risk_good'], axis=1)
x = data.values

3.4 Split into training and test sets

In [10]:

# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(800, 24)
(200, 24)
(800,)
(200,)
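
Because the two risk classes are imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in both subsets. The sketch below uses made-up labels (70% ones, 30% zeros) rather than the real data, and the `stratify` argument is an addition that the original cell does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 ones and 30 zeros
y_toy = np.array([1] * 70 + [0] * 30)
x_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 70/30 ratio in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

train_ratio = y_tr.mean()
test_ratio = y_te.mean()
print(train_ratio, test_ratio)
```

Applying this to the notebook would change the split and therefore the scores reported below, so the outputs in this post correspond to the unstratified split only.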

3.5 Feature scaling

In [11]:

# Feature scaling: fit on the training set only, then apply to the test set
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
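
The scaler is fitted on the training set and only applied to the test set, so no test-set statistics leak into training. A small sketch on randomly generated stand-in data (not the credit dataset) shows the effect: the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in data, not the credit dataset
rng = np.random.default_rng(1)
x_tr = rng.normal(loc=50.0, scale=10.0, size=(80, 3))
x_te = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learns mean and std from training data
x_te_scaled = scaler.transform(x_te)      # reuses the training mean and std

print(x_tr_scaled.mean(axis=0).round(6))
print(x_tr_scaled.std(axis=0).round(6))
print(x_te_scaled.mean(axis=0).round(3))
```

Tree-based models split on thresholds and are generally insensitive to this kind of rescaling, so the step matters more for other algorithms than for the random forests built below.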

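4. Build random forest models with different parameters

4.1 Model 1: build a random forest model

4.1.1 Build the model

In [12]:

# Build random forest models with different parameters
# Model 1: random forest (max_depth=9, max_features='sqrt', min_samples_leaf=5, n_estimators=50)
# Note: max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=9, max_features='sqrt', min_samples_leaf=5, n_estimators=50, random_state=1)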
classifier.fit(x_train, y_train)

Out[12]:

RandomForestClassifier(max_depth=9, min_samples_leaf=5, n_estimators=50,
                       random_state=1)

4.1.2 Predict on the test set

In [13]:

# Predict on the test set
y_pred = classifier.predict(x_test)

4.1.3 Evaluate model performance

In [14]:

# Evaluate model performance
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))

0.75

In [15]:

print(confusion_matrix(y_test, y_pred))

[[ 18  41]
 [  9 132]]

In [16]:

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.31      0.42        59
           1       0.76      0.94      0.84       141

    accuracy                           0.75       200
   macro avg       0.71      0.62      0.63       200
weighted avg       0.73      0.75      0.72       200
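
The figures in the report can be recomputed by hand from the Model 1 confusion matrix, which helps when reading it: rows are the true classes and columns are the predicted classes. A short sketch using only the four counts printed above:

```python
import numpy as np

# Model 1 confusion matrix: rows = true class, columns = predicted class
cm = np.array([[18, 41],
               [9, 132]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / true count
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted count
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(recall.round(2), precision.round(2), f1.round(2))
```

The recall of 0.31 for class 0 means that only 18 of the 59 bad-risk customers in the test set are identified, even though overall accuracy is 0.75.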

4.1.4 Rank variable importance

In [17]:

# Get the variable importances and plot them by feature index
import matplotlib.pyplot as plt
importance = classifier.feature_importances_
plt.bar(range(len(importance)), importance)
plt.show()

In [20]:

data.columns[2]

Out[20]:

'Credit amount'

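The bar chart of variable importance shows that features 0, 1, 2, 3, and 15 have the largest influence on the dependent variable. Judging from the column order in Out[8], features 0 to 3 are NO., Age, Credit amount, and Duration. NO. is only a row identifier, so its high score most likely reflects noise rather than real signal, and it is worth dropping before modeling. Feature selection could improve the model further; it is covered in a later chapter.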
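
Reading feature indices off a bar chart is error-prone, so it can help to pair each importance with its column name and sort. The sketch below uses synthetic data from `make_classification` and placeholder feature names, both of which are stand-ins and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with placeholder feature names
x_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=1
)
feature_names = [f"feature_{i}" for i in range(x_demo.shape[1])]

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(x_demo, y_demo)

# Pair each importance with its name and sort from most to least important
ranking = pd.Series(model.feature_importances_, index=feature_names)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

In the notebook itself, the same idea would presumably be `pd.Series(classifier.feature_importances_, index=data.columns)`, assuming `data` still holds exactly the 24 feature columns that were used to build `x`.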

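4.2 Model 2: build a random forest model

In [21]:

# Model 2: random forest (max_depth=3, max_features='sqrt', min_samples_leaf=50, n_estimators=100)
# Note: max_features='auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'
classifier = RandomForestClassifier(max_depth=3, max_features='sqrt', min_samples_leaf=50, n_estimators=100, random_state=1)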
classifier.fit(x_train, y_train)

Out[21]:

RandomForestClassifier(max_depth=3, min_samples_leaf=50, random_state=1)

In [22]:

# Predict on the test set
y_pred = classifier.predict(x_test)

In [23]:

# Evaluate model performance
print(accuracy_score(y_test, y_pred))

0.705

In [24]:

print(confusion_matrix(y_test, y_pred))

[[  0  59]
 [  0 141]]

In [25]:

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        59
           1       0.70      1.00      0.83       141

    accuracy                           0.70       200
   macro avg       0.35      0.50      0.41       200
weighted avg       0.50      0.70      0.58       200

D:\ProgramFiles\anaconda3\envs\tensorflow\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
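
The warning arises because Model 2 never predicts class 0, so precision for that class divides by zero. A minimal sketch that reproduces the situation, with labels constructed from the supports reported above (59 zeros, 141 ones) and an all-ones prediction standing in for the model. Passing `zero_division=0` keeps the reported value at 0.0 while suppressing the warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Labels built from the reported supports; the prediction is all ones,
# standing in for a model that never predicts class 0
y_true = np.array([0] * 59 + [1] * 141)
y_all_ones = np.ones_like(y_true)

acc = accuracy_score(y_true, y_all_ones)
cm = confusion_matrix(y_true, y_all_ones)
report = classification_report(y_true, y_all_ones, zero_division=0)

print(acc)
print(cm)
print(report)
```

The accuracy of 0.705 is therefore just the share of class 1 in the test set, which is why accuracy alone is a misleading summary for this model.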

In [26]:

# Get the variable importances and plot them (plt was imported in In [17])
importance = classifier.feature_importances_
plt.bar(range(len(importance)), importance)
plt.show()

4.3 Model 3: build a random forest model

In [27]:

# Model 3: random forest (max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25)
classifier = RandomForestClassifier(max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25, random_state=1)
classifier.fit(x_train, y_train)

Out[27]:

RandomForestClassifier(max_depth=9, max_features=15, min_samples_leaf=5,
                       n_estimators=25, random_state=1)

In [28]:

# Predict on the test set
y_pred = classifier.predict(x_test)

In [29]:

# Evaluate model performance
print(accuracy_score(y_test, y_pred))

0.755

In [30]:

print(confusion_matrix(y_test, y_pred))

[[ 25  34]
 [ 15 126]]

In [31]:

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.42      0.51        59
           1       0.79      0.89      0.84       141

    accuracy                           0.76       200
   macro avg       0.71      0.66      0.67       200
weighted avg       0.74      0.76      0.74       200

In [32]:

# Get the variable importances and plot them (plt was imported in In [17])
importance = classifier.feature_importances_
plt.bar(range(len(importance)), importance)
plt.show()

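Conclusion: the three models show that hyperparameters noticeably affect performance. Model 1 reaches a test accuracy of 0.75 and Model 3 reaches 0.755, while Model 2 (max_depth=3, min_samples_leaf=50) scores 0.705 only because it predicts every test sample as class 1, so it never identifies a bad-risk customer.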
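
A comparison like this can also be run in a loop with cross-validation instead of a single train/test split. The sketch below reuses the three parameter sets from this post but runs them on synthetic data from `make_classification`, because the CSV file is not bundled here; the scores it prints are therefore not comparable to the results above. The choice of `balanced_accuracy` as the metric is also an assumption, picked because plain accuracy hides a model that ignores the minority class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the credit data: 24 features, roughly 30/70 class split
x_demo, y_demo = make_classification(
    n_samples=400, n_features=24, n_informative=6,
    weights=[0.3, 0.7], random_state=1
)

# The three hyperparameter sets used in this post ('sqrt' replaces 'auto')
configs = {
    "model_1": dict(max_depth=9, max_features="sqrt", min_samples_leaf=5, n_estimators=50),
    "model_2": dict(max_depth=3, max_features="sqrt", min_samples_leaf=50, n_estimators=100),
    "model_3": dict(max_depth=9, max_features=15, min_samples_leaf=5, n_estimators=25),
}

scores = {}
for name, params in configs.items():
    clf = RandomForestClassifier(random_state=1, **params)
    # Mean balanced accuracy over 5 folds
    fold_scores = cross_val_score(clf, x_demo, y_demo, cv=5, scoring="balanced_accuracy")
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(name, round(score, 3))
```

Averaging over folds reduces the dependence on one particular split, which matters with only 200 test samples.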


Machine Learning: Regression and Classification 4-2 (Random Forest Algorithm)

Main steps: