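1. Import packages
2. Import the dataset
3. Data preprocessing
- 3.1 Detect and handle missing values
- 3.2 Drop unused columns
- 3.3 Check categorical variables
- 3.4 Label encoding & one-hot encoding
- 3.5 Get the independent and dependent variables
- 3.6 Split into training and test sets
- 3.7 Feature scaling
4. Build AdaBoost regression models with different parameters
- 4.1 Model 1: build an AdaBoost regression model
  - 4.1.1 Build the model
  - 4.1.2 Predict on the test set
  - 4.1.3 Evaluate model performance
- 4.2 Model 2: build an AdaBoost regression model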

Dataset link: https://www.cnblogs.com/ojbtospark/p/16005660.html

1. Import packages

In [2]:

# Import packages
import numpy as np
import pandas as pd

2. Import the dataset

In [3]:

# Import the dataset
data = pd.read_csv('BlackFriday.csv')
data

Out[3]:

	User_ID	Product_ID	Gender	Age	Occupation	City_Category	Stay_In_Current_City_Years	Marital_Status	Product_Category_1	Product_Category_2	Product_Category_3	Purchase
0	1000001	P00069042	F	0-17	10	A	2	0	3	NaN	NaN	8370
1	1000001	P00248942	F	0-17	10	A	2	0	1	6.0	14.0	15200
2	1000001	P00087842	F	0-17	10	A	2	0	12	NaN	NaN	1422
3	1000001	P00085442	F	0-17	10	A	2	0	12	14.0	NaN	1057
4	1000002	P00285442	M	55+	16	C	4+	0	8	NaN	NaN	7969
...	...	...	...	...	...	...	...	...	...	...	...	...
49995	1001649	P00102642	M	18-25	19	C	2	1	4	8.0	9.0	1374
49996	1001649	P00035842	M	18-25	19	C	2	1	5	6.0	9.0	5372
49997	1001649	P00052842	M	18-25	19	C	2	1	10	15.0	NaN	18879
49998	1001649	P00183142	M	18-25	19	C	2	1	15	NaN	NaN	17029
49999	1001650	P00155642	M	26-35	19	C	1	0	8	NaN	NaN	6093

50000 rows × 12 columns

3. Data preprocessing

3.1 Detect and handle missing values

In [4]:

# Count missing values per column
null_df = data.isnull().sum()
null_df

Out[4]:

User_ID                           0
Product_ID                        0
Gender                            0
Age                               0
Occupation                        0
City_Category                     0
Stay_In_Current_City_Years        0
Marital_Status                    0
Product_Category_1                0
Product_Category_2            15721
Product_Category_3            34817
Purchase                          0
dtype: int64

In [5]:

# Drop the two columns that contain missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)

In [6]:

# Check for missing values again
null_df = data.isnull().sum()
null_df

Out[6]:

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Purchase                      0
dtype: int64
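
Out[4] reports 15721 and 34817 missing values out of 50000 rows, i.e. roughly 31% and 70% of the two columns. A minimal sketch of turning such counts into ratios and dropping columns above a cutoff is shown below. The toy DataFrame and the `max_missing` threshold are assumptions made for illustration; the original notebook simply drops the two columns by name.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one partly missing column
toy = pd.DataFrame({
    'Product_Category_1': [3, 1, 12, 12, 8],
    'Product_Category_2': [np.nan, 6.0, np.nan, 14.0, np.nan],
    'Purchase': [8370, 15200, 1422, 1057, 7969],
})

# Fraction of missing values per column (mean of the boolean mask)
missing_ratio = toy.isnull().mean()

# Hypothetical cutoff: drop any column with more than 30% missing values
max_missing = 0.3
cols_to_drop = missing_ratio[missing_ratio > max_missing].index.tolist()
cleaned = toy.drop(columns=cols_to_drop)

print(missing_ratio)
print(cols_to_drop)
```

Dropping is the simplest option; imputing the missing categories would keep the information in those columns at the cost of extra preprocessing.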

3.2 Drop unused columns

In [7]:

# Drop the identifier columns, which are not used as features
data = data.drop(['User_ID', 'Product_ID'], axis = 1)

3.3 Check categorical variables

In [8]:

# Check which columns are categorical
print(data.dtypes)

Gender                        object
Age                           object
Occupation                     int64
City_Category                 object
Stay_In_Current_City_Years    object
Marital_Status                 int64
Product_Category_1             int64
Purchase                       int64
dtype: object

In [9]:

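# Convert column types
# Assign the result back instead of calling replace(..., inplace=True) on a column,
# which may not modify the DataFrame in recent pandas versions
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', '4').astype('int64')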
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')

In [10]:

# Check the column types again
print(data.dtypes)

Gender                        object
Age                           object
Occupation                    object
City_Category                 object
Stay_In_Current_City_Years     int64
Marital_Status                object
Product_Category_1            object
Purchase                       int64
dtype: object
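
The `Stay_In_Current_City_Years` column is stored as text because of the `'4+'` level. The conversion to integers can be sketched on a toy Series as follows; the values below are made up for illustration, and mapping `'4+'` to 4 treats "four or more years" as exactly four, which is a simplification.

```python
import pandas as pd

# Toy column (hypothetical values) mirroring the string levels in the dataset
years = pd.Series(['2', '4+', '1', '0', '4+'], name='Stay_In_Current_City_Years')

# Map the open-ended level to '4', then cast the strings to integers
years_int = years.replace('4+', '4').astype('int64')

print(years_int.tolist())
print(years_int.dtype)
```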

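3.4 Label encoding & one-hot encoding

In [11]:

# One-hot encode every object-typed column; drop_first=True drops one level per variable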
data = pd.get_dummies(data, drop_first = True)
data

Out[11]:

	Stay_In_Current_City_Years	Purchase	Gender_M	Age_18-25	Age_26-35	Age_36-45	Age_46-50	Age_51-55	Age_55+	Occupation_1	...	Product_Category_1_9	Product_Category_1_10	Product_Category_1_11	Product_Category_1_12	Product_Category_1_13	Product_Category_1_14	Product_Category_1_15	Product_Category_1_16	Product_Category_1_17	Product_Category_1_18
0	2	8370	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	2	15200	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	2	1422	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0
3	2	1057	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0
4	4	7969	1	0	0	0	0	0	1	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2	1374	1	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
49996	2	5372	1	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
49997	2	18879	1	1	0	0	0	0	0	0	...	0	1	0	0	0	0	0	0	0	0
49998	2	17029	1	1	0	0	0	0	0	0	...	0	0	0	0	0	0	1	0	0	0
49999	1	6093	1	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

50000 rows × 49 columns
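
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The column names and values are assumptions for illustration, and `dtype=int` is passed only so that the output shows 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame (hypothetical): one categorical column and one numeric column
toy = pd.DataFrame({
    'City_Category': ['A', 'B', 'C', 'A'],
    'Stay_In_Current_City_Years': [1, 2, 3, 4],
})

# Only the object-typed column is expanded; the first level ('A') is dropped,
# so a row with City_Category_B == 0 and City_Category_C == 0 means category A
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded)
```

With k levels, `drop_first=True` yields k - 1 indicator columns, which avoids perfectly collinear features.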

3.5 Get the independent and dependent variables

In [12]:

# Separate the target (Purchase) from the features
y = data['Purchase'].values
data = data.drop(['Purchase'], axis = 1)
x = data.values

3.6 Split into training and test sets

In [13]:

# Split into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(35000, 48)
(15000, 48)
(35000,)
(15000,)

3.7 Feature scaling

In [14]:

# Feature scaling: fit the scalers on the training set only
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
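
As a rough illustration of what the target scaler does, the sketch below standardizes a toy target with `StandardScaler`, reproduces the result by hand as (y - mean) / std, and maps it back with `inverse_transform`. The numbers are made up; only the scikit-learn and NumPy calls are real.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy target values (hypothetical)
y_toy = np.array([1000.0, 5000.0, 9000.0, 13000.0])

# StandardScaler expects a 2D array, hence the reshape
sc = StandardScaler()
y_scaled = sc.fit_transform(y_toy.reshape(-1, 1)).ravel()

# Same computation by hand; np.std uses the population std (ddof=0) by default,
# which matches what StandardScaler uses
y_manual = (y_toy - y_toy.mean()) / y_toy.std()

# inverse_transform undoes the scaling: y = y_scaled * std + mean
y_back = sc.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(y_back)
```

Because the model is trained on the scaled target, its predictions have to go through `inverse_transform` before they can be compared with `y_test`.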

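4. Build AdaBoost regression models with different parameters

4.1 Model 1: build an AdaBoost regression model

4.1.1 Build the model

In [15]:

# Build AdaBoost regression models with different parameters
# Model 1: AdaBoost regression model (base_estimator=None, n_estimators=50, learning_rate=1)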
from sklearn.ensemble import AdaBoostRegressor
regressor = AdaBoostRegressor(n_estimators=50, learning_rate=1, loss='linear', random_state=0)
regressor.fit(x_train, y_train)

Out[15]:

AdaBoostRegressor(learning_rate=1, random_state=0)

4.1.2 Predict on the test set

In [16]:

# Predict on the test set (predictions are on the scaled target)
y_pred = regressor.predict(x_test)
y_pred[:5]

Out[16]:

array([ 0.65004612,  0.18805047,  0.55246384,  0.55246384, -0.01299079])

In [17]:

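# Convert y_pred back to the original (unscaled) units
# Recent scikit-learn versions require a 2D array here, hence the reshape
y_pred = np.ravel(sc_y.inverse_transform(y_pred.reshape(-1, 1)))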
y_pred[:5]

Out[17]:

array([12501.03102111, 10206.94045722, 12016.4753753 , 12016.4753753 ,
        9208.64781524])

4.1.3 Evaluate model performance

In [18]:

# Evaluate model performance with the R2 score on the test set
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)

R2 Score: 0.3075923325444474
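
For reference, the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small sketch with made-up numbers, checked against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true values and predictions (hypothetical)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares and total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions and 0 means no better than always predicting the mean, so a value near 0.31 leaves most of the variance in `Purchase` unexplained.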

4.2 Model 2: build an AdaBoost regression model

In [29]:

# Model 2: AdaBoost regression model (base_estimator=DecisionTreeRegressor, n_estimators=1000, learning_rate=0.2)
# Note: scikit-learn 1.2 renamed base_estimator to estimator (the old name was removed in 1.4)
from sklearn.tree import DecisionTreeRegressor
regressor = AdaBoostRegressor(base_estimator = DecisionTreeRegressor(min_samples_split=100, max_depth=10, min_samples_leaf=10),
                              n_estimators=1000, learning_rate=0.2, loss='linear', random_state=0)
regressor.fit(x_train, y_train)

Out[29]:

AdaBoostRegressor(base_estimator=DecisionTreeRegressor(max_depth=10,
                                                       min_samples_leaf=10,
                                                       min_samples_split=100),
                  learning_rate=0.2, n_estimators=1000, random_state=0)

In [30]:

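# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original (unscaled) units; reshape because recent scikit-learn requires 2D input
y_pred = np.ravel(sc_y.inverse_transform(y_pred.reshape(-1, 1)))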

In [31]:

# Evaluate model performance
r2 = r2_score(y_test, y_pred)
print("R2 Score:", r2)

R2 Score: 0.6070474824774648
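
One way to see how the number of estimators affects the score without refitting is `staged_predict`, which yields the ensemble's prediction after each boosting iteration. The sketch below uses synthetic data from `make_regression` rather than BlackFriday.csv, and its sizes and parameters are assumptions chosen so that it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical stand-in for the real dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Default base learner, small ensemble
model = AdaBoostRegressor(n_estimators=50, learning_rate=0.2, loss='linear', random_state=0)
model.fit(X_train, y_train)

# Test-set R2 after 1, 2, ..., k boosting iterations
staged_r2 = [r2_score(y_test, y_stage) for y_stage in model.staged_predict(X_test)]

# Print a few checkpoints (boosting may stop early, so clamp to the actual length)
for n in sorted({1, min(10, len(staged_r2)), len(staged_r2)}):
    print(n, round(staged_r2[n - 1], 3))
```

Plotting `staged_r2` against the iteration number gives a quick picture of whether adding more estimators still helps.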

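Conclusion: the two models show how strongly the hyperparameters affect performance. With the default settings (Model 1) the test-set R2 is about 0.31, while Model 2, which uses a depth-10 decision tree as the base learner with 1000 estimators and a learning rate of 0.2, reaches about 0.61.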

Machine Learning: Regression and Classification 4-3 (AdaBoost Algorithm)