Machine Learning: Programming Basics 1

Preprocessing the Black Friday data (the goal is to get familiar with the workflow; a later post covers the prediction part)

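Main steps:

1. Import packages and the dataset
2. Handle missing data
3. Feature engineering
4. Handle categorical fields
5. Get the independent and dependent variables
6. Split into training and test sets
7. Feature scaling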

Dataset link: https://www.heywhale.com/mw/dataset/622f197b8a84f900178aa2c7/file

1. Import packages and the dataset

In [2]:

# Import packages
import numpy as np
import pandas as pd

In [3]:

# Load the dataset
data = pd.read_csv('BlackFriday.csv')
data.head(5)

Out[3]:

	User_ID	Product_ID	Gender	Age	Occupation	City_Category	Stay_In_Current_City_Years	Marital_Status	Product_Category_1	Product_Category_2	Product_Category_3	Purchase
0	1000001	P00069042	F	0-17	10	A	2	0.0	3	NaN	NaN	8370
1	1000001	P00248942	F	0-17	10	A	2	0.0	1	6.0	14.0	15200
2	1000001	P00087842	F	0-17	10	A	2	NaN	12	NaN	NaN	1422
3	1000001	P00085442	F	0-17	10	A	2	0.0	12	14.0	NaN	1057
4	1000002	P00285442	M	55+	16	C	4+	0.0	8	NaN	NaN	7969

2. Handle missing data

In [4]:

# Handle missing data
# Count missing values per column
null_df = data.isnull().sum()
null_df

Out[4]:

User_ID                           0
Product_ID                        0
Gender                            0
Age                               0
Occupation                        0
City_Category                     0
Stay_In_Current_City_Years        0
Marital_Status                    3
Product_Category_1                0
Product_Category_2            15721
Product_Category_3            34817
Purchase                          0
dtype: int64

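Marital_Status has 3 missing values; based on the business context, a missing value is assumed to mean unmarried, so it is filled with 0.
Product_Category_2 has 15721 missing values; based on the business context, this field is not important, so it is dropped.
Product_Category_3 has 34817 missing values; based on the business context, this field is not important, so it is dropped.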
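The drop decisions above rest on business judgment. As a complementary check, the share of missing values per column can be computed; with 50000 rows, 15721 and 34817 missing values correspond to roughly 31% and 70%. The following is a minimal sketch on a made-up toy frame. The 0.3 threshold and the names `toy`, `missing_ratio`, and `to_drop` are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mimicking the three columns that have gaps
toy = pd.DataFrame({
    'Marital_Status': [0.0, 1.0, np.nan, 0.0],
    'Product_Category_2': [np.nan, 6.0, np.nan, 14.0],
    'Product_Category_3': [np.nan, 14.0, np.nan, np.nan],
})

# Share of missing values per column (mean of a boolean mask)
missing_ratio = toy.isnull().mean()
print(missing_ratio)

# Hypothetical rule: flag columns whose missing share exceeds 30%
threshold = 0.3
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
print(to_drop)
```

A ratio is only a hint; whether a sparse column is worth keeping still depends on what it means for the problem.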

In [5]:

# Drop the 2 columns with many missing values
data = data.drop(['Product_Category_2', 'Product_Category_3'], axis = 1)

In [6]:

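# Fill the missing values in Marital_Status with 0
# (assign back instead of inplace=True on a column selection, which newer pandas versions warn about)
data['Marital_Status'] = data['Marital_Status'].fillna(0)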

In [7]:

# Count missing values again
null_df = data.isnull().sum()
null_df

Out[7]:

User_ID                       0
Product_ID                    0
Gender                        0
Age                           0
Occupation                    0
City_Category                 0
Stay_In_Current_City_Years    0
Marital_Status                0
Product_Category_1            0
Purchase                      0
dtype: int64

3. Feature engineering

Feature engineering is the process of transforming raw data into features that better express the nature of the problem.

In [8]:

# Feature engineering
# Drop columns that are not useful as features
data = data.drop(['User_ID', 'Product_ID'], axis = 1)

In [9]:

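# Clean the Stay_In_Current_City_Years column: map '4+' to 4, then convert to integers
# (assign back instead of inplace=True on a column selection, which newer pandas versions warn about)
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].replace('4+', '4')
data['Stay_In_Current_City_Years'] = data['Stay_In_Current_City_Years'].astype('int64')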
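An equivalent way to clean this column is to strip the trailing plus sign with the string accessor. This is a small sketch on a made-up series, assuming the column holds strings such as '2' and '4+' as in the preview in Out[3]; the names `s` and `cleaned` are illustrative.

```python
import pandas as pd

# Made-up sample of the raw string values
s = pd.Series(['2', '4+', '0', '1', '4+'], name='Stay_In_Current_City_Years')

# Remove a trailing '+' if present, then convert to integers
cleaned = s.str.rstrip('+').astype('int64')
print(cleaned.tolist())
print(cleaned.dtype)
```

Note that either approach treats "4 or more years" as exactly 4, which is a simplification.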

4. Handle categorical fields

In [10]:

# Handle categorical fields
# Inspect column dtypes to find categorical variables
print(data.dtypes)

Gender                         object
Age                            object
Occupation                      int64
City_Category                  object
Stay_In_Current_City_Years      int64
Marital_Status                float64
Product_Category_1              int64
Purchase                        int64
dtype: object

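Based on the business context, the Occupation, Marital_Status, and Product_Category_1 columns should be categorical even though they are stored as numbers, so they need to be converted.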

In [11]:

# Convert the dtypes so these columns are treated as categorical
data['Product_Category_1'] = data['Product_Category_1'].astype('object')
data['Occupation'] = data['Occupation'].astype('object')
data['Marital_Status'] = data['Marital_Status'].astype('object')

In [12]:

# Inspect the dtypes again
print(data.dtypes)

Gender                        object
Age                           object
Occupation                    object
City_Category                 object
Stay_In_Current_City_Years     int64
Marital_Status                object
Product_Category_1            object
Purchase                       int64
dtype: object

In [13]:

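# One-hot encode the categorical (object) columns
# dtype=int keeps the 0/1 display shown below; pandas >= 2.0 defaults to booleans
data = pd.get_dummies(data, drop_first = True, dtype = int)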
data.head(5)

Out[13]:

	Stay_In_Current_City_Years	Purchase	Gender_M	Age_55+	...	Product_Category_1_12
0	2	8370	0	0	...	0
1	2	15200	0	0	...	0
2	2	1422	0	0	...	1
3	2	1057	0	0	...	1
4	4	7969	1	1	...	0

5 rows × 49 columns
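To see what drop_first does, the following sketch encodes a tiny made-up frame with and without it. With three categories, drop_first = True keeps two indicator columns, and the dropped category (here A) is represented by all zeros. The frame `toy` and the names `full` and `reduced` are illustrative.

```python
import pandas as pd

# Tiny made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    'City_Category': ['A', 'B', 'C', 'A'],
    'Purchase': [8370, 15200, 1422, 1057],
})

# One indicator column per category
full = pd.get_dummies(toy, dtype=int)
# Drop the first category to avoid a redundant (perfectly collinear) column
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(reduced)
```

Numeric columns such as Purchase pass through unchanged, which is why Occupation, Marital_Status, and Product_Category_1 had to be converted to object first.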

5. Get the independent and dependent variables

In [14]:

# Get the independent variables (x) and the dependent variable (y)
y = data['Purchase'].values
print(y.shape)
data = data.drop(['Purchase'], axis = 1)
x = data.values
print(x.shape)

(50000,)
(50000, 48)

6. Split into training and test sets

In [15]:

# Split into training and test sets (70% / 30%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 205)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(35000, 48)
(15000, 48)
(35000,)
(15000,)

The independent variables are stored in x_train and x_test, and the dependent variable is stored in y_train and y_test.
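The effect of test_size and random_state can be checked on a small made-up array. This sketch, using illustrative names such as `x_toy` and `a_train`, shows that a 30% test share of 10 rows gives a 7/3 split, that the same random_state reproduces the same split, and that rows of x and y stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of x_toy is [2i, 2i+1], and y_toy[i] is i
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
a_train, a_test, b_train, b_test = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=205)
c_train, c_test, d_train, d_test = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=205)

print(a_train.shape, a_test.shape)
# Same seed, same split
print(np.array_equal(a_test, c_test))
# Features and targets are shuffled together, so they still match row by row
print(np.array_equal(a_train[:, 0] // 2, b_train))
```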

7. Feature scaling

In [16]:

y_train.shape

Out[16]:

(35000,)

In [17]:

a = y_train.reshape(-1, 1)
a.shape

Out[17]:

(35000, 1)

In [18]:

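# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics for the test set
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
# StandardScaler expects a 2D array, so reshape y_train to a column and flatten the result
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))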

In [19]:

x_train[:3,:]

Out[19]:

array([[ 0.10343173, -1.77050054, -0.4873769 ,  1.24137799, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064, -0.84608185, -0.66227816,
        -0.83770833, -0.2140971 , -0.19340603, -0.14727185,  1.59734051,
        -0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ],
       [-1.43966733,  0.56481203,  2.05180016, -0.80555641, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963,  2.5951506 , -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035, -0.14868325, -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064,  1.18191875, -0.66227816,
        -0.83770833, -0.2140971 , -0.19340603, -0.14727185, -0.62604059,
        -0.19412034, -0.08274392,  1.93889522, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ],
       [-0.6681178 ,  0.56481203, -0.4873769 ,  1.24137799, -0.49566581,
        -0.2961413 , -0.27386969, -0.20083646, -0.30534824, -0.22524545,
        -0.18825963, -0.38533409, -0.1388486 , -0.18621612, -0.33838987,
        -0.05537615, -0.10670196, -0.16153035,  6.7257073 , -0.24336363,
        -0.12529891, -0.22657272, -0.1519627 , -0.2198004 , -0.28316115,
        -0.11100179, -0.12739959, -0.26453064,  1.18191875, -0.66227816,
         1.19373291, -0.2140971 , -0.19340603, -0.14727185,  1.59734051,
        -0.19412034, -0.08274392, -0.51575763, -0.02726553, -0.09620978,
        -0.21851165, -0.08396038, -0.09756185, -0.05216966, -0.10900638,
        -0.13139418, -0.03339953, -0.0726975 ]])

In [20]:

y_train[:3]

Out[20]:

array([-0.12211265, -1.46218147, -0.78499224])

After scaling, the values of all features in x_train and y_train lie in similar ranges.
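StandardScaler standardizes each column as z = (x - mean) / std, using the mean and population standard deviation learned from the data it was fitted on. The sketch below checks this on a small made-up array and shows that the test row is scaled with the training statistics. The arrays `train` and `test` and the other names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test = np.array([[4.0, 700.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training mean and std

# Manual standardization with the population standard deviation (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row can fall outside that range because it was not used to compute the statistics.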

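Conclusion: data preprocessing follows a fairly standard set of steps, and Python offers a rich set of libraries that make this work convenient. Preprocessing turned the raw data into x_train, y_train, x_test, and y_test. In the next chapter, the first two variables will be used to train the model and the last two to evaluate it.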
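One detail matters for the evaluation step: in In [18] only y_train is scaled, while y_test keeps its original units. Predictions from a model trained on the scaled target would therefore need to be mapped back before comparing them with y_test. The following is a minimal sketch of that round trip with made-up values; `pred_scaled` stands in for hypothetical model output and is not produced by any model here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values in original units
y_toy = np.array([8370.0, 15200.0, 1422.0, 1057.0, 7969.0])

sc = StandardScaler()
y_scaled = np.ravel(sc.fit_transform(y_toy.reshape(-1, 1)))

# Hypothetical predictions on the scaled axis (here simply a copy of the scaled targets)
pred_scaled = y_scaled.copy()

# Map predictions back to the original units before comparing with unscaled targets
pred = np.ravel(sc.inverse_transform(pred_scaled.reshape(-1, 1)))
print(np.allclose(pred, y_toy))
```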

posted @ 2022-03-15 16:28 Theext