python 数据预处理之 MinMaxScaler 可视化体验

数据预处理时进行特征值的放缩,应该在训练集合测试集上进行相同的放缩,换言之放缩的标准都应该是在测试集上学习到的。

下面展示最大最下放缩的效果:

from matplotlib.pyplot import as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# 构造数据
X, _ = make_blobs(n_samples=60, 
                  centers=5,
                  random_state=7,
                  cluster_std=2
                 )

X_train, X_test = train_test_split(X, 
                                   random_state=9,
                                   test_size=0.1
                                  )


fig, axes = plt.subplots(nrows=1,
                         ncols=3,
                         figsize=(13, 4)
                        )

# 绘制未经放缩的数据的训练集和测试集
axes[0].scatter(X_train[:, 0],
                X_train[:, 1],
                c='b',
                label='Training set',
                s=60
               )

axes[0].scatter(X_test[:, 0],
                X_test[:, 1],
                marker='^',
                c='r', 
                label='Test set',
                s=60
               )

axes[0].legend(loc=1)
axes[0].set_title('Original Data')

# 利用 MinMaxScaler 放缩数据
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 可视化正确放缩的数据
axes[1].scatter(X_train_scaled[:, 0],
                X_train_scaled[:, 1],
                c='b',
                label='Training set',
                s=60
               )

axes[1].scatter(X_test_scaled[:, 0],
                X_test_scaled[:, 1],
                marker='^',
                c='r',
                label='Test set',
                s=60
               )

axes[1].set_title('Scaled Data')

# 对测试集进行单独放缩
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)

# 可视化错误放缩的数据
axes[2].scatter(X_train_scaled[:, 0],
                X_train_scaled[:, 1],
                c='b',
                label='Training set',
                s=60
               )

axes[2].scatter(X_test_scaled_badly[:, 0],
                X_test_scaled_badly[:, 1],
                marker='^',
                c='r',
                label='Test set',
                s=60
               )

axes[2].set_title('Improperly Scaled Data') 

# 为每幅图添加坐标轴标题
for ax in axes:
    ax.set_xlabel('Feature 0')
    ax.set_ylabel('Feature 1')
    
plt.show()

 

 

 

 

前两张图看起来一样,但是坐标刻度发生了变化,这便是特征放缩的效果。第三张图是对测试集的放缩是错误的,人为改变了数据的排列。

 

 

posted @ 2019-11-20 22:56  赏尔  阅读(5459)  评论(0)    收藏  举报