Python data preprocessing: a visual look at MinMaxScaler
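When scaling features during preprocessing, the same transformation must be applied to both the training set and the test set. In other words, the scaling parameters (each feature's minimum and maximum) should be learned from the training set only and then reused unchanged on the test set.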
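To make the rule concrete, here is a minimal sketch of min-max scaling written by hand with NumPy and compared against MinMaxScaler. The toy arrays and the helper name min_max_scale are made up purely for illustration and are not part of the example that follows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values, chosen only for illustration)
X_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
X_test = np.array([[2.0, 35.0], [6.0, 15.0]])

# The statistics are learned from the training set only
train_min = X_train.min(axis=0)
train_max = X_train.max(axis=0)


def min_max_scale(X):
    """Map each feature to (x - train_min) / (train_max - train_min)."""
    return (X - train_min) / (train_max - train_min)


manual_train = min_max_scale(X_train)
manual_test = min_max_scale(X_test)

# MinMaxScaler fitted on the training set gives the same result
scaler = MinMaxScaler().fit(X_train)
assert np.allclose(manual_train, scaler.transform(X_train))
assert np.allclose(manual_test, scaler.transform(X_test))

# Test values may fall outside [0, 1]; that is expected and correct
print(manual_test)  # values: [[0.25, 1.25], [1.25, 0.25]]
```

Note that the scaled test values are not clipped to [0, 1]: they are simply expressed on the scale defined by the training data.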
The following shows the effect of min-max scaling:
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Generate synthetic data
X, _ = make_blobs(n_samples=60,
                  centers=5,
                  random_state=7,
                  cluster_std=2
                  )
X_train, X_test = train_test_split(X,
                                   random_state=9,
                                   test_size=0.1
                                   )
fig, axes = plt.subplots(nrows=1,
                         ncols=3,
                         figsize=(13, 4)
                         )
# Plot the unscaled training and test sets
axes[0].scatter(X_train[:, 0],
                X_train[:, 1],
                c='b',
                label='Training set',
                s=60
                )
axes[0].scatter(X_test[:, 0],
                X_test[:, 1],
                marker='^',
                c='r',
                label='Test set',
                s=60
                )
axes[0].legend(loc=1)
axes[0].set_title('Original Data')
# Scale the data with MinMaxScaler, fitting on the training set only
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Plot the correctly scaled data
axes[1].scatter(X_train_scaled[:, 0],
                X_train_scaled[:, 1],
                c='b',
                label='Training set',
                s=60
                )
axes[1].scatter(X_test_scaled[:, 0],
                X_test_scaled[:, 1],
                marker='^',
                c='r',
                label='Test set',
                s=60
                )
axes[1].set_title('Scaled Data')
# Scale the test set separately (incorrect, shown only for comparison)
test_scaler = MinMaxScaler()
test_scaler.fit(X_test)
X_test_scaled_badly = test_scaler.transform(X_test)
# Plot the improperly scaled data
axes[2].scatter(X_train_scaled[:, 0],
                X_train_scaled[:, 1],
                c='b',
                label='Training set',
                s=60
                )
axes[2].scatter(X_test_scaled_badly[:, 0],
                X_test_scaled_badly[:, 1],
                marker='^',
                c='r',
                label='Test set',
                s=60
                )
axes[2].set_title('Improperly Scaled Data')
# Add axis labels to every panel
for ax in axes:
    ax.set_xlabel('Feature 0')
    ax.set_ylabel('Feature 1')
plt.show()
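The first two panels look the same apart from the axis ticks, and that is exactly the effect of feature scaling: every point keeps its position relative to the others, and only the scale changes. In the third panel the test set was scaled on its own, which is wrong. Its points were stretched with a different minimum and maximum, so they moved relative to the training points and the arrangement of the data was artificially changed.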
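The difference between the second and third panels can also be checked numerically. The sketch below rebuilds the same data with the same parameters as above; the variable names train_scaler and shift are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Same data and split as in the plotting example
X, _ = make_blobs(n_samples=60, centers=5, random_state=7, cluster_std=2)
X_train, X_test = train_test_split(X, random_state=9, test_size=0.1)

# Correct: reuse the scaler fitted on the training set
train_scaler = MinMaxScaler().fit(X_train)
X_test_scaled = train_scaler.transform(X_test)

# Incorrect: fit a second scaler on the test set itself
X_test_scaled_badly = MinMaxScaler().fit_transform(X_test)

# A separately scaled test set is always stretched to fill [0, 1] per feature,
# regardless of where its points really lie relative to the training data
print(X_test_scaled_badly.min(axis=0), X_test_scaled_badly.max(axis=0))

# The correctly scaled test set keeps its true range on the training scale
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))

# Distance each test point moved because of the wrong scaling
shift = np.linalg.norm(X_test_scaled_badly - X_test_scaled, axis=1)
print(shift)
```

Any nonzero entry in shift means that a model trained on X_train_scaled would see that test point in the wrong place.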