Study Notes: SciKit-Learn
transformer
sklearn.preprocessing.MinMaxScaler()
sklearn.preprocessing.StandardScaler()
sklearn.preprocessing.OneHotEncoder()
sklearn.preprocessing.LabelEncoder()
sklearn.preprocessing.RobustScaler()
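A minimal sketch of how the scalers listed above are typically used (fit on the data, then transform); the toy array X below is made up purely for illustration:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# toy data: 3 samples, 2 features (illustrative values only)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 300.0]])

# MinMaxScaler rescales each feature to the [0, 1] range
print(MinMaxScaler().fit_transform(X))

# StandardScaler removes the per-feature mean and divides by the standard deviation
print(StandardScaler().fit_transform(X))

# RobustScaler centers on the median and scales by the interquartile range,
# so it is less sensitive to outliers such as the 300.0 above
print(RobustScaler().fit_transform(X))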
LabelEncoder simply maps the distinct categories to sequential integer codes.
OneHotEncoder maps a feature with n_1 categories to a row vector of length n_1, of the form [1, 0, ..., 0]; with multiple features, the vectors are concatenated into a row vector of length n_1 + ... + n_k.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

x = [[1, 2, 3], [1, 1, 3], [2, 1, 3]]
x = pd.DataFrame(x, columns=[0, 1, 2])

# label
encoder_1 = LabelEncoder()
x_1 = encoder_1.fit_transform(x.iloc[:, 0])
print(x_1)

# onehot
encoder_2 = OneHotEncoder()
encoder_2.fit(x)  # fit on the original data first
x_2 = encoder_2.transform([[2, 2, 3]]).toarray()
print(x_2)  # column 0 of x has 2 categories: 1 maps to [1, 0], 2 maps to [0, 1]
Output:
[0 0 1]
[[0. 1. 0. 1. 1.]]
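Both encoders can also map codes back to the original categories. A short sketch continuing the example above (LabelEncoder has always had inverse_transform; for OneHotEncoder it is available in recent scikit-learn versions, 0.20+):

# map integer codes back to the original labels of column 0
print(encoder_1.inverse_transform([0, 0, 1]))               # -> [1 1 2]

# map a one-hot row back to the original feature values
print(encoder_2.inverse_transform([[0., 1., 0., 1., 1.]]))  # -> [[2 2 3]]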
OneHotEncoder(n_values='auto')  # the number of dimensions used for each feature is inferred from the dataset, i.e. one position per category
OneHotEncoder(n_values=[2, 3, 4])  # explicitly specify the number of dimensions per feature (useful when the training set is missing some categories)
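Note that the n_values parameter has been removed in newer scikit-learn versions (0.22+); the equivalent there is the categories parameter, roughly as follows:

from sklearn.preprocessing import OneHotEncoder

# categories='auto' (the default): infer each feature's categories from the data
enc_auto = OneHotEncoder(categories='auto')

# explicitly list the possible categories per feature, useful when the
# training set may be missing some categories
enc_fixed = OneHotEncoder(categories=[[1, 2], [1, 2, 3], [1, 2, 3, 4]])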
classifier
regressor
Pipeline
(1) Serial usage: the processing flow is set up through the steps parameter. Each step has the form ('key', value), where key is the name you give that step and value is the corresponding processing class; the steps are passed in together as a list. The classes in the first n-1 steps must all have a transform method; the last step need not, and is usually a model. Using the simple iris dataset as an example:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])
pipe.fit(iris.data, iris.target)
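Once fitted, the pipeline behaves like a single estimator; a small usage sketch continuing the example above:

# predict and score pass the data through PCA and then the SVC automatically
print(pipe.predict(iris.data[:5]))
print(pipe.score(iris.data, iris.target))

# individual steps can be accessed by the name given in `steps`
print(pipe.named_steps['pca'].explained_variance_ratio_)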
(2) Via the make_pipeline function: a simplified wrapper around the Pipeline class. Just pass in the class instance for each step; there is no need to name the steps yourself, as the lowercase class name is automatically used as the step name:
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
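The auto-generated step names can be checked via named_steps (here they would be 'robustscaler' and 'lasso'), which is what the double-underscore syntax below refers to:

# the lowercase class names become the step names: 'robustscaler', 'lasso'
print(pipe.named_steps.keys())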
The parameters passed to each step's class can also be reset via set_params, in the form set_params(stepname__param=value):
pipe.set_params(lasso__alpha=0.0001)  # change alpha from 0.0005 to 0.0001
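The same stepname__param convention also works for grid search over pipeline parameters; a minimal sketch assuming the pipe defined above (no training data is defined in these notes, so the fit call is left as a comment):

from sklearn.model_selection import GridSearchCV

# search over the lasso step's alpha using the stepname__param naming
param_grid = {'lasso__alpha': [0.0001, 0.0005, 0.001]}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X, y) would then refit the whole pipeline for each alpha value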
Feature selection (feature_selection)
# Variance threshold filtering
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)  # keep only features with variance > 0.1
selector.fit_transform(features)

# Recursive Feature Elimination
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=4, step=1)  # n_features_to_select sets how many features to keep
rfe.fit(X, y)

# Select features by importance
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X, y)
model = SelectFromModel(ridge, prefit=True, threshold='mean')
X_transformed = model.transform(X)
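All three selectors above expose get_support(), which returns a boolean mask of the selected features; a short sketch continuing the code above:

# boolean mask over the original feature columns, True = kept
print(selector.get_support())   # VarianceThreshold
print(rfe.get_support())        # RFE
print(model.get_support())      # SelectFromModel

# ranking_ on RFE gives each feature's elimination rank (1 = selected)
print(rfe.ranking_)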
