Study Notes: SciKit-Learn

 

 

transformer

sklearn.preprocessing.MinMaxScaler()
sklearn.preprocessing.StandardScaler()
sklearn.preprocessing.OneHotEncoder()
sklearn.preprocessing.LabelEncoder()
sklearn.preprocessing.RobustScaler()
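The scalers listed above differ in the range they map to. A minimal sketch on made-up data (MinMaxScaler rescales each feature to [0, 1]; StandardScaler centers to zero mean and unit variance):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0]])

# MinMaxScaler: (x - min) / (max - min), so each column lands in [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())   # [0.  0.5 1. ]

# StandardScaler: (x - mean) / std, so each column has mean 0, variance 1
print(StandardScaler().fit_transform(X).ravel())
```

RobustScaler follows the same fit/transform interface but uses the median and interquartile range, which makes it less sensitive to outliers.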

LabelEncoder simply maps each distinct category to a sequential integer code (0, 1, ...).
OneHotEncoder maps a feature with n_1 categories to a one-hot row vector of length n_1, of the form [1, 0, ..., 0]; with multiple features, the vectors are concatenated into a row vector of length n_1 + ... + n_k.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd

x = [[1, 2, 3],
     [1, 1, 3],
     [2, 1, 3]]
x = pd.DataFrame(x, columns=[0, 1, 2])

# LabelEncoder
encoder_1 = LabelEncoder()
x_1 = encoder_1.fit_transform(x.iloc[:, 0])  # first column [1, 1, 2] -> integer codes
print(x_1)

# OneHotEncoder
encoder_2 = OneHotEncoder()
encoder_2.fit(x)  # fit on the original data first
x_2 = encoder_2.transform([[2, 2, 3]]).toarray()
print(x_2)  # column 0 of x has 2 categories: 1 -> [1, 0], 2 -> [0, 1]

Output:

[0 0 1]
[[0. 1. 0. 1. 1.]]
OneHotEncoder(n_values='auto')  # width of each feature's encoding inferred from the data: one position per observed category
OneHotEncoder(n_values=[2, 3, 4])  # explicitly fix each feature's width (for when the training set is missing some categories); note that n_values was removed in scikit-learn 0.22 in favor of categories
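In scikit-learn >= 0.22 the `n_values` parameter no longer exists; the `categories` parameter covers both cases. A minimal sketch (the data here is made up):

```python
from sklearn.preprocessing import OneHotEncoder

# categories='auto' (the default): categories are inferred from the data
enc = OneHotEncoder(categories='auto')

# explicit per-feature category lists, for when the training set
# may be missing some categories
enc = OneHotEncoder(categories=[[1, 2], [1, 2, 3]])
enc.fit([[1, 1], [2, 3]])
print(enc.transform([[2, 2]]).toarray())  # [[0. 1. 0. 1. 0.]]
```

Feature 0 has the declared categories [1, 2] (2 positions) and feature 1 has [1, 2, 3] (3 positions), so category 2 of feature 1 encodes as [0, 1, 0] even though it never appeared in the training data.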

 

classifier

 

regressor

 

 

Pipeline

(1) Sequential usage: the steps parameter defines the processing flow as a list of ('name', estimator) tuples, where 'name' is a label you choose for the step and estimator is the corresponding class instance. All of the first n-1 steps must implement a transform method; the last step need not, and is usually a model. Using the simple iris dataset as an example:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pipe = Pipeline(steps=[('pca', PCA()), ('svc', SVC())])
pipe.fit(iris.data, iris.target)
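Once fitted, the whole pipeline behaves like a single estimator, and individual steps can be reached through `named_steps`. A self-contained continuation of the example (the `n_components=2` choice is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline(steps=[('pca', PCA(n_components=2)), ('svc', SVC())])
pipe.fit(iris.data, iris.target)

# predict runs every step's transform, then the final model's predict
print(pipe.predict(iris.data[:3]))

# each step is accessible by its name
print(pipe.named_steps['pca'].explained_variance_ratio_)
```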

 

(2) Via the make_pipeline function: a simplified wrapper around the Pipeline class. Just pass in each step's class instance; no names are needed, and the lowercased class name is automatically used as the step name:

from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
pipe = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))

 

You can also reset the parameters passed to each step's class via set_params, using the form set_params(stepname__param=value):

pipe.set_params(lasso__alpha=0.0001) # change alpha from 0.0005 to 0.0001
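The same `stepname__param` naming is used by `get_params`, which is a handy way to confirm the change took effect (and the same double-underscore keys are what GridSearchCV param grids use). A minimal sketch:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
pipe.set_params(lasso__alpha=0.0001)

# get_params exposes the nested parameter under the same step__param key
print(pipe.get_params()['lasso__alpha'])  # 0.0001
```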

 

 

Feature selection (feature_selection)

# Variance-threshold filtering
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1) # keep only features with variance > 0.1
selector.fit_transform(features)

# Recursive Feature Elimination (RFE)
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=4, step=1)  # n_features_to_select sets the number of features to keep
rfe.fit(X, y)

# Select features by model-based importance
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X, y)
model = SelectFromModel(ridge, prefit=True, threshold='mean')
X_transformed = model.transform(X)
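The snippets above leave `features`, `X`, and `y` undefined; tying them together on the iris dataset gives a runnable end-to-end sketch (the thresholds and feature counts are illustrative choices, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectFromModel, VarianceThreshold
from sklearn.linear_model import LinearRegression, Ridge

iris = load_iris()
X, y = iris.data, iris.target  # 150 samples, 4 features

# variance threshold: drop near-constant features (unsupervised, needs no y)
X_var = VarianceThreshold(threshold=0.1).fit_transform(X)

# RFE: repeatedly drop the feature the linear model weights least
X_rfe = RFE(estimator=LinearRegression(),
            n_features_to_select=2, step=1).fit_transform(X, y)

# SelectFromModel: keep features whose |coef_| exceeds the mean importance
ridge = Ridge().fit(X, y)
X_sfm = SelectFromModel(ridge, prefit=True, threshold='mean').transform(X)

print(X_var.shape, X_rfe.shape, X_sfm.shape)
```

All four iris features have variance above 0.1, so the first selector keeps everything, while RFE is forced down to exactly 2 columns.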

 

posted @ 2021-07-21 14:12 Raylan