Using sklearn.compose.ColumnTransformer

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer

class sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)

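The signature lists many parameters, which can make the class look hard to learn, but basic usage is quite simple.
After all, only the first parameter, transformers, is required. All the others have default values and can be left out.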
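As a minimal sketch of what the defaults mean in practice: with only transformers given, columns that no transformer selects are dropped (remainder='drop'), while remainder='passthrough' keeps them. The toy array and the choice of MinMaxScaler are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for this sketch: two numeric columns.
X = np.array([[0.0, 5.0],
              [10.0, 7.0]])

# Only `transformers` is passed; every other parameter keeps its default.
default_ct = ColumnTransformer([("scale", MinMaxScaler(), [0])])
default_out = default_ct.fit_transform(X)  # column 1 is dropped

# Same transformer list, but unselected columns are kept as they are.
keep_ct = ColumnTransformer([("scale", MinMaxScaler(), [0])],
                            remainder="passthrough")
keep_out = keep_ct.fit_transform(X)

print(default_out.shape)  # (2, 1)
print(keep_out.shape)     # (2, 2)
```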

transformers:list of tuples

List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.

In other words, transformers is a list of tuples, and each tuple has three parts: (name, transformer, columns).

  • name:str:Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.

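In practice, name is just a label you choose for each transformer. The transformer's parameters can then be addressed as <name>__<parameter>, for example in set_params or in a grid search.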
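A small sketch of what the name is used for: once a transformer is registered under a name, its parameters can be reached with the <name>__<parameter> syntax. The names "num" and "scale" and the column positions are arbitrary choices for this illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# The names "num" and "scale" are labels picked for this sketch.
ct = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), [0]),
        ("scale", StandardScaler(), [1]),
    ]
)

# "<name>__<parameter>" addresses a parameter of the named transformer.
ct.set_params(num__strategy="median")

strategy = ct.get_params()["num__strategy"]
print(strategy)  # median
```

The same key format is what a parameter grid for GridSearchCV would use.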

  • transformer:{‘drop’, ‘passthrough’} or estimator:
    Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

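Put simply, the transformer is any object that has fit and transform methods, such as SimpleImputer or OneHotEncoder. The string 'drop' discards the selected columns, and 'passthrough' keeps them unchanged.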
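The three kinds of transformer entry can be sketched side by side. The array and the transformer names below are made up for illustration. StandardScaler stands in for "any estimator with fit and transform".

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, made up for this sketch: three numeric columns.
X = np.array([[1.0, 10.0, 100.0],
              [3.0, 20.0, 200.0]])

ct = ColumnTransformer(
    transformers=[
        ("scaled", StandardScaler(), [0]),   # an estimator: fit + transform
        ("kept", "passthrough", [1]),        # column passed through untouched
        ("removed", "drop", [2]),            # column discarded
    ]
)
result = ct.fit_transform(X)
print(result)
# [[-1. 10.]
#  [ 1. 20.]]
```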

  • columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable:
    Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector

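These are the columns that the transformer should process, given for example as a list of column names or integer positions.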
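As a rough illustration of two of the accepted forms, the sketch below selects one group of columns by name and another by dtype through make_column_selector. The DataFrame and its column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy DataFrame, not taken from the referenced pages.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "height": [160.0, 170.0, 180.0],
    "city": ["A", "B", "A"],
})

ct = ColumnTransformer(
    transformers=[
        # A callable selector: picks every numeric column at fit time.
        ("num", StandardScaler(), make_column_selector(dtype_include="number")),
        # A list of column names: the transformer receives a 2d input.
        ("cat", OneHotEncoder(), ["city"]),
    ]
)
out = ct.fit_transform(df)
print(out.shape)  # (3, 4): two scaled columns plus two one-hot columns
```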

An example:
Reference: https://www.kaggle.com/code/alexisbcook/pipelines

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are lists of column names, defined elsewhere)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
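The snippet above does not define numerical_cols or categorical_cols. To try it end to end, here is a self-contained sketch on a tiny made-up DataFrame. The data, the column names and the two column lists are assumptions for illustration and do not come from the Kaggle tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in every column.
X = pd.DataFrame({
    "rooms": [3.0, np.nan, 2.0, 4.0],
    "area": [70.0, 55.0, np.nan, 120.0],
    "type": ["house", "flat", np.nan, "house"],
})
numerical_cols = ["rooms", "area"]   # assumed column lists for this sketch
categorical_cols = ["type"]

# Numerical data: fill missing values with a constant (0 by default).
numerical_transformer = SimpleImputer(strategy="constant")

# Categorical data: fill with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_cols),
        ("cat", categorical_transformer, categorical_cols),
    ])

out = preprocessor.fit_transform(X)
print(out.shape)  # (4, 4): two imputed numeric columns plus two one-hot columns
```

The fitted preprocessor can then be applied to new data with preprocessor.transform, or placed as the first step of a larger Pipeline in front of a model.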
posted @ 2022-07-21 17:17  请去看诡秘之主