Using sklearn.compose.ColumnTransformer
class sklearn.compose.ColumnTransformer(transformers, *, remainder='drop', sparse_threshold=0.3, n_jobs=None, transformer_weights=None, verbose=False, verbose_feature_names_out=True)
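That looks like a lot of parameters and it may seem hard to learn, but basic usage is quite simple.
After all, only the first parameter in the signature, transformers, is required. Everything after the * is keyword-only and has a default value, so you can leave it out.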
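To make that concrete, here is a minimal sketch that passes only transformers and leaves every other parameter at its default. The toy array and the name "scale" are made up for illustration. Note that the default remainder='drop' means any column you do not list is dropped from the output.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data: 3 rows, 3 columns (values invented for this sketch)
X = np.array([[1.0, 10.0, 100.0],
              [2.0, 20.0, 200.0],
              [3.0, 30.0, 300.0]])

# Only `transformers` is given; all other parameters keep their defaults.
ct = ColumnTransformer([("scale", StandardScaler(), [0, 1])])
Xt = ct.fit_transform(X)

# Column 2 was not listed, so with remainder='drop' it is not in the output.
print(Xt.shape)
```

If you want to keep the unlisted columns instead, pass remainder='passthrough'.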
transformers:list of tuples
List of (name, transformer, columns) tuples specifying the transformer objects to be applied to subsets of the data.
In other words, transformers is a list of tuples, and each tuple has three parts: (name, transformer, columns).
- name:str:Like in Pipeline and FeatureUnion, this allows the transformer and its parameters to be set using set_params and searched in grid search.
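Put simply: name is a label you pick yourself. It matters because nested parameters are addressed as <name>__<parameter>, for example in set_params or in a grid-search parameter grid. For basic use, any descriptive name will do.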
- transformer:{‘drop’, ‘passthrough’} or estimator:
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.
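Put simply: this is the object that does the actual work. It can be any estimator with fit and transform (an imputer, a scaler, an encoder, or a whole Pipeline). The two strings are shortcuts: 'drop' discards the selected columns, and 'passthrough' copies them to the output unchanged.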
- columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable:
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector
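Put simply: the columns your transformer should process. With a DataFrame this is usually a list of column names. Integer positions, a boolean mask, a slice, or a callable also work.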
Here is an example.
Reference: https://www.kaggle.com/code/alexisbcook/pipelines
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# numerical_cols and categorical_cols are not defined in this snippet.
# They are assumed to be lists of column names from your own DataFrame.

# Preprocessing for numerical data
# (strategy='constant' with the default fill_value fills missing numbers with 0)
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data: impute first, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])