Sklearn Pipeline: Get feature names after OneHotEncode in ColumnTransformer



I want to get the feature names after I fit the pipeline.

categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Then

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', GradientBoostingRegressor())])

After fitting it with a pandas DataFrame, I can get the feature importances from

clf.steps[1][1].feature_importances_

I tried clf.steps[0][1].get_feature_names(), but I got an error:

AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.

How can I get the feature names from this?

You can access the feature names using the following snippet:

# get_feature_names was replaced by get_feature_names_out in scikit-learn 1.0 and removed in 1.2
(clf.named_steps['preprocessor'].transformers_[1][1]
 .named_steps['onehot'].get_feature_names(categorical_features))

With sklearn >= 0.21, we can make it simpler:

(clf['preprocessor'].transformers_[1][1]
 ['onehot'].get_feature_names(categorical_features))

Reproducible example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})

numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

# Drop the target by keyword; the positional form df.drop('target', 1) fails on recent pandas.
clf.fit(df.drop(columns='target'), df['target'])

# On scikit-learn >= 1.0 the method is get_feature_names_out (get_feature_names was removed in 1.2).
(clf.named_steps['preprocessor'].transformers_[1][1]
 .named_steps['onehot'].get_feature_names(categorical_features))
# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
#  'category_asdf' 'category_asdfas' 'category_asfa']

Scikit-Learn 1.0 now has new functionality to keep track of feature names.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                       self.feature_names_in_)

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
    (num_pipeline, ["age", "height"]),
    (OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())

df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
                   "age": [32, 65, 18, 24],
                   "height": [172, 163, 169, 190],
                   "weight": [65, 62, 54, 95]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])

pipeline.fit(df, df["weight"])

## get pipeline feature names
pipeline[:-1].get_feature_names_out()

## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
             columns=pipeline[:-1].get_feature_names_out(),
             index=df.index)

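EDIT: actually, the answer to Peter's comment is in the ColumnTransformer documentation:

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.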


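To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names returned by ColumnTransformer.get_feature_names() depends on the order in which the steps (the transformers list) are declared when the ColumnTransformer is instantiated.

I could not find any documentation, so I just played with the toy example below, which let me understand the logic.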

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

# Note: get_feature_names is the pre-1.0 API (removed in scikit-learn 1.2).
class testEstimator(BaseEstimator, TransformerMixin):
    def __init__(self, string):
        self.string = string

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Output a single column filled with the transformer's label.
        return np.full(X.shape, self.string).reshape(-1, 1)

    def get_feature_names(self):
        return self.string

transformers = [('first_transformer', testEstimator('A'), 1),
                ('second_transformer', testEstimator('B'), 0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler', RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)

dt_test = np.zeros((1000, 2))
pipeline.fit_transform(dt_test)

for name, step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())

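To have a more representative example, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. By the way, the loop above is my version of Venkatachalam's way of getting the feature names by iterating over the steps. You can turn it into a slightly more usable variable by unpacking the names with a list comprehension:

[i for k, v in pipeline.named_steps.items() if hasattr(v, 'get_feature_names') for i in v.get_feature_names()]

So play around with dt_test and the estimators to get a feel for how the feature names are built and how they are concatenated in get_feature_names(). Here is another example, with a transformer that outputs 2 columns from the input column: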

class testEstimator3(BaseEstimator, TransformerMixin):
    def __init__(self, string):
        self.string = string

    def fit(self, X, y=None):
        # Remember the first unique value seen in the input column.
        self.unique = np.unique(X)[0]
        return self

    def transform(self, X):
        # Output the input column plus a column filled with the label.
        return np.concatenate((X.reshape(-1, 1),
                               np.full(X.shape, self.string).reshape(-1, 1)), axis=1)

    def get_feature_names(self):
        return list((self.unique, self.string))

dt_test2 = np.concatenate((np.full((1000, 1), 'A'), np.full((1000, 1), 'B')), axis=1)
transformers = [('first_transformer', testEstimator3('A'), 1),
                ('second_transformer', testEstimator3('B'), 0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)
pipeline.fit_transform(dt_test2)

for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())

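If you are looking for how to access the column names after successive pipelines, with the last one being a ColumnTransformer, you can access them by following this example:

The full_pipeline contains two pipelines, gender and relevent_experience: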

full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
    ("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])

The gender pipeline looks like this:

gender_encoder = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ("cat", OneHotEncoder())
])

After fitting full_pipeline, you can access the column names with the following snippet:

full_pipeline.transformers_[0][1][1].get_feature_names_out() 

In my case the output was: array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)

You are already very close to accomplishing this. After you build your pipeline:

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', DecisionTreeRegressor())])

Fit clf on the features and target variables, as follows:

clf.fit(features, target)

Then you should be able to access the feature names of the OneHotEncoder:

clf.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out()
