LeaveOneOutEncoder in sklearn.pipeline



I built a pipeline with LeaveOneOutEncoder. This is a toy example, of course. Leave One Out is used to transform categorical variables.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from category_encoders import LeaveOneOutEncoder
from sklearn import linear_model
from sklearn.base import BaseEstimator, TransformerMixin

df = pd.DataFrame({
    'y': [1, 2, 3, 4, 5, 6, 7, 8],
    'a': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'b': [5, 5, 3, 4, 8, 6, 7, 3],
})

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(X), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self

X = df[['a', 'b']]
y = df['y']
regressor = linear_model.SGDRegressor()
pipeline = Pipeline([
    # Use FeatureUnion to combine the features
    ('union', FeatureUnion(
        transformer_list=[
            # categorical
            ('categorical', Pipeline([
                ('selector', ItemSelector(key='a')),
                ('one_hot', MyLEncoder())
            ])),
            # year
        ])),
    # Use a regression
    ('model_fitting', linear_model.SGDRegressor()),
])
pipeline.fit(X, y)
pipeline.predict(X)

This all works fine on the training and test data. But when I try to predict on new data, I get an error:

pipeline.predict(pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ], 'b': [3, 6],}))

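Please help me find the mistake. It must be something simple, but I can no longer see it. The problem must be in the MyLEncoder class. What do I have to change?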

You are calling

encc = enc.fit(np.asarray(X), y)

inside the transform() method of MyLEncoder.

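So there are several problems here:

1) Your LeaveOneOutEncoder only remembers the last data passed to MyLEncoder's transform() and forgets everything it saw before.

2) LeaveOneOutEncoder needs y to be present when it is fitted. But y is not available at prediction time, which is when MyLEncoder's transform() gets called.

3) Currently your line: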

pipeline.predict(X)

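only works by luck: your X is the same as the training X, and y is already defined as a global variable when MyLEncoder's transform() is called, so that y gets used. But this is wrong.

4) An unrelated thing (you may not call it an error). When you do this: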

pipeline.predict(pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ], 'b': [3, 6],}))

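pipeline.predict() only requires X, not y. But you are sending y in it as well. Currently this is not a problem, because the pipeline uses only the a column and discards everything else. In a more complex setup, however, this may slip through, and the data in the y column would be used as features (X data), which would give you wrong results.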
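To make points 1 and 2 concrete, here is a minimal pure-Python sketch of the idea behind leave-one-out target encoding. It is not the category_encoders implementation: the class name TinyLeaveOneOut, the global-mean fallback for unseen categories, and the exact handling of single-row categories are assumptions made for illustration. What it shows is that y is consumed once, in fit(), and that transform() on new data only reads the statistics stored there.

```python
from collections import defaultdict


class TinyLeaveOneOut:
    """Illustrative leave-one-out target encoder (hypothetical, simplified)."""

    def fit(self, categories, targets):
        # The only step that needs y: store per-category target sums and counts.
        self.sums_ = defaultdict(float)
        self.counts_ = defaultdict(int)
        for cat, target in zip(categories, targets):
            self.sums_[cat] += target
            self.counts_[cat] += 1
        self.global_mean_ = sum(targets) / len(targets)
        return self

    def transform(self, categories, targets=None):
        encoded = []
        for i, cat in enumerate(categories):
            count = self.counts_.get(cat, 0)
            if count == 0:
                # Unseen category: fall back to the global mean (assumption).
                encoded.append(self.global_mean_)
            elif targets is None:
                # New data, no y available: use the stored category mean.
                encoded.append(self.sums_[cat] / count)
            elif count > 1:
                # Training rows: leave the row's own target out of the mean.
                encoded.append((self.sums_[cat] - targets[i]) / (count - 1))
            else:
                encoded.append(self.global_mean_)
        return encoded


cats = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']
ys = [1, 2, 3, 4, 5, 6, 7, 8]

enc = TinyLeaveOneOut().fit(cats, ys)
new_rows = enc.transform(['a', 'b'])      # no y needed here
train_rows = enc.transform(cats, ys)      # leave-one-out on training rows
print(new_rows)
print(train_rows[:2])
```

If fit() were called again inside transform(), the stored sums and counts would be rebuilt from whatever data happened to be passed in, which is exactly the situation described in point 1.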

To fix this, change MyLEncoder to:

class MyLEncoder(BaseEstimator, TransformerMixin):
    # Save the enc during fitting
    def fit(self, X, y, **fit_params):
        enc = LeaveOneOutEncoder()
        self.enc = enc.fit(np.asarray(X), y)
        return self

    # Here, no new learning should be done, so never call fit() inside this
    # Only use the already saved enc here
    def transform(self, X, **fit_params):
        enc_data = self.enc.transform(np.asarray(X))
        return enc_data

    # No need to define this function, if you are not doing any optimisation in it.
    # It will be automatically inherited from TransformerMixin
    # I have only kept it here, because you kept it.
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
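The reason transform() must not rely on y can be seen from the order in which a pipeline calls its steps. The following is a deliberately simplified, hypothetical stand-in (ToyPipeline, RecordingStep and MeanModel are made-up names, not sklearn classes) that only mimics the call order: transformers are fitted with y during fit(), while predict() calls transform() alone.

```python
class RecordingStep:
    """Hypothetical transformer that logs which methods are called on it."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y=None):
        self.calls.append("fit with y" if y is not None else "fit without y")
        return self

    def transform(self, X):
        self.calls.append("transform")
        return X


class MeanModel:
    """Hypothetical final estimator that always predicts the training mean of y."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ToyPipeline:
    """Simplified stand-in for a pipeline with one transformer and one model."""

    def __init__(self, step, model):
        self.step = step
        self.model = model

    def fit(self, X, y):
        # Training: the transformer is fitted with y, then asked to transform.
        Xt = self.step.fit(X, y).transform(X)
        self.model.fit(Xt, y)
        return self

    def predict(self, X):
        # Prediction: only transform() is called, and no y is passed anywhere.
        return self.model.predict(self.step.transform(X))


step = RecordingStep()
toy = ToyPipeline(step, MeanModel()).fit(['a', 'b', 'a', 'b'], [1, 2, 3, 4])
predictions = toy.predict(['a', 'b'])
print(step.calls)
```

Anything the encoder needs at prediction time therefore has to be stored on the instance during fit(), as the corrected MyLEncoder does with self.enc.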

Now, when you do this:

pipeline.predict(pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ], 'b': [3, 6],}))

you will not get any error. Still, as mentioned in point 4, I would recommend doing something like this:

new_df = pd.DataFrame({ 'y': [3, 8], 'a': ['a', 'b' ], 'b': [3, 6],})
new_X = new_df[['a', 'b']]
new_y = new_df['y']
pipeline.predict(new_X)

This way, the X used at training time and the new_X used at prediction time have the same structure.

I did it as follows:

lb = df['a']

class MyLEncoder(BaseEstimator, TransformerMixin):
    def transform(self, X, **fit_params):
        enc = LeaveOneOutEncoder()
        encc = enc.fit(np.asarray(lb), y)
        enc_data = encc.transform(np.asarray(X))
        return enc_data

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y, **fit_params):
        return self

So in the line encc = enc.fit(np.asarray(lb), y) I replaced X with lb.
