scikit-learn: ColumnTransformer and OneHotEncoder - how to error out for all new categorical levels in all fields



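I'm trying to use scikit-learn's ColumnTransformer class both as an actual DataFrame transformer and as a "watchdog" transformer, i.e. an object that alerts me when new classes show up in the categorical features of my dataset.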

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Original DataFrame off of which transformers are fit
orig_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'a'],
        'b': ([np.nan] * 3) + ['a', 'a'],
        'c': np.random.randn(5)
    }
)

# New DataFrame that will be transformed using already fitted transformer
new_df = pd.DataFrame(
    {
        'a': [np.nan, 'a', 'b', 'b', 'c'],
        'b': ([np.nan] * 4) + ['b'],
        'c': np.random.randn(5)
    }
)

# Cast NaNs to str to play nicely with OneHotEncoder
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

# Create master transformer for each of the three columns a, b, and c
# (note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`)
transformer_config = [
    ('a', OneHotEncoder(sparse=False, handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(sparse=False, handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
]
transformer = ColumnTransformer(transformer_config)

# Fit to original dataset
transformer.fit(orig_df)

# Transform new dataset
transformer.transform(new_df)

Which produces:

  File "<stdin>", line 2, in <module>
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 495, in transform
    Xs = self._fit_transform(X, None, _transform_one, fitted=True)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py", line 393, in _fit_transform
    fitted=fitted, replace_strings=True))
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/pipeline.py", line 605, in _transform_one
    res = transformer.transform(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 591, in transform
    return self._transform_new(X)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 553, in _transform_new
    X_int, X_mask = self._transform(X, handle_unknown=self.handle_unknown)
  File "/Users/user/setup/venv/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py", line 109, in _transform
    raise ValueError(msg)
ValueError: Found unknown categories ['c'] in column 0 during transform

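This raises the kind of error I generally want, but only for one column. As you can see in new_df, column b also has a new level ('b'). Is there a simple way to report all new levels for all fields that use this OneHotEncoder class, rather than only the first one that fails?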

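My first thought was to iterate over the fields one at a time and catch each ValueError, but that doesn't work well with ColumnTransformer, because the fitted transformer expects every column it was configured with: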

>>> transformer.transform(new_df[['b']])
KeyError: "None of [['a']] are in the [columns]"

Just a suggested solution for your example:

from sklearn.base import BaseEstimator

# Apply each fitted sub-transformer to its own columns and report failures
for _, t_inst, t_col in transformer.transformers_:
    try:
        # Skip string placeholders such as 'passthrough' or 'drop'
        if isinstance(t_inst, BaseEstimator):
            t_inst.transform(new_df[t_col])
    except Exception as e:
        print('During transformation of column {} the following error occurred: {}'.format(t_col, e))

Output:

During transformation of column ['a'] the following error occurred: Found unknown categories ['c'] in column 0 during transform
During transformation of column ['b'] the following error occurred: Found unknown categories ['b'] in column 0 during transform

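It simply tries to apply each transformation one by one, so an error in one column does not hide errors in the others.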

Note that the .transformers_ attribute is only available after fitting.
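
As a possible variation on the loop above, the new levels can also be collected without relying on exceptions, by comparing each column's values with the `categories_` attribute of the fitted OneHotEncoder. The sketch below is only one way to do it: the helper name `find_unknown_categories` is hypothetical (not part of scikit-learn), and it assumes each encoder was given its columns as a list of names, as in the question. The `sparse` argument is omitted here so that the snippet does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def find_unknown_categories(fitted_transformer, df):
    """Hypothetical helper: map each encoded column to its unseen levels."""
    unknown = {}
    for _, t_inst, t_cols in fitted_transformer.transformers_:
        # Only fitted OneHotEncoders carry categories_; skip everything else
        if not isinstance(t_inst, OneHotEncoder):
            continue
        # categories_ holds one array of known levels per encoded column
        for col, known in zip(t_cols, t_inst.categories_):
            unseen = set(df[col].unique()) - set(known)
            if unseen:
                unknown[col] = sorted(unseen)
    return unknown


# Same data as in the question, with NaNs cast to str
orig_df = pd.DataFrame({
    'a': [np.nan, 'a', 'b', 'b', 'a'],
    'b': ([np.nan] * 3) + ['a', 'a'],
    'c': np.random.randn(5),
})
new_df = pd.DataFrame({
    'a': [np.nan, 'a', 'b', 'b', 'c'],
    'b': ([np.nan] * 4) + ['b'],
    'c': np.random.randn(5),
})
for col in ('a', 'b'):
    orig_df[col] = orig_df[col].astype(str)
    new_df[col] = new_df[col].astype(str)

transformer = ColumnTransformer([
    ('a', OneHotEncoder(handle_unknown='error'), ['a']),
    ('b', OneHotEncoder(handle_unknown='error'), ['b']),
    ('c', 'passthrough', ['c']),
])
transformer.fit(orig_df)

print(find_unknown_categories(transformer, new_df))
# {'a': ['c'], 'b': ['b']}
```

If a single failure is preferred over a printed report, the returned dict could be checked before calling transform, raising one ValueError that lists every offending column when it is non-empty.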
