I am new to machine learning and I have been working with unsupervised learning techniques.
The image shows a screenshot of my sample data (after all the cleaning): sample data
I have these two pipelines to clean the data:
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
print(type(num_attribs))
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer())
])
Then I made a union of these two pipelines; the code for the same is shown below:
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
Now I am trying to do a fit_transform on the data, but it is showing an error.
Code for transformation:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Error message:
fit_transform() takes 2 positional arguments but 3 were given
The Problem:
The pipeline assumes LabelBinarizer's fit_transform method is defined to take three positional arguments:
def fit_transform(self, x, y)
    ...rest of the code
while it is defined to take only two:
def fit_transform(self, x):
    ...rest of the code
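The mismatch can be reproduced without scikit-learn. The sketch below (class and variable names are made up for illustration) mimics what Pipeline/FeatureUnion does when it forwards both X and y to a step whose fit_transform only accepts X:

```python
# Illustrative sketch, not from the book: Pipeline's fit_transform forwards
# both X and y to every step, so a transformer whose fit_transform accepts
# only X raises the TypeError shown above.

class TwoArgTransformer:
    """Mimics LabelBinarizer.fit_transform(self, X) -- no y parameter."""
    def fit_transform(self, X):
        return X

X = [["INLAND"], ["NEAR BAY"]]
y = [0, 1]

try:
    # This is effectively what Pipeline/FeatureUnion does internally.
    TwoArgTransformer().fit_transform(X, y)
except TypeError as err:
    print(err)  # ...fit_transform() takes 2 positional arguments but 3 were given
```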
Possible Solution:
This can be solved by making a custom transformer that can handle three positional arguments.
Import and make the new class:
from sklearn.base import TransformerMixin  # gives fit_transform method for free

class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)
Keep your code the same; instead of using LabelBinarizer(), use the class we created: MyLabelBinarizer().
Note: If you want access to LabelBinarizer attributes (e.g. classes_), add the following line to the fit method:
self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_
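Putting the class and the note together, a runnable sketch of the wrapper (the sample labels are illustrative; the attribute-mirroring line is the one quoted above):

```python
# Sketch of MyLabelBinarizer whose fit also mirrors the wrapped encoder's
# fitted attributes (classes_, y_type_, sparse_input_) onto the wrapper.
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer

class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        # mirror the wrapped LabelBinarizer's attributes on the wrapper
        self.classes_, self.y_type_, self.sparse_input_ = (
            self.encoder.classes_, self.encoder.y_type_,
            self.encoder.sparse_input_)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)

mlb = MyLabelBinarizer().fit(['a', 'b', 'a'])
print(mlb.classes_)             # ['a' 'b']
print(mlb.transform(['b', 'a']))
```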
I believe your example is from the book Hands-On Machine Learning with Scikit-Learn & TensorFlow. Unfortunately, I ran into this problem as well. A recent change in scikit-learn (0.19.0) changed LabelBinarizer's fit_transform method. Unfortunately, LabelBinarizer was never intended to work the way that example uses it. You can see information about the change here and here.
Until they come up with a solution for this, you can install the previous version (0.18.0) as follows:
$ pip install scikit-learn==0.18.0
After running that, your code should run without issue.
In the future, it looks like the correct solution may be to use a CategoricalEncoder class or something similar. They have apparently been trying to solve this problem. You can see the new class here and further discussion of the problem here.
I think you are going through the examples from the book: Hands-On Machine Learning with Scikit-Learn and TensorFlow. I ran into the same problem when going through the example in Chapter 2.
As other people have mentioned, the problem is with sklearn's LabelBinarizer. Its fit_transform method takes fewer args compared to other transformers in the pipeline (only X, while other transformers generally take both X and y; see here for details). That's why, when we run pipeline.fit_transform, we feed more args into this transformer than it expects.
An easy fix I used is to just use OneHotEncoder and set "sparse" to False, to ensure the output is the same as the num_pipeline output. (That way you don't need to code up your own custom encoder.)
Your original cat_pipeline:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer())
])
You can simply change this part to:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('one_hot_encoder', OneHotEncoder(sparse=False))
])
You can go from here and everything should work.
Since LabelBinarizer doesn't allow more than two positional arguments, you should create your own custom binarizer:
class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)
num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy='median')),
('attribs_adder', CombinedAttributesAdder()),
('std_scalar', StandardScaler())
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', CustomLabelBinarizer())
])
full_pipeline = FeatureUnion(transformer_list=[
('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline)
])
housing_prepared = full_pipeline.fit_transform(new_housing)
I ran into the same problem and got it working by applying the workaround specified in the book's GitHub repo.
Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).
To save you some grepping, here's the workaround; just paste and run it in a previous cell:
# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, do not try to understand it (yet).
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.encoding)
        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)
        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")
        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape
        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]
        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))
        self.categories_ = [le.classes_ for le in self._label_encoders_]
        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)
        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])
            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])
        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)
        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)
        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]
        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out
Simply, what you can do is define the following class just before your pipeline:
class NewLabelBinarizer(LabelBinarizer):
    def fit(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X)
    def transform(self, X, y=None):
        return super(NewLabelBinarizer, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(NewLabelBinarizer, self).fit(X).transform(X)
Then the rest of the code is just like what is mentioned in the book, with a tiny modification in cat_pipeline before the pipeline concatenation: replace LabelBinarizer with NewLabelBinarizer, as follows:
cat_pipeline = Pipeline([
("selector", DataFrameSelector(cat_attribs)),
("label_binarizer", NewLabelBinarizer())])
And you are done!
Forget LabelBinarizer and use OneHotEncoder instead.
If you use a LabelEncoder before OneHotEncoder to convert the categories to integers, you can then use the OneHotEncoder directly.
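A short sketch of this two-step approach (the sample category values are illustrative): LabelEncoder maps the text categories to integers, and OneHotEncoder then one-hot encodes that integer column.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

housing_cat = np.array(['INLAND', 'NEAR BAY', 'INLAND', 'NEAR OCEAN'])

# Step 1: text categories -> integers (classes are sorted alphabetically)
housing_cat_encoded = LabelEncoder().fit_transform(housing_cat)
print(housing_cat_encoded)  # [0 1 0 2]

# Step 2: integers -> one-hot columns; .toarray() works whether the
# encoder returns a sparse matrix or a dense array
housing_cat_1hot = OneHotEncoder().fit_transform(
    housing_cat_encoded.reshape(-1, 1)).toarray()
print(housing_cat_1hot)
```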
I also faced the same issue. The following link helped me fix it: https://github.com/ageron/handson-ml/issues/75
To summarize, the changes to be made are:
1) Define the following class in your notebook
class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(SupervisionFriendlyLabelBinarizer, self).fit_transform(X)
2) Modify the following piece of code
cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', SupervisionFriendlyLabelBinarizer()),])
3) Re-run the notebook. You will be able to run it now.
I got the same issue, and it got resolved by using DataFrameMapper (you need to install sklearn_pandas):
from sklearn_pandas import DataFrameMapper
cat_pipeline = Pipeline([
('label_binarizer', DataFrameMapper([(cat_attribs, LabelBinarizer())])),
])
You can create one more custom transformer which does the encoding for you.
class CustomLabelEncode(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return LabelEncoder().fit_transform(X)
In this example we have done label encoding, but you can use LabelBinarizer as well.
The LabelBinarizer class used in this example was, unfortunately, never meant to be used the way the book uses it. You need to use the OrdinalEncoder class from sklearn.preprocessing, which is designed to "Encode categorical features as an integer array." (sklearn documentation).
So, just add:
from sklearn.preprocessing import OrdinalEncoder
and then replace all mentions of LabelBinarizer() in your code with OrdinalEncoder().
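A brief usage sketch (this assumes scikit-learn >= 0.20, where OrdinalEncoder was introduced; the category values below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

housing_cat = np.array([['INLAND'], ['NEAR BAY'], ['INLAND']])

encoder = OrdinalEncoder()
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)  # one float code per category, e.g. [[0.] [1.] [0.]]
print(encoder.categories_)  # the learned categories, sorted per column
```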
I have seen many custom label binarizers; this one worked for me:
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        """This allows us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self
    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
Then edit your cat_pipeline to this:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizerPipelineFriendly()),
])
Have a good one!
I ended up rolling my own:
class LabelBinarizer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = self.prep(X)
        unique_vals = []
        for column in X.T:
            unique_vals.append(np.unique(column))
        self.unique_vals = unique_vals
        return self
    def transform(self, X, y=None):
        X = self.prep(X)
        unique_vals = self.unique_vals
        new_columns = []
        for i, column in enumerate(X.T):
            num_uniq_vals = len(unique_vals[i])
            encoder_ring = dict(zip(unique_vals[i], range(len(unique_vals[i]))))
            f = lambda val: encoder_ring[val]
            f = np.vectorize(f, otypes=[np.int])
            new_column = np.array([f(column)])
            if num_uniq_vals <= 2:
                new_columns.append(new_column)
            else:
                one_hots = np.zeros([num_uniq_vals, len(column)], np.int)
                one_hots[new_column, range(len(column))] = 1
                new_columns.append(one_hots)
        new_columns = np.concatenate(new_columns, axis=0).T
        return new_columns
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
    @staticmethod
    def prep(X):
        shape = X.shape
        if len(shape) == 1:
            X = X.values.reshape(shape[0], 1)
        return X
It seems to work:
lbn = LabelBinarizer()
thingy = np.array([['male','male','female', 'male'], ['A', 'B', 'A', 'C']]).T
lbn.fit(thingy)
lbn.transform(thingy)
which returns
array([[1, 1, 0, 0],
[1, 0, 1, 0],
[0, 1, 0, 0],
[1, 0, 0, 1]])
The simplest way is to replace LabelBinarizer inside your pipeline with OrdinalEncoder().
def _binarize(series: pd.Series) -> pd.Series:
    return series.astype(int)
binary_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy="most_frequent")),
('binary_encoder', FunctionTransformer(_binarize))
])
You can use this modified LabelBinarizer class in your code:
class mod_LabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X)
Now you can use mod_LabelBinarizer() instead of LabelBinarizer() in your cat_pipeline, so your code should look like this:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', mod_LabelBinarizer())
])
We can just add the attribute sparse_output=False:
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer(sparse_output=False)),
])