Scikit-learn: expanded custom one-hot encoded matrix (not built from the dataset)

I am trying to build a one-hot encoded matrix that also represents categories that do not appear in my sample.

If I use the following code:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

s = np.array(['man', 'man', 'woman', 'woman', 'son', 'son', 'son', 'son', 'son'])
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(s)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
Y = onehot_encoder.fit_transform(integer_encoded)
print(Y)

the result is:

[[1. 0. 0.]
[1. 0. 0.]
[0. 0. 1.]
[0. 0. 1.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]
[0. 1. 0.]]

In reality, though, I have the following categories. Some of them do not occur in my dataset, but I need to account for them:

categories = np.array(['man', 'woman', 'son', 'daughter', 'boy', 'girl', 'king', 'queen', 'baby', 'child'])

So what I need is:

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]

So I am trying to figure out how to use OneHotEncoder(sparse=False, categories=categories) in this code:

categories = np.array(['man', 'woman', 'son', 'daughter', 'boy', 'girl', 'king', 'queen', 'baby', 'child'])
s = np.array(['man', 'man', 'woman', 'woman', 'son', 'son', 'son', 'son', 'son'])
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(s)
onehot_encoder = OneHotEncoder(sparse=False, categories=categories)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
Y = onehot_encoder.fit_transform(integer_encoded)
print(Y)

But it gives the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

If I change:

integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
to
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1).all()

I get the following error:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Can anyone help me solve this?

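The problem here is the categories argument of OneHotEncoder.
Your categories variable is an ndarray, and that is what raises the ValueError.
Try passing a regular sorted list instead, wrapped in an outer list (one inner list per feature column).
Also, you don't need LabelEncoder in this case.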

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One inner list per feature; here there is a single feature column.
categories = [sorted(['man', 'woman', 'son',
                      'daughter', 'boy', 'girl',
                      'king', 'queen', 'baby', 'child'])]
print(f'sorted categories: {categories}')

# OneHotEncoder expects a 2D array of shape (n_samples, n_features).
s = np.array(['man', 'man', 'woman', 'woman',
              'son', 'son', 'son', 'son', 'son']).reshape(-1, 1)

# Note: scikit-learn 1.2 renamed the sparse parameter to sparse_output.
onehot_encoder = OneHotEncoder(sparse=False, categories=categories)
Y = onehot_encoder.fit_transform(s)
print(Y)

sorted categories: [['baby', 'boy', 'child', 'daughter', 'girl', 'king', 'man', 'queen', 'son', 'woman']]
[[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]
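
If you would rather keep the columns in the order you listed the categories (man first, then woman, son, and so on) instead of alphabetical order, something like the following may work. This is only a sketch: it assumes that string categories do not have to be sorted (as far as I know the sorting requirement applies to numeric categories), and it leaves the encoder at its default sparse output and calls .toarray() so that neither sparse nor sparse_output has to be passed. The names encoder and Y are just for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Full category list in the desired column order (not sorted).
categories = ['man', 'woman', 'son', 'daughter', 'boy',
              'girl', 'king', 'queen', 'baby', 'child']

# Raw string labels as a single feature column; no LabelEncoder step.
s = np.array(['man', 'man', 'woman', 'woman',
              'son', 'son', 'son', 'son', 'son']).reshape(-1, 1)

# categories must be a list with one entry per feature, hence the outer list.
encoder = OneHotEncoder(categories=[categories])

# The default output is a SciPy sparse matrix; convert it to a dense array.
Y = encoder.fit_transform(s).toarray()

print(Y.shape)           # (9, 10)
print(Y.argmax(axis=1))  # [0 0 1 1 2 2 2 2 2]
```

Each row has a single 1 in the column of its category, and the seven categories that never occur in s still get their own all-zero columns.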
