Image arrays for classification in an sklearn pipeline - ValueError: setting an array element with a sequence



I have images that I want to classify as A or B. To do this, I load them and resize them to 160x160, then convert each 2D array to 1D and add them to a pandas DataFrame:

from pandas import DataFrame
from scipy.misc import imread, imresize

rows = []
index = []  # one index entry per row; must exist before index.append() below
for product in products:
    try:
        relevant = product.categoryrelevant.all()[0].relevant
    except IndexError:
        relevant = False
    if relevant:
        relevant = "A"
    else:
        relevant = "B"
    # this exists for all pictures
    image_array = imread("{}/{}".format(MEDIA_ROOT, product.picture_file.url))
    image_array = imresize(image_array, (160, 160))
    image_array = image_array.reshape(-1)
    print(image_array)
    # [254 254 252 ..., 255 255 253]
    print(image_array.shape)
    # (76800,)
    rows.append({"id": product.pk, "image": image_array, "class": relevant})
    index.append(product)

df = DataFrame(rows, index=index)

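Later I want to use more than just the image for classification (for example, the product description), so I am using a Pipeline with a FeatureUnion (even though it only contains the image for now). The ItemSelector is taken from here: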

http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html

It takes the values in the "image" column. Alternatively, I could do train_X = df.iloc[train_indices]["image"].values, but I want to add other columns later.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

def randomforest_image_pipeline():
    """Returns a RandomForest pipeline."""
    return Pipeline([
        ("union", FeatureUnion(
            transformer_list=[
                ("image", Pipeline([
                    ("selector", ItemSelector(key="image")),
                ]))
            ],
            transformer_weights={
                "image": 1.0
            },
        )),
        ("classifier", RandomForestClassifier()),
    ])

Then I classify with KFold:

from sklearn.model_selection import KFold

def kfold(tested_pipeline=None, df=None, splits=6):
    k_fold = KFold(n_splits=splits)
    for train_indices, test_indices in k_fold.split(df):
        # training set
        train_X = df.iloc[train_indices]
        train_y = df.iloc[train_indices]['class'].values
        # test set
        test_X = df.iloc[test_indices]
        test_y = df.iloc[test_indices]['class'].values
        for val in train_X["image"]:
            print(len(val), val.dtype, val.shape)
            # 76800 uint8 (76800,) for all
        tested_pipeline.fit(train_X, train_y)  # crashes in this call
        pipeline_predictions = tested_pipeline.predict(test_X)
        ...

kfold(tested_pipeline=randomforest_image_pipeline(), df=df)

But the call to .fit gives me the following error:

Traceback (most recent call last):
  File "<path>/project/classifier/classify.py", line 362, in <module>
    best = best_pipeline(dataframe=data, f1_scores=f1_dict, get_fp=True)
  File "<path>/project/classifier/classify.py", line 351, in best_pipeline
    confusion_list=confusion_list, get_fp=get_fp)
  File "<path>/project/classifier/classify.py", line 65, in kfold
    tested_pipeline.fit(train_X, train_y)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/ensemble/forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

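I found that other people had the same problem, and for them the cause was that their rows had different lengths. That does not seem to be the case for me, since all rows are one-dimensional with length 76800: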

for val in train_X["image"]:
    print(len(val), val.dtype, val.shape)
    # 76800 uint8 (76800,) for all

The array in the line that crashes looks like this (copied from the debugger):

[array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
array([ 255.,  255.,  255., ...,  255.,  255.,  255.]) ...,
array([ 255.,  255.,  255., ...,  255.,  255.,  255.])
array([ 255.,  255.,  255.

What can I do to fix this?

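The error occurs because you store all of the data for an image (i.e. its 76800 features) in one list, and that list is saved into a single column of the DataFrame.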

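So when you select that column with ItemSelector, the output is a one-dimensional array of shape (Train_len,). The inner dimension of 76800 is not visible to the FeatureUnion or to the estimators that follow it.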
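The effect can be reproduced without any images. The following is a minimal sketch, assuming a toy DataFrame with 3 rows of 4 "pixels" each (hypothetical stand-in data, not the question's real images). It shows that the column is an object array of shape (3,), that converting it to a float array fails in the same way as the `np.array(...)` call inside `check_array`, and that stacking the per-row arrays gives a proper 2-D matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 "images" of 4 pixels each
# instead of 76800, stored one array per cell as in the question.
rows = [
    {"id": i, "image": np.full(4, 255, dtype=np.uint8), "class": "A"}
    for i in range(3)
]
df = pd.DataFrame(rows)

# Selecting the column gives a 1-D array whose elements are arrays.
column = df["image"].values
print(column.dtype, column.shape)  # object (3,)

# Converting that object array to a numeric dtype fails, because each
# element is itself a sequence rather than a single number.
try:
    np.array(column, dtype=np.float32)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("ValueError:", exc)

# Stacking the per-row arrays yields the 2-D matrix an estimator expects.
stacked = np.stack(column)
print(stacked.dtype, stacked.shape)  # uint8 (3, 4)
```

So the equal row lengths are not the issue here: the rows are simply never combined into one 2-D array before they reach the classifier.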

Change the transform() function of ItemSelector so that it returns a proper two-dimensional data array of shape (Train_len, 76800). Only then will it work.

Change it to:

def transform(self, data_dict):
    return np.array([np.array(x) for x in data_dict[self.key]])
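For context, here is a hedged end-to-end sketch of how that method could sit inside a complete selector class and the question's pipeline. The class body is an approximation of the ItemSelector from the linked scikit-learn example, not a verbatim copy. The `__sklearn_is_fitted__` hook, the random 16-pixel "images", and the small `n_estimators` value are assumptions added only so that the sketch runs quickly and on recent scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Select one DataFrame column and return it as a 2-D array."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def __sklearn_is_fitted__(self):
        # Assumption: tells newer scikit-learn versions that this
        # stateless transformer never needs fitting.
        return True

    def transform(self, data_dict):
        # One row per sample, one column per pixel.
        return np.array([np.array(x) for x in data_dict[self.key]])


# Synthetic stand-in data (assumption): 12 random "images" of 16 pixels.
rng = np.random.RandomState(0)
rows = [
    {"image": rng.randint(0, 256, size=16).astype(np.uint8),
     "class": "A" if i % 2 == 0 else "B"}
    for i in range(12)
]
df = pd.DataFrame(rows)

pipeline = Pipeline([
    ("union", FeatureUnion(
        transformer_list=[
            ("image", Pipeline([("selector", ItemSelector(key="image"))])),
        ],
        transformer_weights={"image": 1.0},
    )),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# The selector now emits shape (n_samples, n_pixels) instead of (n_samples,).
features = ItemSelector(key="image").transform(df)
print(features.shape)  # (12, 16)

pipeline.fit(df, df["class"].values)
predictions = pipeline.predict(df)
print(predictions.shape)  # (12,)
```

Because the selector still takes the whole DataFrame and picks a column by key, further branches (for example one for the product description) can be added to the FeatureUnion later without changing this class.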

Feel free to ask if anything is unclear.
