ValueError: Data cardinality is ambiguous. Make sure all arrays contain the same number of samples. Convolutional neural network



I am running this convolutional neural network model on Google Colab. My goal is text classification. Here are my code and the error:

# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])

# load all test reviews
food_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/food', vocab, False)
location_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/location', vocab, False)
price_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/price', vocab, False)
service_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/service', vocab, False)
time_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/time', vocab, False)
test_docs = food_docs + location_docs + price_docs + service_docs + time_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2) 

Here is my model summary output:

型号:";序列_1";


层(类型(输出形状参数#

embedding_1(嵌入((无,4100(415400


conv1d_1(conv1d((无,34,32(25632


最大池1d_1(最大池1(无,17,32(0


flatten_1(压扁((无,544(0


dense_2(致密((无,10(5450


dense_3(致密((无,1(11

参数总数:446493可培训参数:446493不可训练参数:0


Here is the error that occurred when I ran the last cell:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-fa9c5ed3e39a> in <module>()
2 model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
3 # fit network
----> 4 model.fit(Xtrain, ytrain, epochs=10, verbose=2)
3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/data_adapter.py in _check_data_cardinality(data)
1527           label, ", ".join(str(i.shape[0]) for i in nest.flatten(single_data)))
1528     msg += "Make sure all arrays contain the same number of samples."
-> 1529     raise ValueError(msg)
1530 
1531 
ValueError: Data cardinality is ambiguous:
x sizes: 9473
y sizes: 1800
Make sure all arrays contain the same number of samples.

I am new to CNNs, so any help would be greatly appreciated! Thank you very much.

Your training labels consist of only 1800 entries, but your training input has 9473 samples.

>>> ytrain = np.array([0 for _ in range(900)] + [1 for _ in range(900)])
>>> ytrain.shape
(1800,)
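
By contrast, the padded input should have 9473 rows (the x size reported in the error; the second dimension, 41, is the max_length visible in the embedding layer's output shape):

>>> Xtrain.shape
(9473, 41)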

Assuming you really do want 50% of your labels to be 0 and 50% to be 1, you need to change it to:

ytrain = np.array([0 for _ in range(len(Xtrain)//2)] + [1 for _ in range(len(Xtrain)//2)])

This creates an array in which half of the labels for Xtrain are 0 and the other half are 1.
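
Note that your 9473 training samples are an odd count, so integer division would actually leave this version one label short of the input:

>>> len(Xtrain) // 2
4736
>>> ytrain.shape
(9472,)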

Update

For uneven datasets this may work better, since it splits around the middle index and should therefore handle odd lengths:

length = len(Xtrain)
middle_index = length//2
ytrain = np.array([0 for _ in range(len(Xtrain[:middle_index]))] + [1 for _ in range(len(Xtrain[middle_index:]))])
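
As a minimal alternative sketch, you could build the same labels with np.zeros and np.ones instead of list comprehensions (this assumes, as above, that the first half of Xtrain should be labeled 0 and the rest 1):

import numpy as np

middle_index = len(Xtrain) // 2
# zeros for the first half, ones for the remainder; the remainder picks up
# the extra element when len(Xtrain) is odd
ytrain = np.concatenate([np.zeros(middle_index), np.ones(len(Xtrain) - middle_index)])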
