I'm running this convolutional neural network model on Google Colab. My goal is text classification. Here is my code and the error:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
# load all test reviews
food_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/food', vocab, False)
location_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/location', vocab, False)
price_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/price', vocab, False)
service_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/service', vocab, False)
time_docs = process_docs('/content/drive/MyDrive/CNN_moviedata/data/time', vocab, False)
test_docs = food_docs + location_docs + price_docs + service_docs + time_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
# define vocabulary size (largest integer value)
vocab_size = len(tokenizer.word_index) + 1
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)
This is my model summary output:
Model: "sequential_1"
Layer (type)                   Output Shape          Param #
embedding_1 (Embedding)        (None, 41, 100)       415400
conv1d_1 (Conv1D)              (None, 34, 32)        25632
max_pooling1d_1 (MaxPooling1D) (None, 17, 32)        0
flatten_1 (Flatten)            (None, 544)           0
dense_2 (Dense)                (None, 10)            5450
dense_3 (Dense)                (None, 1)             11
Total params: 446,493
Trainable params: 446,493
Non-trainable params: 0
None
This is the error that occurred when I ran the last cell:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-fa9c5ed3e39a> in <module>()
2 model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
3 # fit network
----> 4 model.fit(Xtrain, ytrain, epochs=10, verbose=2)
3 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/data_adapter.py in _check_data_cardinality(data)
1527 label, ", ".join(str(i.shape[0]) for i in nest.flatten(single_data)))
1528 msg += "Make sure all arrays contain the same number of samples."
-> 1529 raise ValueError(msg)
1530
1531
ValueError: Data cardinality is ambiguous:
x sizes: 9473
y sizes: 1800
Make sure all arrays contain the same number of samples.
I'm new to CNNs and would appreciate any help! Thank you very much.
Your training labels consist of only 1800 entries, but your training input contains 9473 samples.
>>> ytrain = np.array([0 for _ in range(900)] + [1 for _ in range(900)])
>>> ytrain.shape
(1800,)
Assuming you really do want 50% of your labels to be 0 and 50% to be 1, you need to change this to:
ytrain = np.array([0 for _ in range(len(Xtrain)//2)] + [1 for _ in range(len(Xtrain)//2)])
This creates an array where half of the labels for Xtrain are 0 and the other half are 1.
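As a quick sanity check (using a hypothetical, even-sized Xtrain in place of your real padded data), the sample counts then agree:

```python
import numpy as np

# hypothetical padded input: 10 samples of length 41 stand in for the real Xtrain
Xtrain = np.zeros((10, 41), dtype=int)

# half zeros, half ones -- matches len(Xtrain) exactly when the count is even
ytrain = np.array([0 for _ in range(len(Xtrain) // 2)]
                  + [1 for _ in range(len(Xtrain) // 2)])

print(Xtrain.shape[0], ytrain.shape[0])  # both 10, so model.fit would accept them
```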
Update
For datasets with an odd number of samples this may work better, since it splits around the middle index and therefore handles odd lengths:
length = len(Xtrain)
middle_index = length//2
ytrain = np.array([0 for _ in range(len(Xtrain[:middle_index]))]
                  + [1 for _ in range(len(Xtrain[middle_index:]))])
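To see that this handles an odd sample count (a sketch with hypothetical data shaped like your 9473 padded sequences):

```python
import numpy as np

# hypothetical odd-length input: 9473 samples, each padded to length 41
Xtrain = np.zeros((9473, 41), dtype=int)

length = len(Xtrain)
middle_index = length // 2  # 4736

# the two slices together cover every sample, so the lengths always match
ytrain = np.array([0 for _ in range(len(Xtrain[:middle_index]))]
                  + [1 for _ in range(len(Xtrain[middle_index:]))])

print(len(ytrain) == len(Xtrain))  # True: 4736 zeros + 4737 ones
```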