NLP model for binary classification outputs a class for each word



I am basically running the code from chapter 11 of Francois Chollet's Deep Learning with Python. It is a binary sentiment classification: the label for each sentence is 0 or 1. After running the model as in the book, I wanted to "validate" individual sentences. The full code is a public Kaggle notebook that can be found here: https://www.kaggle.com/louisbunuel/deep-learning-with-python It is an adaptation of this notebook: https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/chapter11_part02_sequence-models.ipynb

The only thing I added is a way to "extract" a tokenized sentence from the tokenized TensorFlow dataset, so that I could see an example of the output. I expected to get a single number between 0 and 1 (which would indeed be a probability), but instead I got an array of numbers between 0 and 1, one per word in the sentence. In other words, it looks as if the model is not assigning a label to each sentence but to each word.
Can anyone explain what I am doing wrong? Is it the way I "extract" a sentence from the TensorFlow dataset? Here is the code from the book/GitHub notebook:

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

import os, pathlib, shutil, random
from tensorflow import keras

batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Preparing the integer sequence dataset:

from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)
int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

inputs = keras.Input(shape=(None,), dtype="int64")
embedded = layers.Embedding(input_dim=max_tokens, output_dim=256)(inputs)
x = layers.Bidirectional(layers.LSTM(32))(embedded)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("embeddings_bidir_gru.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=2, callbacks=callbacks)
model = keras.models.load_model("embeddings_bidir_gru.keras")
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

我的"加法";代码就是这个部分。模型运行后,我取出一句这样的话:

ds = int_val_ds.take(1)     # int_val_ds is the dataset already vectorized to integers
for sentence, label in ds:  # each element is a (sentence_batch, label_batch) pair
    print(sentence.shape, label)
>> (32, 600) tf.Tensor([1 1 1 0 1 0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0], shape=(32,), dtype=int32)

This is a batch of 32 sentences with 32 corresponding labels. If I look at the shape of one element:

sentence[2].shape
>> TensorShape([600])

And if I type:

model.predict(sentence[2])
>> array([[0.49958456],
[0.50042397],
[0.50184965],
[0.4992085 ],...
[0.50077164]], dtype=float32)

which is an array with 600 entries. I expected a single number between 0 and 1. What went wrong?
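What happens here is that `keras.Input(shape=(None,))` describes one sample of shape `(sequence_length,)`, so `predict` always interprets the *first* axis of its input as the batch axis. A tensor of shape `(600,)` is therefore split into 600 tiny samples, each yielding one sigmoid output, which is exactly the `(600, 1)` array above. A minimal sketch with a toy stand-in model (hypothetical small vocabulary and pooling layer, not the book's architecture) shows that adding the leading batch axis gives one probability for the whole sentence:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in with the same (batch, timesteps) input signature as the
# model in the question; vocabulary/embedding sizes are made up.
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(input_dim=100, output_dim=8)(inputs)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
toy_model = keras.Model(inputs, outputs)

sentence = np.random.randint(0, 100, size=(600,))  # one sentence, no batch axis

# With an explicit batch axis the model sees ONE 600-token sample:
pred = toy_model.predict(sentence[None, :], verbose=0)
print(pred.shape)  # (1, 1): a single probability for the whole sentence
```

The same applies to the real model in the question: `model.predict(sentence[2:3])` also works, because slicing with a range keeps the batch axis.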

Following @Tou You's comment, I tried:

model.predict(tf.reshape(sentence[2], [1, 600]))
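For reference, `tf.reshape` is one of several equivalent ways to restore the leading batch axis; a minimal sketch with a dummy tensor standing in for `sentence[2]`:

```python
import tensorflow as tf

sentence_2 = tf.zeros([600], dtype=tf.int64)  # dummy stand-in for sentence[2]

# All three produce a (1, 600) tensor that predict() treats as one sample:
a = tf.reshape(sentence_2, [1, 600])
b = tf.expand_dims(sentence_2, axis=0)
c = sentence_2[tf.newaxis, :]
print(a.shape, b.shape, c.shape)  # (1, 600) (1, 600) (1, 600)
```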

Latest update: