How to extract and use BERT sentence encodings to measure text similarity between sentences (PyTorch/TensorFlow)

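I want to build a text similarity model that I intend to use for FAQ lookup and other ways of retrieving the most relevant text. I want to use the highly optimised BERT model for this NLP task. My plan is to take the encodings of all the sentences, compute a similarity matrix with cosine_similarity, and return the results.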

Hypothetically, if I have two sentences, hello world and hello hello world, I assume BERT would give me something like [0.2, 0.3, 0] (0 denoting padding) and [0.2, 0.2, 0.3], which I could then pass to sklearn's cosine_similarity.

How should I extract the sentence embeddings so I can use them in the model? I found somewhere that they can be extracted like this:

Using PyTorch:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :]  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple

Is this the right approach? I ask because I read somewhere that BERT provides different types of embeddings.

Also, please suggest any other methods for finding text similarity.

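When you want to compare sentence embeddings, the recommended way to do this with BERT is to use the value of the CLS token. This corresponds to the first token of the output (after the batch dimension).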

last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]

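This gives you one embedding for the entire sentence. Since every sentence ends up with an embedding of the same size, you can then easily compute the cosine similarity.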
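
To make the cosine similarity step concrete, here is a minimal sketch using NumPy only. The function name cosine_similarity_matrix and the toy vectors are illustrative assumptions, not part of any library; in practice each row would be one sentence's CLS embedding, for example cls_embedding.detach().numpy() in PyTorch, stacked into a single array.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity for an array of shape (n_sentences, hidden_size)."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    # L2 norm of each row; clip to avoid dividing by zero for an all-zero row.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / np.clip(norms, 1e-12, None)
    # The dot product of unit vectors is their cosine similarity.
    return normalized @ normalized.T

# Toy stand-ins for real CLS vectors (hypothetical values, hidden_size = 3).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])
similarity = cosine_similarity_matrix(toy_embeddings)
```

Entry similarity[i, j] is the similarity between sentence i and sentence j, so for an FAQ lookup you would take the row for the query and pick the highest-scoring other entries. sklearn's cosine_similarity should give the same matrix for the same input.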

If using the CLS token does not give you satisfactory results, you can also try averaging the output embeddings of every word in the sentence.
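
The averaging idea can be sketched as follows, assuming the sentences were tokenized as a padded batch so that an attention mask (1 for real tokens, 0 for padding) is available alongside the last hidden states. The function name mean_pool and the toy arrays are hypothetical; with real model output you would first convert the tensors to arrays. Note that BERT works on word pieces, so this averages over tokens rather than whole words, and the mask as returned by the tokenizer also covers the special tokens.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings per sentence, ignoring padded positions.

    token_embeddings: shape (batch, seq_len, hidden_size)
    attention_mask:   shape (batch, seq_len), 1 = real token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # Add a trailing axis so the mask broadcasts over the hidden dimension.
    mask = np.asarray(attention_mask, dtype=np.float64)[:, :, None]
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clip to avoid division by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch (hypothetical values): 2 sentences, 3 positions, hidden_size = 2.
# The last position of the first sentence is padding and must be ignored.
toy_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
toy_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])
pooled = mean_pool(toy_hidden, toy_mask)
```

Masking matters here: a plain mean over the sequence axis would mix the padding positions into the sentence vector, so sentences of different lengths would not be comparable.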
