HuggingFace: determining the probability of a sequence generated by a T5 model

I am using HuggingFace's T5-Large for inference. Given a premise and a hypothesis, I need to determine whether they are related. So if I feed in the string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed.", the model should return entailment, neutral, or contradiction.

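Although I can determine the result, I cannot determine the probability of the generated sequence. For instance, suppose the model generates entailment for the example above. I also want to know what the probability of entailment is. So far, I have been using the following code: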

from transformers import T5Tokenizer, T5ForConditionalGeneration

def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis
    # Build the MNLI prompt in the format T5 was trained on
    token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids
    # output_scores=True also returns the per-step scores alongside the sequences
    output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)

premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."
print(is_entailment(premise, hypothesis))

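I have tried using the scores returned in the output, but I am not sure how to compute a probability from them. The same goes for the last hidden states that can be retrieved from generate(). I saw in another Stack Overflow question that applying a softmax to the last hidden states was suggested, but I am not sure how to do that.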

How can I compute the probability of the generated sequence? That is, if I get entailment for a hypothesis-premise pair, what is P(entailment)?

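The scores you get are the output token distributions before the softmax, the so-called logits. You can obtain the probabilities of the generated tokens by normalizing the logits with a softmax and picking the entries for the respective token ids. You can get those ids from the sequences field returned by the generate method.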
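As a rough sketch of that step: the pure-Python part below applies a softmax to each step's logits and reads off the probability of the token that was actually generated. The function names are hypothetical, and `t5_generated_token_probs` is only an assumed way of wiring this to the `generate` call from the question (greedy decoding, batch size 1); it is not called here.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generated_token_probs(step_logits, token_ids):
    # step_logits[i]: logits over the vocabulary at decoding step i
    # token_ids[i]:   id of the token generated at that step
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]

def t5_generated_token_probs(model, tokenizer, text):
    # Hypothetical wrapper; assumes greedy decoding with a single input
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    output = model.generate(input_ids, output_scores=True,
                            return_dict_in_generate=True, max_new_tokens=15)
    # sequences[0][0] is the decoder start token, which has no score, so skip it
    token_ids = output.sequences[0][1:].tolist()
    step_logits = [step[0].tolist() for step in output.scores]
    return generated_token_probs(step_logits, token_ids)
```

Calling `t5_generated_token_probs(model, tokenizer, "mnli premise: ... hypothesis: ...")` would then give one probability per generated subword token, including the end-of-sequence token.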

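However, these are not the probabilities you are looking for, because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] by the t5-small tokenizer). This is made even trickier by the fact that different answers get split into different numbers of tokens. You can get an approximate score by averaging the token probabilities (this is what is typically done in beam search). Such scores do not sum up to one.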
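A minimal sketch of such an approximate score, assuming you already have one probability per generated token. The helper names are made up for illustration. Note one assumption: the averaging here is done over log-probabilities (a geometric mean), which is the usual length normalization in beam search, rather than a plain arithmetic mean of the probabilities.

```python
import math

def joint_probability(token_probs):
    # Product of the token probabilities, computed in log space;
    # this shrinks as the answer is split into more tokens
    return math.exp(sum(math.log(p) for p in token_probs))

def length_normalized_score(token_probs):
    # Average log-probability per token, mapped back to probability space
    mean_logprob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_logprob)
```

The normalized score makes answers of different token lengths more comparable, but the scores of the three labels still do not add up to one.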

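If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them so that they sum up to one.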
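One way this could look, as a sketch rather than a definitive recipe: score each candidate label by passing it as `labels` to the model (teacher forcing), sum the token log-probabilities, then renormalize over the three candidates. `normalize_candidates` and `t5_candidate_logprob` are hypothetical names; the second one assumes a T5ForConditionalGeneration model with its tokenizer, and counts the end-of-sequence token that the tokenizer appends as part of the answer.

```python
import math

def normalize_candidates(candidate_logprobs):
    # Softmax over the candidates' sequence log-probabilities,
    # so that the resulting probabilities sum to one
    m = max(candidate_logprobs.values())
    exps = {name: math.exp(lp - m) for name, lp in candidate_logprobs.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def t5_candidate_logprob(model, tokenizer, text, answer):
    # Hypothetical helper: log P(answer | text) under the model
    import torch  # imported lazily so the helper above has no dependencies
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    labels = tokenizer(answer, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model build the shifted decoder inputs itself
        logits = model(input_ids=input_ids, labels=labels).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Pick the log-probability of each label token at its position
    token_logprobs = logprobs[0].gather(-1, labels[0].unsqueeze(-1)).squeeze(-1)
    return token_logprobs.sum().item()

# Intended usage (not executed here):
# candidates = ["entailment", "neutral", "contradiction"]
# logps = {c: t5_candidate_logprob(model, tokenizer, text, c) for c in candidates}
# print(normalize_candidates(logps))
```

Because the normalization only runs over the three labels, the result is P(label | the answer is one of the three), not the probability the model assigns to the label among all possible output strings.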
