I generate pairwise sentence-order probabilities as follows:
import itertools
import random

import numpy as np
import tensorflow as tf
from transformers import BertTokenizer, TFBertForNextSentencePrediction

np.set_printoptions(suppress=True)

cache_dir = '/path/to/cache/dir'
pretrained_weights = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(
    pretrained_weights,
    cache_dir=cache_dir,
)
model = TFBertForNextSentencePrediction.from_pretrained(
    pretrained_weights,
    cache_dir=cache_dir,
)

sentences = """
In “The Necklace” by Guy de Maupassant, the main character, Mathilde, has always dreamed of
being an aristocrat but lives in poverty.
Embarrassed about her lack of fine possessions, she borrows a necklace from a wealthy
friend but loses it.
The story is known for its subversive and influential twist ending
"""
sentences = [s.strip() for s in sentences.strip().split('.')]
random.shuffle(sentences)
print(sentences)

# All ordered pairs of distinct sentences; materialized as a list because
# `batch_encode_plus` expects a list of (text, text_pair) tuples.
pairs = list(itertools.permutations(sentences, 2))
encoded = tokenizer.batch_encode_plus(pairs, return_tensors='np', padding=True)
outputs = model(encoded)
probs = tf.keras.activations.softmax(outputs[0])

for i, s in enumerate(sentences, 1):
    print(f's{i}: {s}')
for s, prob in zip(itertools.permutations(['s1', 's2', 's3'], 2), probs):
    print(s, prob)
I am not sure how to interpret the resulting probabilities in order to produce the ordered sentences.
s1: Embarrassed about her lack of fine possessions, she borrows a necklace from a wealthy
friend but loses it
s2: In “The Necklace” by Guy de Maupassant, the main character, Mathilde, has always dreamed of
being an aristocrat but lives in poverty
s3: The story is known for its subversive and influential twist ending
('s1', 's2') tf.Tensor([0.9987061 0.00129389], shape=(2,), dtype=float32)
('s1', 's3') tf.Tensor([0.9514299 0.04857007], shape=(2,), dtype=float32)
('s2', 's1') tf.Tensor([0.9994491 0.00055089], shape=(2,), dtype=float32)
('s2', 's3') tf.Tensor([0.94130975 0.05869029], shape=(2,), dtype=float32)
('s3', 's1') tf.Tensor([0.15520796 0.84479207], shape=(2,), dtype=float32)
('s3', 's2') tf.Tensor([0.98460925 0.01539072], shape=(2,), dtype=float32)
Update: based on an answer I found here (not sure about its accuracy), I created this hacky solution, which does sort the sentences. However, since it computes probabilities over the Cartesian product of the sentences, the number of items to predict grows with the square of the input, e.g. for a set of 88 sentences the total number of pairs is 88 * 88 = 7744, which will not scale well. Inference speed is not much of a problem on a GPU, but better suggestions for achieving the same result are still welcome.
class HashableDict(dict):
    """Passed to `tf.keras.Model.predict` to enable batching."""
    def __hash__(self):
        return hash(tuple(self.items()))


def create_correlation_matrix(sentences, tokenizer, model, **kwargs):
    np.set_printoptions(suppress=True)
    # Cartesian product: every ordered pair, including each sentence with itself.
    pairs = list(itertools.product(sentences, repeat=2))
    encoded = tokenizer.batch_encode_plus(pairs, return_tensors='np', padding=True)
    logits = model.predict(HashableDict(**encoded), **kwargs)
    probs = tf.keras.activations.softmax(tf.convert_to_tensor(logits[0]))
    size = len(sentences)
    # Column 0 of the NSP softmax is the "is next sentence" probability.
    return probs[:, 0].numpy().reshape(size, size)


def reorder_sentences(sentences, tokenizer, model, **kwargs):
    ordered = []
    correlation_matrix = create_correlation_matrix(
        sentences, tokenizer, model, **kwargs
    )
    # Start from the globally most probable (previous, next) pair.
    idx = np.unravel_index(
        np.argmax(correlation_matrix, axis=None), correlation_matrix.shape
    )
    while correlation_matrix.any():
        x_idx = idx[1]
        # Zero out the chosen sentence's row and column so it cannot be reused.
        correlation_matrix[idx[0], :] = 0
        correlation_matrix[:, idx[0]] = 0
        ordered.append(idx[0])
        # Greedily pick the most probable successor of the current sentence.
        idx = np.unravel_index(
            np.argmax(correlation_matrix[x_idx, :], axis=None),
            correlation_matrix[x_idx, :].shape,
        )
        idx = (x_idx, idx[0])
    return ordered
if __name__ == '__main__':
    cache_dir = '/path/to/cache/dir'
    pretrained_weights = 'bert-base-multilingual-cased'
    tok = BertTokenizer.from_pretrained(
        pretrained_weights,
        cache_dir=cache_dir,
    )
    m = TFBertForNextSentencePrediction.from_pretrained(
        pretrained_weights,
        cache_dir=cache_dir,
    )
    s = """
In “The Necklace” by Guy de Maupassant, the main character, Mathilde, has always dreamed of
being an aristocrat but lives in poverty.
Embarrassed about her lack of fine possessions, she borrows a necklace from a wealthy
friend but loses it.
The story is known for its subversive and influential twist ending
"""
    s = [ss.strip() for ss in s.strip().split('.')]
    print(s)
    random.shuffle(s)
    print(s)
    ordering = reorder_sentences(s, tok, m, verbose=True, batch_size=8)
    reordered_sentences = [s[idx] for idx in ordering]
    print(ordering, reordered_sentences)
Result:
2/2 [==============================] - 9s 198ms/step
[1, 0, 2] ['In “The Necklace” by Guy de Maupassant, the main character, Mathilde, has always dreamed of\nbeing an aristocrat but lives in poverty', 'Embarrassed about her lack of fine possessions, she borrows a necklace from a wealthy\nfriend but loses it', 'The story is known for its subversive and influential twist ending']
The output probabilities represent the likelihood that the first sentence is followed by the second sentence.

The two probabilities sum to 1 because this is a binary problem: the first value is the probability that the second sentence does follow the first ("yes"), and the second value is the probability that it does not, i.e. that it is a random sentence ("no"). (This is the Hugging Face convention for the NSP head, and it matches your update, which uses probs[:, 0] as the correlation score.)

For example, among your 3 sentences, only the pair ('s3', 's1') is predicted negative: with 84% probability, s1 does not follow s3. All the other pairs are predicted positive.

Based on these results you can obtain a partial graph from which to build the possible ordered sequences. There could be several options, e.g. you could start with the highest probability, or pick the most likely next sentence for every sentence.
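For illustration, here is a minimal sketch (not part of the original answer) of collecting those pairwise probabilities into such a graph, as a dictionary of arc weights indexed by sentence positions. It reuses the sentences, tokenizer, and model objects from the question; the helper name nsp_edge_weights is made up:

import itertools

import tensorflow as tf

def nsp_edge_weights(sentences, tokenizer, model):
    """Return {(i, j): P(sentence j follows sentence i)} for all i != j."""
    pairs = list(itertools.permutations(sentences, 2))
    encoded = tokenizer.batch_encode_plus(pairs, return_tensors='np', padding=True)
    probs = tf.keras.activations.softmax(model(encoded)[0])
    index_pairs = itertools.permutations(range(len(sentences)), 2)
    # Column 0 of the NSP softmax is the "is next sentence" probability.
    return {ij: float(p[0]) for ij, p in zip(index_pairs, probs)}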
Edit: design ideas for obtaining the most likely order

Note that this is an open design question and I am not aware of any standard method to solve it. It might be a research problem, so there could be research papers addressing it. Here is how I would try to do it:

- Build the full graph representing all the possible transitions from one sentence to another: the sentences are the vertices, and directed arcs connect every two sentences, weighted with the probability. At this stage it may be useful to discard all the arcs whose positive probability is lower than some threshold $t$ (e.g. 0.1).
- Then we need to find a (vertex-disjoint) path cover which maximizes the product of the weights/probabilities. This is a complex problem which probably cannot be solved with an exact method (definitely above my pay grade!). A genetic algorithm might work. I can imagine a more simplistic solution as follows (sketched in code after this list):
  - Select a set of candidate first sentences among those which have no arc pointing to them, or only low-probability ones.
  - For every candidate first sentence, follow the path by always picking the most likely next sentence.
  - Discard any path which does not connect all the sentences. In some cases there may be no solution at all, so some workaround would be needed.
  - Finally select the path which maximizes the product of the probabilities (hint: sum the logarithms of the probabilities).

The selected path gives the ordered sequence of sentences.
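Below is a minimal sketch of this simplified search, assuming the arc weights come from a dictionary such as the one built by nsp_edge_weights above; the function name, the default threshold, and the fallback when every vertex has an incoming arc are illustrative assumptions, not a definitive implementation:

import math

def order_by_greedy_paths(n, weights, threshold=0.1):
    """Greedy path search: `n` sentences, `weights` maps (i, j) to
    P(sentence j follows sentence i). Returns the best index order, or
    None when no candidate path covers all sentences."""
    # Keep only arcs whose "is next" probability clears the threshold.
    arcs = {ij: p for ij, p in weights.items() if p >= threshold}
    # Candidate first sentences: vertices with no incoming arc left;
    # fall back to all vertices if every vertex has one.
    has_incoming = {j for _, j in arcs}
    candidates = [i for i in range(n) if i not in has_incoming] or list(range(n))

    best_path, best_score = None, -math.inf
    for start in candidates:
        path, visited = [start], {start}
        score = 0.0  # sum of log-probabilities == log of the product
        while len(path) < n:
            successors = [
                (j, p) for (i, j), p in arcs.items()
                if i == path[-1] and j not in visited
            ]
            if not successors:
                break  # dead end: this path does not connect all sentences
            j, p = max(successors, key=lambda jp: jp[1])
            path.append(j)
            visited.add(j)
            score += math.log(p)
        if len(path) == n and score > best_score:
            best_path, best_score = path, score
    return best_path

# Hypothetical usage, with the objects from the question:
# ordering = order_by_greedy_paths(len(s), nsp_edge_weights(s, tok, m))

Summing the logarithms instead of multiplying the raw probabilities avoids numerical underflow on long paths, which is why the hint above suggests it.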