How can I process each sentence to find and replace synonyms of matched words?



I'm currently working with spaCy and have a corpus (containing 960,256 words) that looks like this:

['The EMD F7 was a 1,500 horsepower (1,100 kW) Diesel-electric locomotive produced between February 1949 and December 1953 by the Electro-Motive Division of General Motors (EMD) and General Motors Diesel (GMD). ',
'Third stream ',
"Gil Evans' influence ",
'The horn in the spotlight ',
'Contemporary horn in jazz ']

I have a function that finds synonyms for a word (using spaCy):

def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]

    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn+1] if w.lower_ != word.lower_]

The returned array of results looks like this:

[('dogs', 0.8835931), ('puppy', 0.85852146), ('pet', 0.8057451)]
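
For reference, output of that shape would come from a call like the sketch below (assuming nlp is a model with word vectors, e.g. en_core_web_lg, and that the function works in your environment; the answer further down reports it may not):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")  # assumed model; any model shipping vectors should do

print(most_similar("dog", topn=3))
# e.g. [('dogs', 0.8835931), ('puppy', 0.85852146), ('pet', 0.8057451)]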

Then I have a method that replaces one word with another, like this:

def replace_word(orig_text, replacement):
    tok = nlp(orig_text)
    text = ''
    buffer_start = 0
    for _, match_start, _ in matcher(tok):
        if match_start > buffer_start:  # If we've skipped over some tokens, let's add those in (with trailing whitespace if available)
            text += tok[buffer_start: match_start].text + tok[match_start - 1].whitespace_
        text += replacement + tok[match_start].whitespace_  # Replace token, with trailing whitespace if available
        buffer_start = match_start + 1
    text += tok[buffer_start:].text
    return text

It works by taking a sentence and replacing a matched word in it, like this:

replace_word("Hi this dog is my dog.", "Simba")

The output is simply the sentence with the word replaced:

Hi this Simba is my Simba.

Before this works, the Matcher has to be defined, like this:

matcher = Matcher(nlp.vocab)
matcher.add("dog", [[{"LOWER": "dog"}]])  # spaCy v3 signature (v2 took an on_match callback as the second argument)

or with patterns added, such as:

# One single-token pattern per word, so the Matcher fires on any of them
patterns = [
    [{"LOWER": "amazing"}],
    [{"LOWER": "anger"}],
    [{"LOWER": "angry"}],
    [{"LOWER": "answer"}],
    [{"LOWER": "ask"}],
    [{"LOWER": "awful"}],
    [{"LOWER": "bad"}],
]
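
If you have many words to match, the pattern list can also be built programmatically; here is a minimal sketch, where words_to_replace is a hypothetical list of your own choosing:

# Hypothetical list of words whose occurrences should be matched
words_to_replace = ["amazing", "anger", "angry", "answer", "ask", "awful", "bad"]

# Build one single-token pattern per word
patterns = [[{"LOWER": w}] for w in words_to_replace]
matcher.add("replace_candidates", patterns)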

What I want is to take the corpus, feed it sentence by sentence into most_similar so that I can build the list of words to replace, and then do the replacements with replace_word. The thing is, I'm not sure how to do this. I've been trying for a while, but it always fails in one way or another (either it won't take batches, so I can't do it all in one pass, or, if I simply split each sentence with .split(" "), the words end up with empty vectors)... Could you help me out?

I hope I've understood your needs correctly. My guess is that you want to:

  1. Iterate over the corpus
  2. Use the matcher to find specific tokens
  3. Find synonyms for the matched tokens
  4. Return a new list of sentences, but with the tokens replaced

If that's the case, then the first thing you need is a working similarity function (I tried the one above, but it didn't work for me); you can try this one instead:

def most_similar(word, topn):
    words = []
    target = nlp.vocab.strings[word]
    if target in nlp.vocab.vectors:
        synonyms = nlp.vocab.vectors.most_similar(np.asarray([nlp.vocab.vectors[target]]), n=topn)
        words = [nlp.vocab.strings[w].lower() for w in synonyms[0][0] if nlp.vocab.strings[w].lower() != word.lower()]
    return words
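
As a quick sanity check, you can call it directly once a vector model is loaded (as in the pipeline setup further down); since this works on raw vector similarity, expect related words rather than strict synonyms:

import spacy

nlp = spacy.load("en_core_web_lg")  # assumed model with word vectors

print(most_similar("dog", 4))
# e.g. ['dogs', 'puppy', 'pet'] (entries and order depend on the model's vectors)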

You also mentioned that you want to run this over the corpus. I'd suggest combining nlp.pipe() with a custom Doc extension (set_extension) for a performance boost. You can do it like this:

# Imports assumed throughout the snippets below
import numpy as np
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Doc

# First of all we create a component to add to the pipe
@Language.component("synonym_replacer")
def synonym_replacer(doc):
    if not Doc.has_extension("synonyms"):
        Doc.set_extension("synonyms", default=[])
    doc._.synonyms.extend(list(replace_synonyms(doc, 4)))
    return doc

# This will replace matched tokens by their synonyms
def replace_synonyms(doc, topn):
    for sent in doc.sents:
        matches = matcher(sent)
        for _, start, end in matches:
            span = sent[start:end]
            syns = most_similar(span.text, topn)
            for syn in syns:
                yield nlp.make_doc(sent[:start].text_with_ws + f"{syn} " + sent[end:].text_with_ws)

Now that you have all the functions in place, you can assemble your pipeline and run it over the whole corpus:

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("synonym_replacer")

matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "dog"}]]
matcher.add("dog", patterns)

corpus = ["I have a great dog", "Hi this dog is my dog."]

docs = nlp.pipe(corpus)
for doc in docs:
    print(doc.text)
    print(doc._.synonyms)
    print("****")
# Output
# I have a great dog
# [I have a great dogs , I have a great puppy , I have a great pet ]
# ****
# Hi this dog is my dog.
# [Hi this dogs is my dog., Hi this puppy is my dog., Hi this pet is my dog.,
#  Hi this dog is my dogs ., Hi this dog is my puppy ., Hi this dog is my pet .]
# ****
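
Since your corpus is large (960,256 words), it may also be worth tuning the batching when you stream it through the pipe; a minimal sketch, with an arbitrary batch_size:

# Stream the whole corpus through the pipeline in batches.
# batch_size here is an arbitrary starting point; tune it to your memory budget.
docs = nlp.pipe(corpus, batch_size=64)
for doc in docs:
    print(doc._.synonyms)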
