Spacy 3.0 Matcher: removing overlaps while keeping information about the pattern used



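Is there a shorter, cleaner, or built-in way to remove overlapping matches from the Matcher results while keeping the name of the pattern that produced each match, so that you can still tell which pattern matched? The Matcher results include the pattern ID, but the solutions I have seen for removing overlaps drop that ID.

Below is the solution I am currently using. It works, but it is a bit long: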

import spacy
from spacy.matcher import Matcher

# Assumption: the original post does not show how nlp was created.
# The "POS" attributes need a pipeline with a tagger, e.g. en_core_web_sm.
nlp = spacy.load("en_core_web_sm")
text = "United States vs Canada, Canada vs United States, United States vs United Kingdom, Mark Jefferson vs College, Clown vs Jack Cadwell Jr., South America Snakes vs Lopp, United States of America, People vs Jack Spicer"
doc = nlp(text)
# Matcher
matcher = Matcher(nlp.vocab)
# Two patterns
pattern1 = [{"POS": "PROPN", "OP": "+", "IS_TITLE": True}, {"TEXT": {"REGEX": "vs$"}}, {"POS": "PROPN", "OP": "+", "IS_TITLE": True}]
pattern2 = [{"POS": "ADP"}, {"POS": "PROPN", "IS_TITLE": True}]
matcher.add("Games", [pattern1])
matcher.add("States", [pattern2])
# Output is a list of tuples: (match_id, start, end), where match_id is the hash of the pattern name
matches = matcher(doc)

First, I store the results in a dictionary, with the pattern name as the key and a list of (start, end) tuples as the value:

result = {}
for match_id, start, end in matches:
    # Look up the pattern name from its hash and group the offsets under it
    result.setdefault(nlp.vocab.strings[match_id], []).append((start, end))
print(result)

This prints:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)],
'Games': [(1, 4), (0, 4), (5, 8), (5, 9), (11, 14), (10, 14), (11, 15), (10, 15), (17, 20),
(16, 20), (21, 24), (21, 25), (21, 26), (38, 41), (38, 42)]}

Then I iterate over the results, use filter_spans to remove the overlaps, and store the start and end of each remaining span as a tuple:

for key, value in result.items():
    new_vals = [doc[start:end] for start, end in value]
    val2 = []
    for span in spacy.util.filter_spans(new_vals):
        val2.append((span.start, span.end))
    result[key] = val2
print(result)

This prints the results without overlaps:

{'States': [(2, 4), (6, 8), (12, 14), (18, 20), (22, 24), (30, 32), (35, 37), (39, 41)], 
'Games': [(0, 4), (5, 9), (10, 15), (16, 20), (21, 26), (38, 42)]}

To get the text values, just loop over each pattern's results and print the spans:

print("---Games---")
for start, end in result['Games']:
    span = doc[start:end]
    print(span.text)
print(" ")
print("---States---")
for start, end in result['States']:
    span = doc[start:end]
    print(span.text)

Output:

---Games---
United States vs Canada
Canada vs United States
United States vs United Kingdom
Mark Jefferson vs College
Clown vs Jack Cadwell Jr.
People vs Jack Spicer

---States---
vs Canada
vs United
vs United
vs College
vs Jack
vs Lopp
of America
vs Jack

While processing the matches, you can create new spans that keep the label, instead of using doc[start:end], which does not include a label:

from spacy.tokens import Span
span = Span(doc, start, end, label=match_id)
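
As a minimal sketch of how that fits into a full loop: the example below assumes a blank English pipeline and two toy patterns ("A" and "B", chosen only for illustration, not taken from the question) so it runs without a trained model. Each match tuple becomes a labeled Span, and the label survives filter_spans.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Assumption: blank pipeline and toy patterns, so no model download is needed
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])

doc = nlp("a a a a b")

# Build one labeled Span per match; match_id is the hash of the pattern name
labeled_spans = [
    Span(doc, start, end, label=match_id)
    for match_id, start, end in matcher(doc)
]

# filter_spans keeps the longest non-overlapping spans; labels are untouched
kept = [(span.label_, span.start, span.end) for span in filter_spans(labeled_spans)]
print(kept)
```

Here span.label_ gives back the pattern name as a string, so there is no need for a separate lookup in nlp.vocab.strings.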

Even easier, in spaCy v3.0+, is to use the Matcher option as_spans:

import spacy
from spacy.matcher import Matcher
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("A", [[{"ORTH": "a", "OP": "+"}]])
matcher.add("B", [[{"ORTH": "b"}]])
matched_spans = matcher(nlp("a a a a b"), as_spans=True)
for span in spacy.util.filter_spans(matched_spans):
    print(span.label_, ":", span.text)
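
Note that calling filter_spans on all matches at once removes overlaps across patterns too, whereas the code in the question removes overlaps within each pattern separately. If the per-pattern behavior is what you want, one possible sketch is to group the labeled spans by span.label_ first and filter each group on its own. The patterns below are simplified stand-ins for the ones in the question (they use IS_TITLE and ORTH only, an assumption made so the example runs on a blank pipeline without a tagger).

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: blank pipeline with simplified patterns (no POS tags required)
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("Games", [[
    {"IS_TITLE": True, "OP": "+"},
    {"ORTH": "vs"},
    {"IS_TITLE": True, "OP": "+"},
]])
matcher.add("States", [[{"ORTH": "vs"}, {"IS_TITLE": True}]])

doc = nlp("United States vs Canada")

# Group the labeled spans by pattern name
by_label = defaultdict(list)
for span in matcher(doc, as_spans=True):
    by_label[span.label_].append(span)

# Remove overlaps within each pattern only, keeping (start, end) offsets
result = {
    label: [(s.start, s.end) for s in filter_spans(spans)]
    for label, spans in by_label.items()
}
print(result)
```

This keeps the same dictionary shape as in the question (pattern name mapped to a list of offsets) while skipping the manual nlp.vocab.strings lookup and the doc[start:end] round trip.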
