spaCy Matcher with a regex that spans tokens



I have the following sentences:

phrases = ['children externalize their emotions through outward behavior',
           'children externalize hidden emotions.',
           'children externalize internalized emotions.',
           'a child might externalize a hidden emotion through misbehavior',
           'a kid might externalize some emotions through behavior',
           'traumatized children externalize their hidden trauma through bad behavior.',
           'The kid is externalizing internal traumas',
           'A child might externalize emotions though his outward behavior',
           'The kid externalized a lot of his emotions through misbehavior.']

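I want to capture any noun that follows the verb externalize in any of its forms (externalizing, externalized, etc.).

In this case, we should get: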

externalize their emotions
externalize hidden emotions
externalize internalized emotions
externalize a hidden emotion
externalize some emotions
externalize their hidden trauma
externalizing internal traumas
externalized a lot of his emotions

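So far, I can only capture the noun if it comes immediately after the verb externalize.

I want to capture the noun if it occurs fewer than 15 characters after the verb. For example, "externalized a lot of his emotions" should match, because " a lot of his " is only 14 characters, counting spaces.

Here is my attempt, which is far from perfect.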

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(vocab=nlp.vocab)

# A verb immediately followed by a noun
verb_noun = [{'POS': 'VERB'}, {'POS': 'NOUN'}]
# spaCy v3 signature: add(key, list_of_patterns)
matcher.add('verb_noun', [verb_noun])

list_result = []
for phrase in phrases:
    doc = nlp(phrase)
    doc_match = matcher(doc)
    if doc_match:
        for match in doc_match:
            # each match is (match_id, start, end)
            start = match[1]
            end = match[2]
            result = doc[start:end]
            result = [i.lemma_ for i in result]
            # keep only matches whose verb is a form of "externalize"
            if 'externaliz' in result[0].lower():
                result = ' '.join(result)
                list_result.append(result)

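"I want to capture the noun if it occurs fewer than 15 characters after the verb. For example, 'externalized a lot of his emotions' should match, because ' a lot of his ' is only 14 characters, counting spaces."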

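You can do that, but I wouldn't recommend it. You would write a regular expression that matches the string and use Doc.char_span to create the match. Because the Matcher works on tokens, a condition like "14 characters, including spaces" cannot reasonably be implemented with it. Also, this heuristic is a hack and will behave erratically.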
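
For completeness, the character-based heuristic could be sketched roughly as follows. This is an illustration under assumptions, not a recommended solution: the names VERB_RE, char_window_matches and matches_as_spans are made up for this example, and picking the farthest noun inside the window is only a guess at the intended behavior. A regular expression finds the verb forms, token character offsets (Token.idx) locate the nouns, and Doc.char_span turns each character range back into a Span.

```python
import re

# Hypothetical helper names; only Token.idx, Token.pos_ and Doc.char_span
# are real spaCy API here.
VERB_RE = re.compile(r"\bexternaliz\w*", re.IGNORECASE)


def char_window_matches(text, noun_offsets, max_gap=15):
    """Return (start, end) character ranges from each verb form to the
    farthest noun that starts fewer than `max_gap` characters after it.

    `noun_offsets` is a list of (start, end) character offsets of nouns.
    """
    ranges = []
    for verb in VERB_RE.finditer(text):
        # End offsets of all nouns that begin inside the window
        in_window = [
            end for start, end in noun_offsets
            if 0 <= start - verb.end() < max_gap
        ]
        if in_window:
            ranges.append((verb.start(), max(in_window)))
    return ranges


def matches_as_spans(doc, max_gap=15):
    """Apply the heuristic to a tagged spaCy Doc and return Span objects."""
    nouns = [(t.idx, t.idx + len(t)) for t in doc if t.pos_ == "NOUN"]
    spans = []
    for start, end in char_window_matches(doc.text, nouns, max_gap):
        # char_span returns None if the offsets miss token boundaries
        span = doc.char_span(start, end)
        if span is not None:
            spans.append(span)
    return spans
```

With a loaded pipeline, matches_as_spans(nlp(phrase)) would return the spans for one sentence. The result depends entirely on where the tagger places NOUN labels and on an arbitrary character count, which is why the approach is fragile.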

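I suspect what you actually want to do is figure out what is being externalized, that is, find the object of the verb. In that case you should use the DependencyMatcher. Here is an example that uses it with a simple rule and also merges noun chunks: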

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

texts = ['children externalize their emotions through outward behavior',
         'children externalize hidden emotions.',
         'children externalize internalized emotions.',
         'a child might externalize a hidden emotion through misbehavior',
         'a kid might externalize some emotions through behavior',
         'traumatized children externalize their hidden trauma through bad behavior.',
         'The kid is externalizing internal traumas',
         'A child might externalize emotions though his outward behavior',
         'The kid externalized a lot of his emotions through misbehavior.']

pattern = [
    {
        "RIGHT_ID": "externalize",
        "RIGHT_ATTRS": {"LEMMA": "externalize"}
    },
    {
        "LEFT_ID": "externalize",
        "REL_OP": ">",
        "RIGHT_ID": "object",
        "RIGHT_ATTRS": {"DEP": "dobj"}
    },
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("EXTERNALIZE", [pattern])

# what was externalized?
# this is optional: merge noun phrases
nlp.add_pipe("merge_noun_chunks")

for doc in nlp.pipe(texts):
    for match_id, tokens in matcher(doc):
        # tokens[0] is like "externalize"
        print(doc[tokens[1]])

Output:

their emotions
hidden emotions
internalized emotions
a hidden emotion
some emotions
their hidden trauma
internal traumas
emotions
his outward behavior
a lot
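
The last line of the output, "a lot", stops short of "of his emotions", and the question asked for the verb to be included in each result. One possible extension, sketched here under assumptions, is to slice from the verb to the right edge of the object's subtree using Token.right_edge. The names PATTERN, verb_through_object and print_verb_phrases are made up for this example, and the sketch assumes the object follows the verb.

```python
# Same pattern as in the answer: "externalize" and its direct object.
PATTERN = [
    {"RIGHT_ID": "externalize", "RIGHT_ATTRS": {"LEMMA": "externalize"}},
    {"LEFT_ID": "externalize", "REL_OP": ">", "RIGHT_ID": "object",
     "RIGHT_ATTRS": {"DEP": "dobj"}},
]


def verb_through_object(doc, verb_i, obj_i):
    """Slice from the verb to the right edge of the object's subtree.

    Token.right_edge is the last token of the subtree headed by the object,
    so a phrase attached to the object (such as "of his emotions") is kept.
    Assumes the object comes after the verb.
    """
    end = doc[obj_i].right_edge.i + 1
    return doc[verb_i:end]


def print_verb_phrases(texts, model="en_core_web_sm"):
    """Print verb-plus-object phrases for every match in `texts`."""
    # Imported here so that verb_through_object stays usable without spaCy.
    import spacy
    from spacy.matcher import DependencyMatcher

    nlp = spacy.load(model)
    matcher = DependencyMatcher(nlp.vocab)
    matcher.add("EXTERNALIZE", [PATTERN])
    for doc in nlp.pipe(texts):
        # token ids come back in the same order as the pattern entries
        for match_id, (verb_i, obj_i) in matcher(doc):
            print(verb_through_object(doc, verb_i, obj_i).text)
```

Calling print_verb_phrases(texts) should print phrases along the lines of "externalize their emotions". The exact spans depend on how the model parses each sentence, so results may differ between model versions, and the merge_noun_chunks step is not needed for this variant.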
