I want to implement some text manipulation as preprocessing for key-phrase extraction. Consider the following example:
import spacy
text = "conversion of existing underground gas storage facilities into storage facilities dedicated to hydrogen-storage"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')
Result:
conversion NOUN NN ROOT noun noun, singular or mass
of ADP IN prep adposition conjunction, subordinating or preposition
existing VERB VBG amod verb verb, gerund or present participle
underground ADJ JJ amod adjective adjective (English), other noun-modifier (Chinese)
gas NOUN NN compound noun noun, singular or mass
storage NOUN NN compound noun noun, singular or mass
facilities NOUN NNS pobj noun noun, plural
into ADP IN prep adposition conjunction, subordinating or preposition
storage NOUN NN compound noun noun, singular or mass
facilities NOUN NNS pobj noun noun, plural
dedicated VERB VBN acl verb verb, past participle
to ADP IN prep adposition conjunction, subordinating or preposition
hydrogen NOUN NN compound noun noun, singular or mass
- PUNCT HYPH punct punctuation punctuation mark, hyphen
storage NOUN NN pobj noun noun, singular or mass
I want to detect when a given word (e.g. storage) is preceded by a noun (e.g. gas storage), so that I can replace the space character with a hyphen (as has already been done in hydrogen-storage). But when my word is preceded by a POS element that is not a noun (e.g. into storage), I don't want to change the space character.
Expected output: "conversion of existing underground gas-storage facilities into storage facilities dedicated to hydrogen-storage"
Is there an efficient way to do this?
Thanks in advance for your help!
spaCy provides a rule-based Matcher. It lets you define rules to find patterns such as "a noun followed by a noun":
from spacy.matcher import Matcher
pattern = [{"POS": "NOUN"}, {"POS": "NOUN"}]
matcher = Matcher(nlp.vocab)
matcher.add("MultiWordExpression", [pattern])
…which you can then use to extract the matching sequences (this is almost verbatim from the spaCy documentation):
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
The output for your text is:
8584524718281925236 MultiWordExpression 4 6 gas storage
8584524718281925236 MultiWordExpression 5 7 storage facilities
8584524718281925236 MultiWordExpression 8 10 storage facilities
There is also the retokenizer.merge method for merging tokens, but in this case it does not work - see below:
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        retokenizer.merge(doc[start:end])
In your case there are overlapping spans ("gas storage" and "storage facilities" overlap), which raises ValueError: [E102] Can't merge non-disjoint spans. If you want to use spaCy's retokenizer, you have to make sure you only get non-overlapping spans, e.g. by changing the pattern to "a noun followed by a singular noun" ([{"POS": "NOUN"}, {"TAG": "NN"}]). That works and gives the following result:
>>> for tok in doc:
...     print(tok)
conversion
of
existing
underground
gas storage # <- The match is now one token
facilities
into
storage
facilities
dedicated
to
hydrogen
-
storage
If you only need the string, I would suggest using the Matcher demonstrated above to find the spans, and then a custom function that merges the tokens based on those spans - that should be more flexible than the built-in retokenizer.
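A minimal sketch of such a custom function (my own illustration, not a spaCy API): it walks the tokens and, whenever the target word directly follows a noun, replaces the preceding whitespace with a hyphen. It only relies on the .text, .pos_ and .whitespace_ attributes, so it works on a spaCy Doc or any token-like sequence:

```python
def hyphenate_after_noun(doc, target="storage"):
    """Rebuild the document text, joining `target` to a directly
    preceding NOUN with a hyphen instead of a space.

    `doc` is a spaCy Doc or any indexable sequence of token-like
    objects exposing .text, .pos_ and .whitespace_.
    """
    pieces = []
    for i, tok in enumerate(doc):
        if i > 0 and tok.text == target and doc[i - 1].pos_ == "NOUN":
            # Replace the whitespace emitted after the previous
            # token with a hyphen.
            pieces[-1] = "-"
        pieces.append(tok.text)
        pieces.append(tok.whitespace_)
    return "".join(pieces).strip()
```

With the tagging shown at the top of the question, this turns "gas storage" into "gas-storage" but leaves "into storage" and "hydrogen-storage" untouched, which is your expected output.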