Merge tokens based on the preceding POS tags



I want to implement some text manipulation as preprocessing for keyphrase extraction. Please see the example below:

import spacy
text = "conversion of existing underground gas storage facilities into storage facilities dedicated to hydrogen-storage"
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(f'{token.text:{8}} {token.pos_:{6}} {token.tag_:{6}} {token.dep_:{6}} {spacy.explain(token.pos_):{20}} {spacy.explain(token.tag_)}')

Result:

conversion NOUN   NN     ROOT   noun                 noun, singular or mass
of       ADP    IN     prep   adposition           conjunction, subordinating or preposition
existing VERB   VBG    amod   verb                 verb, gerund or present participle
underground ADJ    JJ     amod   adjective            adjective (English), other noun-modifier (Chinese)
gas      NOUN   NN     compound noun                 noun, singular or mass
storage  NOUN   NN     compound noun                 noun, singular or mass
facilities NOUN   NNS    pobj   noun                 noun, plural
into     ADP    IN     prep   adposition           conjunction, subordinating or preposition
storage  NOUN   NN     compound noun                 noun, singular or mass
facilities NOUN   NNS    pobj   noun                 noun, plural
dedicated VERB   VBN    acl    verb                 verb, past participle
to       ADP    IN     prep   adposition           conjunction, subordinating or preposition
hydrogen NOUN   NN     compound noun                 noun, singular or mass
-        PUNCT  HYPH   punct  punctuation          punctuation mark, hyphen
storage  NOUN   NN     pobj   noun                 noun, singular or mass

I want to detect when a given word (e.g. storage) is preceded by a noun (e.g. gas storage), so that I can replace the space character with a hyphen (as has already been done in hydrogen-storage). But when my word is preceded by a non-noun POS element (e.g. into storage), I don't want to change the space character.

Expected output: "conversion of existing underground gas-storage facilities into storage facilities dedicated to hydrogen-storage"

Is there an efficient way to do this?

Thanks in advance for your help.

spaCy provides a rule-based Matcher. It lets you define rules to find patterns such as a noun followed by a noun:

from spacy.matcher import Matcher
pattern = [{"POS": "NOUN"}, {"POS": "NOUN"}]
matcher = Matcher(nlp.vocab)
matcher.add("MultiWordExpression", [pattern])

…which you can use to extract the matching sequences (this is almost verbatim from the spaCy documentation):

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

The output for your text is:

8584524718281925236 MultiWordExpression 4 6 gas storage
8584524718281925236 MultiWordExpression 5 7 storage facilities
8584524718281925236 MultiWordExpression 8 10 storage facilities

There is also the retokenizer.merge method for merging tokens, but it does not work in this case; see below.

with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        retokenizer.merge(doc[start:end])

In your case there are overlapping spans ("gas storage" and "storage facilities" overlap), which raises ValueError: [E102] Can't merge non-disjoint spans.. If you want to use spaCy's retokenizer, you have to make sure you only get non-overlapping spans, e.g. by changing the pattern to "a noun followed by a singular noun" ([{"POS": "NOUN"}, {"TAG": "NN"}]). That works and gives the following result:

>>> for tok in doc:
>>>     print(tok)
conversion
of
existing
underground
gas storage # <- The match is now one token
facilities
into
storage
facilities
dedicated
to
hydrogen
-
storage

If you only need the string, I would recommend using the Matcher as demonstrated above to find the spans, and then a custom function to merge the tokens based on those spans; that should be more flexible than the built-in retokenizer.
