Ignoring punctuation when tokenizing with spaCy



I'm new to spaCy. Can we tell the spaCy API to ignore punctuation when it produces tokens?

Example:

For the sentence "Hi, Welcome to StackOverflow.", the tokens are:

Hi
,
Welcome
to
StackOverflow
.

I want spaCy to produce tokens only for whitespace-separated words. For the example above, the tokens should be:

Hi,
Welcome
to
StackOverflow.

Try:

import spacy
nlp = spacy.load("en_core_web_sm")
txt = "Hi, Welcome to StackOverflow."
doc = nlp(txt)
tokens = [tok.text for tok in doc if not tok.is_punct]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']

You may want to define your own list of punctuation marks:

punctuation = [".",",","!"]
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']

Or use the ready-made one from the string module:

from string import punctuation
print(punctuation)
# Keep only tokens whose text is not found in string.punctuation
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Hi', 'Welcome', 'to', 'StackOverflow']
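
Note that the snippets above drop punctuation tokens, whereas the question asks for tokens split only on whitespace, with punctuation left attached ("Hi," and "StackOverflow."). One possible way to get that is to replace the pipeline's tokenizer with a whitespace-only one. The following is a minimal sketch, not a definitive implementation: the class name WhitespaceTokenizer is a hypothetical helper introduced here, and spacy.blank("en") is used only so that no trained model has to be downloaded.

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical helper: builds a Doc by splitting on whitespace only."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # str.split() with no argument splits on any run of whitespace,
        # so punctuation stays attached to the neighbouring word.
        words = text.split()
        return Doc(self.vocab, words=words)


# Assumption: a blank English pipeline is enough for this illustration.
nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

doc = nlp("Hi, Welcome to StackOverflow.")
tokens = [tok.text for tok in doc]
print(tokens)  # ['Hi,', 'Welcome', 'to', 'StackOverflow.']
```

Keep in mind that trained pipelines such as en_core_web_sm were trained on spaCy's default tokenization, so swapping the tokenizer there may reduce the accuracy of downstream components. If only the strings are needed, plain txt.split() gives the same list without spaCy.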
