I am new to spaCy. Can we tell the spaCy API to ignore symbols when it produces tokens?
Example:
For the sentence "Hi, Welcome to StackOverflow.", the tokens are
Hi
,
Welcome
to
StackOverflow
.
I want spaCy to produce tokens only for whitespace-separated words. For the example above, the tokens should be
Hi,
Welcome
to
StackOverflow.
Try filtering out the punctuation tokens with is_punct:
import spacy
nlp = spacy.load("en_core_web_sm")
txt = "Hi, Welcome to StackOverflow."
doc = nlp(txt)
tokens = [tok.text for tok in doc if not tok.is_punct]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
You may want to define your own list of punctuation marks:
punctuation = [".",",","!"]
tokens = [tok.text for tok in doc if tok.text not in punctuation]
print(tokens)
['Hi', 'Welcome', 'to', 'StackOverflow']
Or use the ready-made one from the string module:
from string import punctuation
print(punctuation)
# Build a set so each token is matched as a whole string, not as a substring
punct_set = set(punctuation)
tokens = [tok.text for tok in doc if tok.text not in punct_set]
print(tokens)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Hi', 'Welcome', 'to', 'StackOverflow']
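Note that the filters above drop the punctuation tokens instead of keeping them attached to the neighbouring word, which is what the question asks for. One way to get "Hi," and "StackOverflow." as single tokens is to replace the pipeline's tokenizer with one that splits on whitespace only. The following is a minimal sketch, assuming spaCy v3; WhitespaceTokenizer, nlp_ws, doc_ws and whitespace_tokens are illustrative names, not part of spaCy.

```python
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical helper: split on whitespace only, leaving punctuation attached."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()
        # Mark every token as followed by a space, except the last one.
        spaces = [True] * len(words)
        if spaces:
            spaces[-1] = False
        return Doc(self.vocab, words=words, spaces=spaces)


# A blank English pipeline avoids a model download (assumption: spaCy v3).
nlp_ws = spacy.blank("en")
nlp_ws.tokenizer = WhitespaceTokenizer(nlp_ws.vocab)

doc_ws = nlp_ws("Hi, Welcome to StackOverflow.")
whitespace_tokens = [tok.text for tok in doc_ws]
print(whitespace_tokens)
# ['Hi,', 'Welcome', 'to', 'StackOverflow.']
```

The same tokenizer can be assigned to a pipeline loaded with spacy.load("en_core_web_sm"), but the trained components were probably fitted on spaCy's default tokenization, so tags and parses may be less reliable on tokens such as "Hi,".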