spaCy: how to add a colon to the tokenizer's special-case rules

I have the following sentence:

'25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'

I want the colon to be split off from the surrounding words as its own token.

By default, this is what spaCy returns:

print([w.text for w in nlp('25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?')])
['25', ')', 'Figure', '9:“lines', 'are', 'results', 'of', 'two', '-', 'step', 'adsorption', 'model', '”', '-', '>', 'What', 'method', '/', 'software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']

What I would like to get is:

['25', ')', 'Figure', '9', ':', '“', 'lines', 'are', 'results', 'of', 'two', '-', 'step', 'adsorption', 'model', '”', '-', '>', 'What', 'method', '/', 'software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']

I tried this:

# Add special case rule
from spacy.symbols import ORTH
special_case = [{ORTH: ":"}]
nlp.tokenizer.add_special_case(":", special_case)

But it had no effect; the print shows the same tokens as before.

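A special-case rule only matches a whole chunk (after prefixes and suffixes have been stripped), so it never applies to a colon in the middle of 9:“lines. Instead, add the colon to the infix patterns with compile_infix_regex and replace nlp.tokenizer.infix_finditer: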

import spacy
from spacy.util import compile_infix_regex

text = "'25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'"

nlp = spacy.load("en_core_web_md")
# Put ":" in front of the default infix patterns.
# list(...) works whether Defaults.infixes is a tuple or a list.
infixes = [":"] + list(nlp.Defaults.infixes)
infix_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
doc = nlp(text)
for tok in doc:
    print(tok, end=", ")

', 25, ), Figure, 9, :, “lines, are, results, of, two, -, step, adsorption, model, ”, -, >, What, method, /, software, was, used, for, the, curve, fitting, ?, ',
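Note that in the output above the colon is now its own token, but the opening quote is still attached in “lines. The following is a rough, standard-library-only sketch of what infix splitting does to a single chunk, which may help explain that result. It is a simplification and not spaCy's actual implementation; the names INFIX_RE and split_on_infixes are hypothetical, and the pattern contains only the colon.

```python
import re

# Hypothetical, simplified infix pattern: only the colon.
INFIX_RE = re.compile(r":")

def split_on_infixes(chunk):
    """Split one whitespace-delimited chunk at every infix match."""
    pieces = []
    start = 0
    for match in INFIX_RE.finditer(chunk):
        if match.start() > start:
            pieces.append(chunk[start:match.start()])  # text before the infix
        pieces.append(match.group())                   # the infix itself
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])                   # whatever is left over
    return pieces

print(split_on_infixes("9:“lines"))  # ['9', ':', '“lines']
```

In this sketch the text after the colon is emitted as one piece and is not re-checked for a leading quote, which matches the “lines token seen in the output.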

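Alternatively, just use NLTK's word_tokenize:

# word_tokenize relies on NLTK's Punkt tokenizer data, which may need to be downloaded first
from nltk.tokenize import word_tokenize
string_my = '25) Figure 9:“lines are results of two-step adsorption model” -> What method/software was used for the curve fitting?'
word_tokenize(string_my)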
['25', ')', 'Figure', '9', ':', '“', 'lines', 'are', 'results', 'of', 'two-step', 'adsorption', 'model', '”', '-', '>', 'What', 'method/software', 'was', 'used', 'for', 'the', 'curve', 'fitting', '?']
