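Make the spaCy tokenizer not split on /

How do I modify the English tokenizer so that it does not split tokens on the '/' character?

For example, the following string should be a single token: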


import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp("12/AB/568793")
for t in doc:
    print(f"[{t.pos_} {t.text}]")
# produces
#[NUM 12]
#[SYM /]
#[ADJ AB/568793]

The approach follows "Modifying existing rule sets" from the spaCy documentation:


import spacy

nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# there seems to be just one rule that splits on '/'
assert len([x for x in infixes if '/' in x]) == 1
# remove that rule, then rebuild the tokenizer's infix matcher
infixes = [x for x in infixes if '/' not in x]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

@Dave's answer is a good starting point, but I think the right approach is to modify the rule rather than remove it.

import spacy

nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# pick out the infix rule that mentions '/'
rule_slash = [x for x in infixes if '/' in x][0]
print(rule_slash)  # check the rule

You will see that the rule also covers other characters, including "=", "<", ">", and so on.

We remove only "/" from the rule:

rule_slash_new = rule_slash.replace('/', '')
# replace the old rule with the new rule
infixes = [r if r != rule_slash else rule_slash_new for r in infixes]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

This way, the tokenizer still splits on strings such as "A=B" or "A>B".
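To see why editing the rule keeps the other splits, here is a minimal sketch using only Python's `re` module. The pattern `RULE_SLASH` is a simplified, hypothetical stand-in (ASCII letters only), not the exact string spaCy ships; the real rule printed by `print(rule_slash)` may look different depending on the spaCy version. The helper `infix_hits` is likewise made up for illustration.

```python
import re

# Assumption: a simplified stand-in for the spaCy infix rule that contains '/'.
# The real rule uses a much larger alphabet class and may differ by version.
RULE_SLASH = r"(?<=[a-zA-Z0-9])[:<>=/](?=[a-zA-Z])"

# Same edit as in the answer: drop only '/' from the character class.
RULE_SLASH_NEW = RULE_SLASH.replace("/", "")


def infix_hits(rule, text):
    """Return the characters in `text` that `rule` would treat as infixes."""
    return [m.group() for m in re.finditer(rule, text)]


for sample in ("12/AB/568793", "A=B", "A>B"):
    # Show the matches before and after removing '/' from the rule.
    print(sample, infix_hits(RULE_SLASH, sample), infix_hits(RULE_SLASH_NEW, sample))
```

With this stand-in pattern, the edited rule no longer matches the "/" in "12/AB/568793" but still matches "=" and ">". Deleting the whole rule would lose those matches too.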
