Correctly tokenizing English contractions with Unicode apostrophes



How can I modify the default spaCy (v3.0.5) tokenizer so that it correctly splits English contractions when a Unicode apostrophe is used instead of '?

import spacy

nlp = spacy.load('en_core_web_sm')
# Straight apostrophe plus common Unicode look-alikes
apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
>>> 
[do, n't]
[donʹt]
[donʻt]
[donʼt]
[donʽt]
[donˈt]
[donˊt]
[donˋt]
[don`t]
[don´t]

The desired output for all examples is [do, n't].

My best guess was to extend the default tokenizer_exceptions with all possible apostrophe variants. But this doesn't work, because the Tokenizer does not allow special cases to modify the text:

import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')
apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']

# Copy the default exceptions and add a variant of every rule that contains '
default_rules = nlp.Defaults.tokenizer_exceptions
extended_rules = default_rules.copy()
for key, val in default_rules.items():
    if "'" in key:
        for apo in apostrophes:
            extended_rules[key.replace("'", apo)] = val

# Rebuild the tokenizer with the extended rules
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
nlp.tokenizer = spacy.tokenizer.Tokenizer(
    nlp.vocab,
    rules=extended_rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)

apostrophes = ["'", '\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for apo in apostrophes:
    text = f"don{apo}t"
    print([t for t in nlp(text)])
>>> ValueError: [E997] Tokenizer special cases are not allowed to modify the text. This would map ':`(' to ':'(' given token attributes '[{65: ":'("}]'.
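The error message shows what actually goes wrong: the copied values still spell their tokens with the straight apostrophe (attribute 65 is ORTH), so a key like don`t would map to tokens that concatenate back to don't, which counts as modifying the text. A minimal sketch of a fix, assuming the default exception values use integer attribute IDs as the error message suggests: rewrite the apostrophe inside each token's ORTH as well, then register the result with add_special_case.

import spacy
from spacy.attrs import ORTH

nlp = spacy.load('en_core_web_sm')
apostrophes = ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']
for orth, token_dicts in nlp.Defaults.tokenizer_exceptions.items():
    if "'" not in orth:
        continue
    for apo in apostrophes:
        # Rewrite each token's ORTH so the tokens concatenate exactly
        # to the new key; NORM and any other attributes are kept as-is.
        new_case = []
        for td in token_dicts:
            new_td = dict(td)
            new_td[ORTH] = td[ORTH].replace("'", apo)
            new_case.append(new_td)
        nlp.tokenizer.add_special_case(orth.replace("'", apo), new_case)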

You just need to add the exception without changing the text:

import spacy
from spacy.attrs import ORTH, NORM

nlp = spacy.load('en_core_web_sm')
case = [{ORTH: "do"}, {ORTH: "n`t", NORM: "not"}]
tokenizer = nlp.tokenizer
tokenizer.add_special_case("don`t", case)
doc = nlp("I don`t believe in bugs")
print(list(doc))
# => [I, do, n`t, believe, in, bugs]
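
The same pattern extends to the other apostrophe variants from the question; a short sketch reusing the tokenizer set up above:

# Register the same special case for every apostrophe variant
for apo in ['\u02B9', '\u02BB', '\u02BC', '\u02BD', '\u02C8', '\u02CA', '\u02CB', '\u0060', '\u00B4']:
    case = [{ORTH: "do"}, {ORTH: f"n{apo}t", NORM: "not"}]
    tokenizer.add_special_case(f"don{apo}t", case)

The ORTH values must concatenate exactly to the matched string, which is why the split keeps the variant apostrophe and only NORM carries the normalization to "not".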

If you want to change the text, you should do that as a preprocessing step.
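
For example, a minimal sketch (the helper name and regex are assumptions, not spaCy API): map every apostrophe variant to the straight ' before the text reaches the pipeline. Each variant is a single character, so the replacement preserves character offsets.

import re
import spacy

nlp = spacy.load('en_core_web_sm')

# Hypothetical helper: normalize apostrophe variants before tokenizing
APOSTROPHE_RE = re.compile('[\u02B9\u02BB\u02BC\u02BD\u02C8\u02CA\u02CB\u0060\u00B4]')

def normalize_apostrophes(text: str) -> str:
    return APOSTROPHE_RE.sub("'", text)

doc = nlp(normalize_apostrophes('don\u02BCt'))
print(list(doc))
# => [do, n't]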
