spaCy tokenizer does not always recognize a period as a suffix

I've been working on a custom NER model to extract products that have odd identifiers I have no control over.

As you can see from this example, in some cases the tokenizer does not split off the period as a suffix. I've already added a custom tokenizer to handle hyphenated products (below). What do I need to add to handle this case without jeopardizing the rest of the existing tokenization? Any input would be appreciated.

issue_text = "I really like stereo receivers, I want to buy the new ASX8E11F." 
print(nlp_custom_ner.tokenizer.explain(issue_text))
issue_text = "I really like stereo receivers, I want to buy the new RK8BX." 
print(nlp_custom_ner.tokenizer.explain(issue_text))

[('TOKEN', 'I'), ('TOKEN', 'really'), ('TOKEN', 'like'), ('TOKEN', 'stereo'), ('TOKEN', 'receivers'), ('SUFFIX', ','), ('TOKEN', 'I'), ('TOKEN', 'want'), ('TOKEN', 'to'), ('TOKEN', 'buy'), ('TOKEN', 'the'), ('TOKEN', 'new'), ('TOKEN', 'ASX8E11F.')]
[('TOKEN', 'I'), ('TOKEN', 'really'), ('TOKEN', 'like'), ('TOKEN', 'stereo'), ('TOKEN', 'receivers'), ('SUFFIX', ','), ('TOKEN', 'I'), ('TOKEN', 'want'), ('TOKEN', 'to'), ('TOKEN', 'buy'), ('TOKEN', 'the'), ('TOKEN', 'new'), ('TOKEN', 'RK8BX'), ('SUFFIX', '.')]

I added a custom infix tokenizer to handle the hyphenated products:

import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("AXDR-PXXT-001")
print([t.text for t in doc])
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # ✅ Commented out regex that splits on hyphens between letters:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp("AXDR-PXXT-001")
print([t.text for t in doc])

['AXDR', '-', 'PXXT-001']
['AXDR-PXXT-001']

As with your infix modification, you need to look at the current suffix patterns and edit the rule that is responsible for this behavior.
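
For reference, a quick way to list the current suffix rules so you can find the culprit (a minimal sketch, assuming spaCy v3, where the defaults are exposed as nlp.Defaults.suffixes):

import spacy

nlp = spacy.load("en_core_web_sm")
# Print every default suffix pattern; the period-related ones are the
# candidates for the behavior above.
for pattern in nlp.Defaults.suffixes:
    print(pattern)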

In this particular case, it is probably this rule in the default suffixes:

r"(?<=[{au}][{au}]).".format(au=ALPHA_UPPER)