Is it possible to change the token-splitting rules of the spaCy tokenizer?



The German spaCy tokenizer does not split on slashes, underscores, or asterisks by default, which is exactly what I need (so "der/die" yields a single token).

It does split on parentheses, however, so "dies(und)das" is split into 5 tokens. Is there a (simple) way to tell the default tokenizer not to split on parentheses that have letters on both sides and no surrounding whitespace?
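The behavior can be reproduced without downloading a model by using `spacy.blank("de")`, which loads the German tokenizer defaults:

```python
import spacy

nlp = spacy.blank("de")  # German tokenizer defaults, no statistical model needed
print([t.text for t in nlp("dies(und)das")])  # ['dies', '(', 'und', ')', 'das']
```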

Where are these splits on parentheses defined for the tokenizer?

The split on parentheses is defined in this line, where the tokenizer splits on brackets that have a letter on each side:

https://github.com/explosion/spaCy/blob/23ec07debd568f09c7c83b10564850f9f97ad4/spacy/lang/de/punctuation.py#L18
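The rule in question is a regex pattern that matches a quote or bracket character with a letter on each side (a sketch; `punctuation.py` builds `_quotes` from `CONCAT_QUOTES` with a slight modification, so `CONCAT_QUOTES` is used directly here as an approximation):

```python
import re
from spacy.lang.char_classes import ALPHA, CONCAT_QUOTES

# approximation of the infix rule from spacy/lang/de/punctuation.py:
# a quote or bracket character with a letter immediately before and after
pattern = r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=CONCAT_QUOTES)
print(re.findall(pattern, "dies(und)das"))  # ['(', ')']
```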

There is no simple way to remove an infix pattern, but you can define a custom tokenizer that does what you want. One way is to copy the infix definition from spacy/lang/de/punctuation.py and modify it:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.lang.de.punctuation import _quotes
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
            r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
            r'(?<=[{a}])[:<>=](?=[{a}])'.format(a=ALPHA),
            r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
            r"(?<=[{a}])([{q}])(?=[{a}])".format(a=ALPHA, q=_quotes),  # brackets removed here
            r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
            r"(?<=[0-9])-(?=[0-9])",
        ]
    )
    infix_re = compile_infix_regex(infixes)
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp = spacy.load('de')
nlp.tokenizer = custom_tokenizer(nlp)
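A quick way to check the effect (a sketch; `spacy.blank("de")` loads the German defaults without a model, and the filter below simply drops the whole quote/bracket infix rule rather than editing it, so it also stops splitting on quotes between letters):

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("de")  # German tokenizer defaults, no model needed
# drop the infix rule containing an escaped ")" (the quote/bracket rule)
infixes = [p for p in nlp.Defaults.infixes if r"\)" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([t.text for t in nlp("dies(und)das")])  # ['dies(und)das']
```

As shown here, you can also overwrite `infix_finditer` on the existing tokenizer instead of constructing a new `Tokenizer` from scratch.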
