spaCy: How can I implement a special lookbehind in the word tokenizer?

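I am working on a text corpus in which many individual tokens contain punctuation such as : - ) ( @. For example: TMI-Cu(OH). I therefore want to customize the tokenizer so that it does not split on : - ) ( @ when they are tightly surrounded by digits/letters (no whitespace).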

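From this post, I learned that I can modify infix_finditer to achieve this. However, that solution still splits on ) when the ) is not followed by a letter or digit, as this example shows: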

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Split only on these infix characters; - : @ ( ) are deliberately left out
    infix_re = re.compile(r'''[.,?:;‘’`“”"'~]''')
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
print([token.text for token in doc0])

The output is ['This', 'is', 'TMI-Cu(OH', ')', ',', 'and', 'somethig', 'else'], where the single token TMI-Cu(OH) has been split into two tokens, ['TMI-Cu(OH', ')'].

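Is it possible to implement a "lookbehind" behavior in the tokenizer? That is, for a ')' followed by a non-word/non-digit character, before splitting on it to produce a new token, first look back to see whether there is any whitespace between that ')' and its paired '('. If there is no whitespace, do not split.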

You need to remove ) from the suffixes:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Match every special char (non-word, non-whitespace, or _) except - : @ ( )
    infix_re = re.compile(r'''(?:[^\w\s]|_)(?<![-:@()])''')
    # Work on a copy so the shared language defaults are not mutated
    suffixes = list(nlp.Defaults.suffixes)
    suffixes.remove(r'\)')   # Remove the `\)` pattern from the suffixes
    suffix_re = compile_suffix_regex(suffixes)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
test_str0 = 'This is TMI-Cu(OH), and somethig else'
doc0 = nlp(test_str0)
print([token.text for token in doc0])

Output:

['This', 'is', 'TMI-Cu(OH)', ',', 'and', 'somethig', 'else']
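To see why dropping the \) suffix pattern changes the result, here is a rough, standard-library-only sketch of repeated suffix stripping. It is not spaCy's actual implementation: the three-entry suffix list and the helpers compile_suffixes and strip_suffixes are hypothetical names made up for illustration, and the real default suffix list is much longer.

```python
import re

def compile_suffixes(patterns):
    # Hypothetical helper: one alternation, each pattern anchored at the end
    return re.compile("|".join(p + "$" for p in patterns))

def strip_suffixes(chunk, suffix_re):
    # Peel matching suffixes off the end of a whitespace-delimited chunk
    tail = []
    while True:
        m = suffix_re.search(chunk)
        if m is None or m.start() == 0:
            break
        tail.insert(0, m.group())
        chunk = chunk[:m.start()]
    return [chunk] + tail

toy_suffixes = [r",", r"\)", r"\."]  # assumed toy list, not spaCy's defaults
with_paren = compile_suffixes(toy_suffixes)
without_paren = compile_suffixes([p for p in toy_suffixes if p != r"\)"])

print(strip_suffixes("TMI-Cu(OH),", with_paren))     # ['TMI-Cu(OH', ')', ',']
print(strip_suffixes("TMI-Cu(OH),", without_paren))  # ['TMI-Cu(OH)', ',']
```

In this toy model, once the trailing comma is peeled off, the ) is exposed at the end of the chunk and is peeled off too, unless its pattern has been removed from the suffix list.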

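Note that the (?:[^\w\s]|_)(?<![-:@()]) regex I use for infix matching matches any special character (anything that is neither a word character nor whitespace, plus the underscore) except the -, :, @, ( and ) characters.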
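The pattern can be checked on its own with the re module, without loading spaCy. The following is a small sketch; infix_hits is a hypothetical helper name and the sample strings are made up for illustration.

```python
import re

# Same pattern as infix_re above: a non-word, non-whitespace character (or _)
# whose just-matched character is not one of - : @ ( )
infix_re = re.compile(r'(?:[^\w\s]|_)(?<![-:@()])')

def infix_hits(text):
    """Return the characters the infix pattern would split on."""
    return [m.group() for m in infix_re.finditer(text)]

print(infix_hits('TMI-Cu(OH)'))        # []
print(infix_hits('a,b;c_d'))           # [',', ';', '_']
print(infix_hits('user@host:8080/x'))  # ['/']
```

The lookbehind sits after the character class, so it inspects the character that was just consumed; this is what lets the pattern match "any special character" and then veto the five exceptions.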
