spaCy custom tokenizer to keep hyphenated words as single tokens using an infix regex

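I want spaCy to treat hyphenated words such as long-term, self-esteem, etc. as single tokens. After looking at some similar posts on StackOverflow, GitHub, the spaCy documentation and elsewhere, I wrote a custom tokenizer as follows: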

import re
import spacy
from spacy.tokenizer import Tokenizer

prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.,?:;...‘’`“”"'~]''')

def custom_tokenizer(nlp):
    # Build a tokenizer from the hand-written prefix, suffix and infix patterns
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('en_core_web_lg')
nlp.tokenizer = custom_tokenizer(nlp)
# Double-quoted so the apostrophe in "it's" does not end the string literal
doc = nlp("Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.")
[token.text for token in doc]

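So for this sentence: Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.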

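Now, the tokens after plugging in the custom spaCy tokenizer are:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', "it's", 'a', 'male-dominated', 'profession', '.'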

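Earlier, before this change, the tokens were:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male', '-', 'dominated', 'profession', '.'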

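And the expected tokens should be:

'Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.'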

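Summary: As you can see, the hyphenated word is now kept as a single token, and the other punctuation marks are split off correctly, apart from the double quotes and the apostrophe. Those two no longer behave as they did before, or as expected.
I have tried various permutations and combinations of the infix regex, but have made no progress on this issue.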

Using the default prefix_re and suffix_re gives me the expected output:

import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    # Custom infix pattern without the hyphen, so hyphenated words stay whole
    infix_re = re.compile(r'''[.,?:;...‘’`“”"'~]''')
    # Reuse spaCy's default prefix and suffix rules (curly quotes, 's, etc.)
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

# The original answer used the spaCy v2 shortcut spacy.load('en');
# spaCy v3+ needs a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = custom_tokenizer(nlp)
doc = nlp("Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it's a male-dominated profession.")
[token.text for token in doc]

['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
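
To see why the hand-written prefix and suffix patterns from the question leave the curly quotes attached, the three patterns can be tried on their own with plain `re`, without spaCy. This is only a minimal sketch: the sample strings are picked for illustration, and the variable names ending in `_hit` or `_hits` are hypothetical.

```python
import re

# Patterns from the question (bracket escapes written out explicitly).
prefix_re = re.compile(r'''^[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']$''')
infix_re = re.compile(r'''[.,?:;...‘’`“”"'~]''')

word = '“medicine”'

# The prefix and suffix classes only list straight quotes and brackets,
# so neither pattern matches a word wrapped in curly quotes.
prefix_hit = prefix_re.search(word)
suffix_hit = suffix_re.search(word)

# The infix class does list the curly quotes, so they are only seen as infixes.
quote_hits = [m.group() for m in infix_re.finditer(word)]

# The infix class has no hyphen, so a hyphenated word yields no infix matches.
hyphen_hits = [m.group() for m in infix_re.finditer('male-dominated')]

print(prefix_hit, suffix_hit, quote_hits, hyphen_hits)
```

Since the curly quotes are only reachable through the infix pattern here, they are not peeled off as prefixes or suffixes, which is consistent with the default prefix and suffix rules fixing the output.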

If you want to dig into why your regexes don't behave like spaCy's, here are links to the relevant source code:

The prefixes and suffixes are defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

They refer to character classes (e.g. quotes, hyphens, etc.) defined here:

https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

And the functions used to compile them (e.g. compile_prefix_regex) are here:

https://github.com/explosion/spaCy/blob/master/spacy/util.py
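
As a rough idea of what those compile helpers do, the sketch below anchors each entry at the start (prefix) or end (suffix) of the string and joins the entries with `|`. It is a simplified re-implementation for illustration, not spaCy's actual code: the function names `sketch_prefix_regex` and `sketch_suffix_regex` and the short sample entry lists are hypothetical, and the real defaults contain many more entries.

```python
import re

def sketch_prefix_regex(entries):
    # Anchor every piece at the start of the string and join with alternation.
    return re.compile("|".join("^" + piece for piece in entries if piece.strip()))

def sketch_suffix_regex(entries):
    # Anchor every piece at the end of the string and join with alternation.
    return re.compile("|".join(piece + "$" for piece in entries if piece.strip()))

# Hypothetical, heavily shortened stand-ins for the default entry lists.
sample_prefixes = [r"\(", r"\[", "“", '"', "'"]
sample_suffixes = [r"\)", r"\]", "”", '"', "'s", "'", ";", ","]

prefix_re = sketch_prefix_regex(sample_prefixes)
suffix_re = sketch_suffix_regex(sample_suffixes)

opening = prefix_re.search("“medicine”").group()   # curly opening quote
closing = suffix_re.search("“medicine”").group()   # curly closing quote
clitic = suffix_re.search("it's").group()          # the 's suffix entry

print(opening, closing, clitic)
```

Because entries such as the curly quotes and `'s` are part of the lists passed in, the compiled patterns can strip them from the edges of a word, which the short hand-written character classes in the question could not do.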
