spaCy infixes for a set of characters

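I am trying to train a Named Entity Recognition model with spaCy. As part of that, I need to convert sentences into spaCy documents for training and prediction. Here is the initial approach I used: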

import spacy
# Taking a blank model
nlp = spacy.blank('xx')
# Convert a sentence to document
doc = nlp("Hafiz's e-book reader.")

But when I print the resulting tokens, I get the following:

>>> print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
['Hafiz', "'s", 'e', '-', 'book', 'reader', '.'] # printing

I want spaCy not to split on certain characters, namely ["'", "-", "_"]. So I did the following:

import spacy
nlp = spacy.blank('xx')
skip_on = ["'", "-", "_"]
infixes = nlp.Defaults.infixes
infixes = [x for x in infixes if not set(x).intersection(set(skip_on))] # Set intersection is done just to see if any of the desired characters exist
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

Now, after applying this to the same sentence, I get:

doc = nlp("Hafiz's e-book reader.")
>>> print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
['Hafiz', "'s", 'e-book', 'reader', '.'] # printing

As we can see, the model now correctly understands that it should not split on the hyphen (e-book stays intact).

But the problem is that I cannot get the same behavior for the apostrophe (see: Hafiz's becomes Hafiz and 's). How can I fix this?

Note: I want the following output:

["Hafiz's", 'e-book', 'reader', '.']

Update: this essentially skips splitting on all punctuation between text, not just the listed punctuation and symbols (["'", "-", "_"]):

doc = nlp("Hafiz's e-book reader.abc")
>>> print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])       
['Hafiz', "'s", 'e-book', 'reader.abc'] # printing

Here, reader.abc should have been split, because . is not in the list.
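One plausible explanation for this over-filtering is that set(x) looks at every character of the regex source, so a pattern is dropped whenever "-" appears anywhere in it, including inside character ranges such as a-z. The sketch below uses only the standard library and three made-up patterns (they are illustrative, not spaCy's actual rules) to compare that character-level filter with a hypothetical behavioral check, splits_on, which asks whether a pattern actually matches the character between two letters:

```python
import re

skip_on = ["'", "-", "_"]

# Hypothetical infix-style patterns, for illustration only.
patterns = [
    r"(?<=[a-z])\.(?=[a-z])",  # period between letters ("-" occurs only in the ranges)
    r"(?<=[a-z])-(?=[a-z])",   # literal hyphen between letters
    r"(?<=[a-z]),(?=[a-z])",   # comma between letters
]

# Character-level filter, as in the question: drops any pattern whose
# source text contains one of the characters, even inside "a-z".
kept_by_char_filter = [p for p in patterns if not set(p).intersection(skip_on)]

def splits_on(pattern, char):
    """Hypothetical helper: does the pattern match `char` between two letters?"""
    return re.search(pattern, f"a{char}b") is not None

# Behavioral filter: drop a pattern only if it really matches a skipped character.
kept_by_match_filter = [
    p for p in patterns if not any(splits_on(p, c) for c in skip_on)
]

print(len(kept_by_char_filter), len(kept_by_match_filter))
```

The probe string is a heuristic: it only tests a letter-character-letter context, so patterns that depend on digits or uppercase neighbours would need additional probes.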

Reference: spaCy | Modifying existing rule sets

You can use tokenizer.explain() to see some debugging information about which tokenizer settings lead to a particular tokenization:

import spacy
nlp = spacy.blank("xx")
skip_on = ["'", "-", "_"]
infixes = [x for x in nlp.Defaults.infixes if not set(x).intersection(set(skip_on))]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer
print(nlp.tokenizer.explain("Hafiz's"))
# [('TOKEN', 'Hafiz'), ('SUFFIX', "'s")]
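The explain() output attributes the 's split to a SUFFIX rule rather than an infix, which is why editing the infixes alone does not change it. A minimal sketch of one possible follow-up, assuming the default suffix list of the blank xx pipeline contains the literal entries "'s" and "'S" (this may vary by spaCy version, so it is worth inspecting nlp.Defaults.suffixes first), is to drop those entries and recompile the suffix regex:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("xx")

# Assumption: the possessive suffixes appear as literal list entries.
possessive = {"'s", "'S"}
suffixes = [s for s in nlp.Defaults.suffixes if s not in possessive]

# Replace the tokenizer's suffix matcher with one built from the reduced list.
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [t.text for t in nlp("Hafiz's reader.")]
print(tokens)
print(nlp.tokenizer.explain("Hafiz's"))
```

This only covers the straight apostrophe; typographic variants such as ’s would need the same treatment. It can be combined with an infix change to keep e-book intact as well.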
