spaCy custom NER model: error when training the dependency parser



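I am trying to build a custom NER model with spaCy. After building the model for the entities, I also need to train the model for the dependency parser. I tried to follow the example code given on the spaCy website: https://spacy.io/usage/training#tagger-parser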

The example training data given on the spaCy website is:

TRAIN_DATA = [
    (
        "They trade mortgage-backed securities.",
        {
            "heads": [1, 1, 4, 4, 5, 1, 1],
            "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"],
        },
    )
]

In this example, the training data contains a key called "heads". I do not understand what exactly it is or what role it plays in the code.

I tried to train the model without the "heads" key in the training data. An example of my training data:

TRAIN_PARSER = ('Mr Manjunath who is in-charge of the motor at their Goa location.', {'deps': ['compound', 'ROOT', 'nsubj', 'relcl', 'prep', 'punct', 'pobj', 'prep', 'det', 'pobj', 'prep', 'poss', 'compound', 'pobj', 'punct']})

When I run the model without the "heads" key, using the code below:

from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

# training data
TRAIN_DATA = TRAIN_PARSER


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model='model1', output_dir='model2', n_iter=74):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # add the parser to the pipeline if it doesn't exist
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "parser" not in nlp.pipe_names:
        parser = nlp.create_pipe("parser")
        nlp.add_pipe(parser, first=True)
    # otherwise, get it, so we can add labels to it
    else:
        parser = nlp.get_pipe("parser")
    # add labels to the parser
    for _, annotations in TRAIN_DATA:
        for dep in annotations.get('deps', []):
            parser.add_label(dep)
    # get names of other pipes to disable them during training
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)
    # test the trained model
    test_text = "I like securities."
    doc = nlp(test_text)
    print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])
    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)
        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc])


main(model='model1', output_dir='model2', n_iter=74)

I get the following error:

IndexError: list index out of range

Can someone explain what exactly the problem is here and how I can fix it? Also, how do I generate the "heads" values for my training data?

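The heads information is needed to identify the direct "parent" of each token in the dependency tree. For example, in

    "I like London and Berlin.",
    {
        "heads": [1, 1, 1, 2, 2, 1],
        "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
    },

the head of the word I is at index 1, i.e. the word like, and I is attached to it by the nsubj dependency.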
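To make the indexing concrete, here is a minimal sketch in plain Python (no spaCy required) that pairs every token with its label and its head. The token list is written out by hand and the helper name `describe_arcs` is hypothetical; in real training data spaCy's tokenizer decides the token boundaries.

```python
def describe_arcs(tokens, heads, deps):
    """Return (token, dep, head_token) triples for one annotated sentence."""
    # Every token needs exactly one head index and one dependency label.
    if not (len(tokens) == len(heads) == len(deps)):
        raise ValueError(
            "expected %d heads and %d deps, got %d heads and %d deps"
            % (len(tokens), len(tokens), len(heads), len(deps))
        )
    arcs = []
    for i, (head, dep) in enumerate(zip(heads, deps)):
        # heads[i] is the index of the parent of token i.
        if not 0 <= head < len(tokens):
            raise ValueError("head index %d is out of range for token %d" % (head, i))
        arcs.append((tokens[i], dep, tokens[head]))
    return arcs


# Hand-written tokenization of the example sentence (an assumption of this sketch).
tokens = ["I", "like", "London", "and", "Berlin", "."]
heads = [1, 1, 1, 2, 2, 1]
deps = ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"]

for token, dep, head in describe_arcs(tokens, heads, deps):
    print("%s --%s--> %s" % (token, dep, head))
```

Note that the ROOT token (like) points to itself. An annotation that has "deps" but no "heads", like the one in the question, fails the length check in this sketch because the labels have no tree to attach to.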

For more on this terminology, see the spaCy documentation: https://spacy.io/usage/linguistic-features#navigating
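As for generating the heads for your own sentences, one possible approach (an assumption here, not something the spaCy example prescribes) is to parse the text with a pretrained pipeline, read each token's head index back out, and then correct the result by hand. The sketch below assumes the `en_core_web_sm` model package is installed; `annotations_from_doc` and `bootstrap_example` are hypothetical helper names.

```python
def annotations_from_doc(doc):
    """Build a {"heads": ..., "deps": ...} dict from an already parsed Doc."""
    return {
        "heads": [token.head.i for token in doc],  # index of each token's parent
        "deps": [token.dep_ for token in doc],     # label of the arc to that parent
    }


def bootstrap_example(text, model_name="en_core_web_sm"):
    """Parse text with a pretrained pipeline and return a (text, annotations) tuple."""
    import spacy  # imported lazily so annotations_from_doc works without spaCy

    nlp = spacy.load(model_name)  # assumption: this model package is installed
    return (text, annotations_from_doc(nlp(text)))
```

Calling `bootstrap_example("Mr Manjunath who is in-charge of the motor at their Goa location.")` returns a tuple in the same shape as the entries of TRAIN_DATA. The predicted parse is only a starting point: review the heads and labels before training on them, since any mistakes the pretrained parser makes would otherwise be learned by the new model.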
