I'm new to NLP and have started learning how to train a custom NER model in spaCy.
TRAIN_DATA = [
('what is the price of polo?', {'entities': [(21, 25, 'Product')]}),
('what is the price of ball?', {'entities': [(21, 25, 'Product')]}),
('what is the price of jegging?', {'entities': [(21, 28, 'Product')]}),
('what is the price of t-shirt?', {'entities': [(21, 28, 'Product')]}),
('what is the price of jeans?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bat?', {'entities': [(21, 24, 'Product')]}),
('what is the price of shirt?', {'entities': [(21, 26, 'Product')]}),
('what is the price of bag?', {'entities': [(21, 24, 'Product')]}),
('what is the price of cup?', {'entities': [(21, 24, 'Product')]}),
('what is the price of jug?', {'entities': [(21, 24, 'Product')]}),
('what is the price of plate?', {'entities': [(21, 26, 'Product')]}),
('what is the price of glass?', {'entities': [(21, 26, 'Product')]}),
('what is the price of moniter?', {'entities': [(21, 28, 'Product')]}),
('what is the price of desktop?', {'entities': [(21, 28, 'Product')]}),
('what is the price of bottle?', {'entities': [(21, 27, 'Product')]}),
('what is the price of mouse?', {'entities': [(21, 26, 'Product')]}),
('what is the price of keyboad?', {'entities': [(21, 28, 'Product')]}),
('what is the price of chair?', {'entities': [(21, 26, 'Product')]}),
('what is the price of table?', {'entities': [(21, 26, 'Product')]}),
('what is the price of watch?', {'entities': [(21, 26, 'Product')]})
]
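Each annotation above is a (start, end, label) triple of character offsets into the text, so a quick check that the offsets slice out the intended words can catch off-by-one mistakes before training. A minimal sketch in plain Python, no spaCy needed; `show_spans` is a hypothetical helper name, and the two sample rows are copied from the list above.

```python
def show_spans(data):
    """Return (span_text, label) pairs by slicing each text with its offsets."""
    spans = []
    for text, annotations in data:
        for start, end, label in annotations.get("entities", []):
            spans.append((text[start:end], label))
    return spans


sample = [
    ("what is the price of polo?", {"entities": [(21, 25, "Product")]}),
    ("what is the price of t-shirt?", {"entities": [(21, 28, "Product")]}),
]
print(show_spans(sample))
```

If a printed span is cut short or includes stray characters, the offsets for that row need adjusting.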
First, I train a blank spaCy model:
import random
import spacy

# Note: this uses the spaCy v2 training API (create_pipe, begin_training,
# nlp.update with raw texts and annotation dicts).
def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)  # accumulated loss for this iteration
    return nlp

start_training = train_spacy(TRAIN_DATA, 20)
Then I save the trained spaCy model:
# Saving the trained model
start_training.to_disk("spacy_start_model")
My question is: how do I update the saved model with new training data? The new training data:
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]
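Can anyone help me with this and give me some tips? Thanks in advance!
As far as I know, you can retrain your model with the new examples, but this time you start from your existing model instead of a blank one.
To do that, first remove the following line from your train_spacy method and have the method receive the model as a parameter instead: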
nlp = spacy.blank('en') # create blank Language class
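Then, to retrain your model, instead of creating a blank spaCy model and passing it to your training method, load your existing model with the load method and call your training method with it (the spaCy documentation on saving and loading has more details).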
start_training = spacy.load("spacy_start_model")
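One last tip: in my experience, retraining the spaCy NER model from an existing model (such as en_core_web_md or en_core_web_lg) and adding my custom entities gave me better results than training from scratch with a blank spaCy model.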
Putting it all together:
- Updated method
import random
import spacy

def train_spacy(data, iterations, nlp):  # <-- Add model as nlp parameter
    TRAIN_DATA = data
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
        is_new_model = True
    else:
        ner = nlp.get_pipe('ner')
        is_new_model = False
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        # begin_training() re-initializes the weights, which would discard what
        # a loaded model has already learned; for an existing model use
        # resume_training() instead (spaCy v2.1+, as far as I know)
        if is_new_model:
            optimizer = nlp.begin_training()
        else:
            optimizer = nlp.resume_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],  # batch of texts
                    [annotations],  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)
            print(losses)  # accumulated loss for this iteration
    return nlp

nlp = spacy.blank('en')  # create blank Language class
start_training = train_spacy(TRAIN_DATA, 20, nlp)
- Retrain your model
TRAIN_DATA_2 = [('Who is Chaka Khan?', {"entities": [(7, 17, 'PERSON')]}),
('I like London and Berlin.', {"entities": [(7, 13, 'LOC')]})]
nlp = spacy.load("spacy_start_model") # <-- Now your base model is your custom model
start_training = train_spacy(TRAIN_DATA_2, 20, nlp)
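Because train_spacy registers every label it finds in the data through ner.add_label, it can be useful to see which labels a retraining round introduces on top of the first one. A small sketch in plain Python under that assumption; `collect_labels` is a hypothetical helper, not part of spaCy, and `first_round` is a one-row stand-in for the original TRAIN_DATA.

```python
def collect_labels(data):
    """Return the set of entity labels used in spaCy-style training data."""
    return {
        label
        for _, annotations in data
        for _start, _end, label in annotations.get("entities", [])
    }


first_round = [("what is the price of polo?", {"entities": [(21, 25, "Product")]})]
second_round = [
    ("Who is Chaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC")]}),
]

# Labels that the saved model has not seen before the second round
new_labels = collect_labels(second_round) - collect_labels(first_round)
print(sorted(new_labels))
```

Here the second round adds PERSON and LOC, neither of which the first model was trained on.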
I hope this helps!