Improving the recall of a custom named entity recognizer (NER) in spaCy



This is the second part of another question I posted. The two are different enough to be separate questions, but they may be related.

Previous question: Building a custom named entity recognizer with spaCy, using random text as a sample

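I have built a custom named entity recognizer (NER) using the method described in the previous question. From there, I just copied the method for building the NER from the spaCy website (under "Named Entity Recognizer" at https://spacy.io/usage/training#ner).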

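The custom NER works, sort of. If I sentence-tokenize the text and lemmatize the words (so that "strawberries" becomes "strawberry"), it can pick up an entity. However, it stops there. It sometimes picks up two entities, but only rarely.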

Is there anything I can do to improve its accuracy?

Here is the code (I have TRAIN_DATA in this format, but for food items):

TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

The data is in the object train_food.

import random
import warnings

import nltk
import spacy
from spacy.util import compounding, minibatch

# Note: this is the spaCy v2 training API (create_pipe, nlp.update(texts, annotations))
nlp = spacy.blank("en")
# Create the built-in NER pipeline component and add it to the pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")

# Register every entity label that appears in the food training data
for _, annotations in train_food:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
model = "en"
n_iter = 20
# only train NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
    # show warnings for misaligned entity spans once
    warnings.filterwarnings("once", category=UserWarning, module="spacy")
    # reset and initialize the weights randomly – but only if we're
    # training a new model
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_food)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_food, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)

text = "mike went to the supermarket today. he went and bought a potatoes, carrots, towels, garlic, soap, perfume, a fridge, a tomato, tomatoes and tuna."

After that, using text as the example, I ran the following code:

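from nltk.stem import WordNetLemmatizer

# Needs the NLTK "punkt" and "wordnet" resources (see nltk.download)
lemmatizer = WordNetLemmatizer()

def text_processor(text):
    # Lowercase, tokenize and lemmatize each token, then re-join with spaces
    text = text.lower()
    token = nltk.word_tokenize(text)
    ls = []
    for x in token:
        p = lemmatizer.lemmatize(x)
        ls.append(f"{p}")
    new_text = " ".join(map(str, ls))
    return new_text

def ner(text):
    # Note: this function reuses the name of the `ner` pipe created above
    new_text = text_processor(text)
    tokenizer = nltk.PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(new_text)
    # Run the trained pipeline on each sentence and print its entities
    for sentence in sentences:
        doc = nlp(sentence)
        for ent in doc.ents:
            print(ent.text, ent.label_)

ner(text)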

This results in:

potato FOOD
carrot FOOD

Running the following code

ner("mike went to the supermarket today. he went and bought garlic and tuna")

results in:

garlic FOOD

Ideally, I would like the NER to pick up potato, carrot and garlic. Is there anything I can do?

Thanks

Kah

When training the model, you can try some information retrieval techniques, such as:

1- Lowercase all words

2- Replace words with their synonyms

3- Remove stop words

4- Rewrite sentences (this can be done automatically with back-translation, i.e. translating into Arabic and then back into English)
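As a rough illustration of point 1, here is a minimal sketch that adds a lowercased copy of each training example, so the training text looks more like the lowercased text passed in at prediction time. The helper name add_lowercased_variants is hypothetical (it is not part of spaCy), and the sketch assumes that lowercasing does not change the text length, so the existing character offsets stay valid. Points 2 to 4 change the text length and would need the offsets to be recomputed, which this sketch does not attempt.

```python
def add_lowercased_variants(train_data):
    """Return train_data plus a lowercased copy of each example.

    Hypothetical helper, not a spaCy function. Entity offsets are reused
    unchanged, so a copy is only added when lowercasing keeps the text
    length the same (a simple guard, assumed sufficient here).
    """
    augmented = list(train_data)
    seen = {text for text, _ in train_data}
    for text, annotations in train_data:
        lowered = text.lower()
        if lowered in seen or len(lowered) != len(text):
            continue  # already present, or offsets might no longer line up
        seen.add(lowered)
        augmented.append((lowered, {"entities": list(annotations["entities"])}))
    return augmented


# Same format as the TRAIN_DATA example in the question
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]

augmented = add_lowercased_variants(TRAIN_DATA)
for text, annotations in augmented:
    for start, end, label in annotations["entities"]:
        # Print the span each offset pair points at, to check alignment
        print(text[start:end], label)
```

The augmented list can then be used in place of the original training data in the training loop.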

Also, consider using better models, such as:

http://nlp.stanford.edu:8080/corenlp

https://huggingface.co/models
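To tell whether any of these changes actually helps, it may be worth measuring recall on a few examples that were held out of training. The following is only a sketch: entity_recall is a hypothetical helper, the gold list is an illustrative assumption built from the three foods named in the question, and the predicted list copies the output of the first run shown there. It compares (text, label) pairs as sets, so it ignores duplicates and character positions.

```python
def entity_recall(gold, predicted):
    """Fraction of gold (text, label) pairs that also appear in predicted.

    Hypothetical helper for illustration; pairs are compared as sets.
    """
    gold_set = set(gold)
    if not gold_set:
        raise ValueError("recall is undefined without gold entities")
    found = gold_set & set(predicted)
    return len(found) / len(gold_set)


# Illustrative gold annotations (assumption): the foods the question wants found
gold = [("potato", "FOOD"), ("carrot", "FOOD"), ("garlic", "FOOD")]
# Entities printed by the first run in the question
predicted = [("potato", "FOOD"), ("carrot", "FOOD")]

recall = entity_recall(gold, predicted)
print(f"recall: {recall:.2f}")
```

Tracking this number before and after each change gives a more reliable signal than checking a single sentence by eye.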
