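How to predict entities for multiple sentences using spaCy

I have trained an NER model with spaCy. I know how to use it to recognize the entities of a single sentence (a Doc object) and visualize the results: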

doc = disease_blank('Example sentence')    
spacy.displacy.render(doc, style="ent", jupyter=True)

or

for ent in doc.ents:
    print(ent.text, ent.label_)

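Now I want to predict the entities for multiple such sentences. My idea is to filter the sentences by their entities. So far, the only way I have found is the following: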

sentences = ['sentence 1', 'sentence2', 'sentence3']
for element in sentences:
    doc = nlp(element)
    for ent in doc.ents:
        if ent.label_ == "LOC":
            print(doc)
# prints all sentences which have the entity "LOC"

My question is whether there is a better and more efficient way to do this.

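You have two options to speed up your current implementation:

Use the tips provided by the spaCy developers here. Without knowing which specific components your custom NER model pipeline has, the refactored code would look like this: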

import multiprocessing
import spacy

# Leave two cores free for the rest of the system, but always use at least one.
cpu_cores = max(1, multiprocessing.cpu_count() - 2)
nlp = spacy.load("./path/to/your/own/model")
sentences = ['sentence 1', 'sentence2', 'sentence3']
# Optionally pass disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]
# if your model has them and the ner component does not depend on them. Check with `nlp.pipe_names`.
# When run as a script on platforms that spawn worker processes (e.g. Windows),
# put this loop under `if __name__ == "__main__":`.
for doc in nlp.pipe(sentences, n_process=cpu_cores):
    # prints each sentence that has at least one "LOC" entity
    if any(ent.label_ == "LOC" for ent in doc.ents):
        print(doc)
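
The filtering step in the loop above can also be wrapped in a small reusable helper. The following is only a minimal sketch, not part of spaCy: `docs_with_label` is a hypothetical name, and it assumes nothing more than that each doc exposes `.ents` whose items have a `.label_` attribute, as spaCy `Doc` and `Span` objects do. The stand-in objects at the bottom are made up so the snippet runs without a trained model.

```python
from types import SimpleNamespace


def docs_with_label(docs, label):
    """Yield each doc that has at least one entity with the given label."""
    for doc in docs:
        # any() stops at the first match, so each doc is yielded at most once
        if any(ent.label_ == label for ent in doc.ents):
            yield doc


# Stand-ins for spaCy Doc objects (hypothetical data, for illustration only).
fake_docs = [
    SimpleNamespace(text="sentence 1", ents=[SimpleNamespace(label_="LOC")]),
    SimpleNamespace(text="sentence2", ents=[SimpleNamespace(label_="PER")]),
    SimpleNamespace(
        text="sentence3",
        ents=[SimpleNamespace(label_="LOC"), SimpleNamespace(label_="LOC")],
    ),
]

matching = [doc.text for doc in docs_with_label(fake_docs, "LOC")]
print(matching)
```

With a real pipeline, the same helper would presumably be called as `docs_with_label(nlp.pipe(sentences), "LOC")`. Because it is a generator, docs are filtered as they stream out of `nlp.pipe` rather than after all of them have been collected.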

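Building on the previous option, use a spaCy custom component (as carefully explained here). With this option, the refactored/improved code would look like this: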

import multiprocessing
import spacy
from spacy.language import Language

# Leave two cores free for the rest of the system, but always use at least one.
cpu_cores = max(1, multiprocessing.cpu_count() - 2)

@Language.component("loc_label_filter")
def custom_component_function(doc):
    # keep only the "LOC" entities on the doc
    doc.ents = [ent for ent in doc.ents if ent.label_ == "LOC"]
    return doc

nlp = spacy.load("./path/to/your/own/model")
nlp.add_pipe("loc_label_filter", after="ner")
sentences = ['sentence 1', 'sentence2', 'sentence3']
for doc in nlp.pipe(sentences, n_process=cpu_cores):
    # after filtering, any remaining entity is a "LOC", so print the sentence once
    if doc.ents:
        print(doc)

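Important notes:

Note that the speed-up will be noticeable if your sentences variable contains hundreds or thousands of samples; if sentences is "small" (i.e., it contains only a hundred sentences or fewer), you (and your timing benchmarks) may not notice much of a difference.
Also note that the batch_size parameter of nlp.pipe can be fine-tuned as well, but in my own experience you only want to do that if you still see no noticeable difference after applying the previous tips.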
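
To check whether these settings make a difference on your own data, you could time `nlp.pipe` under a few configurations. This is a minimal sketch, assuming only that the pipeline object has a `pipe(texts, batch_size=..., n_process=...)` method, as spaCy's `Language.pipe` does. `time_pipe` and `EchoPipeline` are hypothetical names, and `EchoPipeline` is a dummy that does no real work, so the numbers it prints say nothing about a real model.

```python
import time


def time_pipe(nlp, texts, batch_size, n_process=1):
    """Return the seconds taken to run all texts through nlp.pipe."""
    start = time.perf_counter()
    # nlp.pipe is lazy, so consume the generator to force processing
    for _ in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        pass
    return time.perf_counter() - start


class EchoPipeline:
    """Dummy stand-in for a loaded spaCy pipeline (illustration only)."""

    def pipe(self, texts, batch_size=None, n_process=1):
        for text in texts:
            yield text


texts = ["sentence 1", "sentence2", "sentence3"] * 100
# Time the same input under a few candidate batch sizes.
timings = {size: time_pipe(EchoPipeline(), texts, size) for size in (32, 128, 512)}
for size, seconds in timings.items():
    print(f"batch_size={size}: {seconds:.4f}s")
```

Replacing `EchoPipeline()` with your loaded `nlp` object, and `texts` with a realistic sample of your sentences, should show whether tuning `batch_size` or `n_process` is worth the effort for your workload.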
