How do I extract whole entity names from a Hugging Face model without IOB tags?

I am using a Hugging Face model, specifically Davlan/distilbert-base-multilingual-cased-ner-hrl. However, I am unable to extract whole entity names from the results.

If I run the following code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Johnathan Smith and I work at Apple"
ner_results = nlp(example, aggregation_strategy="max")
print(ner_results)

then I get this output:

[{'entity': 'B-PER', 'score': 0.9998949, 'index': 4, 'word': 'Johna', 'start': 11, 'end': 16}, {'entity': 'I-PER', 'score': 0.999726, 'index': 5, 'word': '##tha', 'start': 16, 'end': 19}, {'entity': 'I-PER', 'score': 0.9997751, 'index': 6, 'word': '##n', 'start': 19, 'end': 20}, {'entity': 'I-PER', 'score': 0.99974835, 'index': 7, 'word': 'Smith', 'start': 21, 'end': 26}, {'entity': 'B-ORG', 'score': 0.99870986, 'index': 12, 'word': 'Apple', 'start': 41, 'end': 46}]

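It looks like I could post-process this so that Johnathan Smith comes out as a single entity. Ideally, though, I would like this to be done for me, with no partial words being identified.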

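There is a bug in the code: the aggregation strategy is passed in the wrong place. It should be given to pipeline() rather than to the nlp(example) call: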

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")
example = "My name is Johnathan Smith and I work at Apple"
ner_results = nlp(example)
print(ner_results)

This gives:

[{'entity_group': 'PER', 'score': 0.99982166, 'word': 'Johnathan Smith', 'start': 11, 'end': 26}, {'entity_group': 'ORG', 'score': 0.99870986, 'word': 'Apple', 'start': 41, 'end': 46}]
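
If you do end up with the raw token-level output and want to post-process it yourself, as the question suggests, something along these lines might work. This is only a minimal sketch, not what the pipeline does internally: merge_entities is a hypothetical helper, the rule that tokens belong together when they are at most one character apart is an assumption, and the score is a plain mean of the token scores, so it will not match the score reported by aggregation_strategy="max".

```python
def merge_entities(tokens, text):
    """Merge B-/I- tagged word pieces into whole entities via character offsets.

    Hypothetical helper: `tokens` is the raw list of dicts printed in the
    question, `text` is the original input string.
    """
    spans = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = spans[-1] if spans else None
        # Assumption: an I- token continues the previous span only if the label
        # matches and it starts at most one character after that span ends.
        if (
            last is not None
            and prefix == "I"
            and last["entity_group"] == label
            and tok["start"] - last["end"] <= 1
        ):
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            spans.append({
                "entity_group": label,
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
            })
    return [
        {
            "entity_group": s["entity_group"],
            # Simple mean of token scores, not the pipeline's own aggregation.
            "score": sum(s["scores"]) / len(s["scores"]),
            # Slice the original text so '##' word-piece markers never appear.
            "word": text[s["start"]:s["end"]],
            "start": s["start"],
            "end": s["end"],
        }
        for s in spans
    ]


example = "My name is Johnathan Smith and I work at Apple"
raw_tokens = [
    {"entity": "B-PER", "score": 0.9998949, "index": 4, "word": "Johna", "start": 11, "end": 16},
    {"entity": "I-PER", "score": 0.999726, "index": 5, "word": "##tha", "start": 16, "end": 19},
    {"entity": "I-PER", "score": 0.9997751, "index": 6, "word": "##n", "start": 19, "end": 20},
    {"entity": "I-PER", "score": 0.99974835, "index": 7, "word": "Smith", "start": 21, "end": 26},
    {"entity": "B-ORG", "score": 0.99870986, "index": 12, "word": "Apple", "start": 41, "end": 46},
]
merged = merge_entities(raw_tokens, example)
print([entity["word"] for entity in merged])  # ['Johnathan Smith', 'Apple']
```

Taking the word from the original string by offset, rather than joining the 'word' fields, avoids having to strip the '##' prefixes by hand. Passing aggregation_strategy to pipeline() is still the simpler route when it is available.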
