NLP Named Entity Recognition using NLTK and spaCy

I ran NER with both NLTK and spaCy on the following sentence, and these are the results:

"Zoni I want to find a pencil, a eraser and a sharpener"

I ran the following code on Google Colab.

import nltk
# Download the tokenizer and POS tagger resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
ex = "Zoni I want to find a pencil, a eraser and a sharpener"
def preprocess(sent):
    # Tokenize the sentence, then tag each token with its part of speech
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent
sent = preprocess(ex)
sent
#Output:
[('Zoni', 'NNP'),
('I', 'PRP'),
('want', 'VBP'),
('to', 'TO'),
('find', 'VB'),
('a', 'DT'),
('pencil', 'NN'),
(',', ','),
('a', 'DT'),
('eraser', 'NN'),
('and', 'CC'),
('a', 'DT'),
('sharpener', 'NN')]

But when I used spaCy on the same text, it returned no results.

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
text = "Zoni I want to find a pencil, a eraser and a sharpener"
doc = nlp(text)
doc.ents
#Output:
()

It only works for some sentences.

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
# text = "Zoni I want to find a pencil, a eraser and a sharpener"
text = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'
doc = nlp(text)
doc.ents
#Output:
(European, Google, $5.1 billion, Wednesday)

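Please let me know if something is wrong.

spaCy models are statistical, so the named entities they recognize depend on the datasets they were trained on.

According to the spaCy documentation, a named entity is a "real-world object" that is assigned a name, for example a person, a country, a product or a book title.

For example, the name Zoni is not common, so the model does not recognize it as a named entity (PERSON). If I change the name Zoni to William in your sentence, spaCy recognizes William as a person.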

import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp('William I want to find a pencil, a eraser and a sharpener')
for entity in doc.ents:
    print(entity.label_, ' | ', entity.text)
#output
PERSON  |  William

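One would assume that pencil, eraser and sharpener are objects, so they might be classified as products, because the spaCy documentation states that "objects" are products. But that does not seem to be the case for the 3 objects in your sentence.

I also noticed that if no named entities are found in the input text, the output is empty.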

import spacy
nlp = spacy.load("en_core_web_lg")
doc = nlp('Zoni I want to find a pencil, a eraser and a sharpener')
if not doc.ents:
    print('No named entities were recognized in the input text.')
else:
    for entity in doc.ents:
        print(entity.label_, ' | ', entity.text)
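
If an uncommon name such as Zoni has to be recognized regardless of what the statistical model learned, one possible workaround is a rule-based pattern. The sketch below is not from the original answer. It assumes spaCy v3, where the built-in "entity_ruler" component is added by name, and it uses a blank English pipeline so that no trained model is needed. The PERSON label for Zoni is simply my choice for illustration.

```python
import spacy

# Assumption: spaCy v3. A blank pipeline only tokenizes, so every
# entity found below comes from the rule and not from a trained model.
nlp = spacy.blank("en")

# Add the rule-based entity component and register one exact-match pattern.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Zoni"}])

doc = nlp("Zoni I want to find a pencil, a eraser and a sharpener")

# Collect (label, text) pairs in the same order the answer prints them.
entities = [(ent.label_, ent.text) for ent in doc.ents]
for label, text in entities:
    print(label, ' | ', text)
```

With a trained pipeline such as en_core_web_lg, the same component can probably be placed ahead of the statistical recognizer with nlp.add_pipe("entity_ruler", before="ner"), so that the rule and the model both contribute entities.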

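I'm not sure I understand the comparison you are trying to make. In the first example with NLTK, you are looking at the POS tags in the sentence. In the second example with spaCy, however, you are looking at the named entities. These are two different things. A statistical model should always give you a POS tag for every token (although the tags may sometimes differ between models), but the recognition of named entities (as explained in the post by "Life is complex") depends on the datasets these models were trained on. If the model "feels" there are no named entities in the sentence, you get an empty result set. To make a fair comparison, you should also show the named entities that NLTK finds and compare against those.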
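
To show what NLTK itself finds, something like the following sketch could be used. It relies on nltk.ne_chunk, which groups POS-tagged tokens into labeled subtrees. The helper names (download_resources, entities_from_tree, nltk_named_entities) are made up for this example, and the resource names are an assumption based on the classic NLTK data packages; newer NLTK releases may ask for differently named resources.

```python
from nltk import Tree, download, ne_chunk, pos_tag, word_tokenize


def download_resources():
    # Assumed resource names for classic NLTK releases; newer releases
    # may require other names (for example "punkt_tab").
    for name in ("punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words"):
        download(name, quiet=True)


def entities_from_tree(tree):
    # ne_chunk returns a Tree whose children are either plain
    # (token, tag) tuples or subtrees labeled with an entity type.
    entities = []
    for node in tree:
        if isinstance(node, Tree):
            text = " ".join(token for token, _tag in node.leaves())
            entities.append((node.label(), text))
    return entities


def nltk_named_entities(text):
    # Tokenize, POS-tag, then chunk the tagged tokens into entities.
    tagged = pos_tag(word_tokenize(text))
    return entities_from_tree(ne_chunk(tagged))
```

After calling download_resources() once, nltk_named_entities(ex) returns a list of (label, text) pairs that can be put next to spaCy's doc.ents. An empty list means NLTK did not recognize any named entity either.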

If you want to compare POS tags instead, you can run the following with spaCy:

for token in doc:
    print(token.text, token.pos_, token.tag_)
