spaCy lemmatization of a single word

I am trying to get the lemmatized version of a single word. Is there a way to do this using "spacy" (the fantastic Python NLP library)?

Below is the code I tried, but it does not work:

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lemmatizer = Lemmatizer(lookups)
word = "ducks"
lemmas = lemmatizer.lookup(word)
print(lemmas)

I was hoping that the word "ducks" (plural) would yield "duck" (singular). Unfortunately, "ducks" (plural) is returned.

Is there a way to do this?

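Note: I realize I could process a whole string of words as a document (nlp(document)), then find the required token and get its lemma (token.lemma_), but the words I need to lemmatize are somewhat dynamic and cannot be processed as one large document.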

If you want to lemmatize a single token, try TextBlob, a simplified text-processing library:

from textblob import TextBlob, Word
# Lemmatize a word
w = Word('ducks')
w.lemmatize()

Output

> duck

Or NLTK (note that SnowballStemmer is a stemmer, so it returns a stem rather than a dictionary lemma):

from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
stemmer.stem('ducks')

Output

> duck

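Otherwise, you can keep using spaCy, but with the parser and NER pipeline components disabled:

First, download the small 12M model (an English multi-task CNN trained on OntoNotes):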

$ python -m spacy download en_core_web_sm

Python code

import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']) # just keep tagger for lemmatization
" ".join([token.lemma_ for token in nlp('ducks')])

Output

> duck

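I think you are missing the part where the spaCy database is used as the reference for lemmatization. See the modifications I made to your code below, along with the output. duck is the proper lemma_ for ducks.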

import spacy
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
lookups = Lookups()
lemmatizer = Lemmatizer(lookups)
word = "ducks"
#load spacy core database
nlp = spacy.load('en_core_web_sm')
#run NLP on input/doc
doc = nlp(word)
#Print formatted token attributes
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text, coarse and fine-grained POS tags, dependency label, and lemma
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))

Output

Token Attributes: 
token.text, token.pos_, token.tag_, token.dep_, token.lemma_
ducks       NOUN        NNS         ROOT        duck
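
A likely reason the original attempt returned "ducks" unchanged is that the Lookups object was created empty, so a table lookup has nothing to match and falls back to the input string. The sketch below only mimics that fallback with a plain dict; lookup_lemma and both tables are hypothetical stand-ins for illustration, not spaCy's implementation or data.

```python
# Toy stand-in for a lookup-table lemmatizer (an assumption, not spaCy's code).
def lookup_lemma(table, word):
    """Return the table entry for `word`, or `word` itself if it is missing."""
    return table.get(word, word)

empty_table = {}                   # analogous to a freshly created, unpopulated table
filled_table = {"ducks": "duck"}   # hypothetical entry, as a loaded model's data might supply

print(lookup_lemma(empty_table, "ducks"))   # ducks
print(lookup_lemma(filled_table, "ducks"))  # duck
```

With an empty table every word maps to itself, which matches the behaviour described in the question.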

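Lemmatization depends crucially on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma.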

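In the sentence "This is confusing", confusing is analyzed as an adjective and is therefore lemmatized to confusing. In contrast, in the sentence "I was confusing you with someone else", confusing is analyzed as a verb and is lemmatized to confuse.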
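
To make the dependence on part of speech concrete, here is a minimal sketch of a lemmatizer keyed on both the word and its tag. LEMMA_TABLE and lemmatize_with_pos are hypothetical names with hand-written entries; a real lemmatizer would use a tagger and much larger resources.

```python
# Hypothetical (word, part-of-speech) -> lemma table, hand-written for illustration.
LEMMA_TABLE = {
    ("confusing", "ADJ"): "confusing",
    ("confusing", "VERB"): "confuse",
    ("ducks", "NOUN"): "duck",
    ("ducks", "VERB"): "duck",
}

def lemmatize_with_pos(word, pos):
    """Look up the lemma for a word given its POS tag; fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(lemmatize_with_pos("confusing", "ADJ"))   # confusing
print(lemmatize_with_pos("confusing", "VERB"))  # confuse
```

The same surface form yields different lemmas depending on the tag, which is why a single word without context can be ambiguous.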

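If you want tokens with different parts of speech to be mapped to the same lemma, you can use a stemming algorithm such as Porter Stemming (Java), which you can simply call on each token.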
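
As a rough illustration of why stemming ignores part of speech, the sketch below strips a few common suffixes. It is a deliberately naive suffix stripper, not the Porter algorithm; naive_stem, SUFFIXES, and min_stem_len are names assumed for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; illustrative only

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ("ducks", "confusing", "confused", "confuses"):
    print(w, "->", naive_stem(w))
```

Because no tagging is involved, the adjective and verb readings of confusing collapse to the same stem, at the cost of stems that are not always dictionary words.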

With NLTK, simply:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('ducks')
'duck'

You can lemmatize a single word with spaCy as follows:

import spacy
nlp = spacy.load("en_core_web_lg")
lemmatizer = nlp.get_pipe("lemmatizer")
my_word = "lemmatizing"
lemmatizer.lemmatize(nlp(my_word)[0]) # this method accepts only token object

This outputs all possible lemmas:

['lemmatize', 'lemmatiz']
