Converting averaged perceptron tagger POS tags to WordNet POS and avoiding a tuple error



I have code that runs NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'
tokens = word_tokenize(string)   # split the string into word tokens
tokensPOS = pos_tag(tokens)      # tag each token with a Penn Treebank POS tag
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
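
These are Penn Treebank tags; NLTK can print the definition of any of them (this assumes the 'tagsets' resource is available, e.g. via nltk.download('tagsets')):

import nltk
nltk.help.upenn_tagset('NNS')  # noun, common, plural
nltk.help.upenn_tagset('VBZ')  # verb, present tense, 3rd person singular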

I tried looping over each tagged token and lemmatizing it with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
print(lemmatizedWords)

Resulting error:

Traceback (most recent call last):
  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
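
Stripped of the loop, the failure reduces to passing a (word, tag) tuple where lemmatize() expects a string; a minimal sketch reproducing the same error:

from nltk.stem import WordNetLemmatizer
WordNetLemmatizer().lemmatize(('dogs', 'NNS'))  # AttributeError: 'tuple' object has no attribute 'endswith'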

I think I have two problems here:

  1. The POS tags are not converted into tags that WordNet can understand (I tried implementing something like this answer, WordNet lemmatization and POS tagging in Python, with no success)
  2. The data structure is not formed correctly for looping through each tuple (I can't find much on this error besides os-related code)

How do I follow up POS tagging with lemmatization so as to avoid these errors?

The Python interpreter is telling you clearly:

AttributeError: 'tuple' object has no attribute 'endswith'

tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the code of the WordNetLemmatizer class here). Only string objects have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    # w is a (token, tag) tuple; pass only the token string
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
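
That call no longer crashes, but without a second argument lemmatize() treats every word as a noun. A quick sketch of the default behavior, using the classic 'are'/'be' example:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('are'))       # 'are' -- default POS is noun
print(lemmatizer.lemmatize('are', 'v'))  # 'be'  -- lemmatized as a verb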

Indeed, the lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other NLTK corpora, so you have to translate them manually (as in the answer you linked) and pass the appropriate tag as the second argument to lemmatize(). Here is the full script, using the get_wordnet_pos() method from that answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun (lemmatize()'s own default) so that
        # unmapped tags don't produce an invalid POS argument
        return wordnet.NOUN

string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizer = WordNetLemmatizer()  # instantiate once, not per token
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])))
print(lemmatizedWords)
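
With the example sentence this should print something like:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
['dog', 'run', 'fast']

since NNS maps to a WordNet noun, VBZ to a verb, and RB to an adverb. As a stylistic alternative, the loop can be collapsed into a comprehension: lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tokensPOS].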
