Converting averaged perceptron tagger POS tags to WordNet POS and avoiding a tuple error



I have code that runs NLTK's averaged perceptron tagger:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

string = 'dogs runs fast'
tokens = word_tokenize(string)   # split the string into word tokens
tokensPOS = pos_tag(tokens)      # tag each token with a Penn Treebank POS tag
print(tokensPOS)

Result:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
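
These are Penn Treebank tags; NLTK can print the definition of any of them (this assumes the 'tagsets' resource is available, e.g. via nltk.download('tagsets')):

import nltk
nltk.help.upenn_tagset('NNS')  # noun, common, plural
nltk.help.upenn_tagset('VBZ')  # verb, present tense, 3rd person singular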

I tried looping over each tagged token and lemmatizing it with the WordNet lemmatizer:

lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w))
print(lemmatizedWords)

Resulting error:

Traceback (most recent call last):
  File "<ipython-input-30-462d7c3bdbb7>", line 15, in <module>
    lemmatizedWords = WordNetLemmatizer().lemmatize(w)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\stem\wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1692, in apply_rules
    for form in forms
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1694, in <listcomp>
    if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
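
Stripped of the loop, the failure reduces to passing a (word, tag) tuple where lemmatize() expects a string; a minimal sketch reproducing the same error:

from nltk.stem import WordNetLemmatizer
WordNetLemmatizer().lemmatize(('dogs', 'NNS'))  # AttributeError: 'tuple' object has no attribute 'endswith'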

I think I have two problems here:

  1. The POS tags are not converted into tags that WordNet can understand (I tried implementing something like this answer, WordNet lemmatization and POS tagging in Python, with no success)
  2. The data structure is not formed correctly for looping through each tuple (I can't find much on this error besides os-related code)

How do I follow up POS tagging with lemmatization so as to avoid these errors?

The Python interpreter is telling you clearly:

AttributeError: 'tuple' object has no attribute 'endswith'

tokensPOS is a list of tuples, so you cannot pass its elements directly to the lemmatize() method (see the code of the WordNetLemmatizer class here). Only string objects have an endswith() method, so you need to pass the first element of each tuple from tokensPOS, like this:

lemmatizedWords = []
for w in tokensPOS:
    # w is a (token, tag) tuple; pass only the token string
    lemmatizedWords.append(WordNetLemmatizer().lemmatize(w[0]))
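
That call no longer crashes, but without a second argument lemmatize() treats every word as a noun. A quick sketch of the default behavior, using the classic 'are'/'be' example:

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('are'))       # 'are' -- default POS is noun
print(lemmatizer.lemmatize('are', 'v'))  # 'be'  -- lemmatized as a verb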

Indeed, the lemmatize() method uses wordnet.NOUN as the default POS. Unfortunately, WordNet uses different tags than the other NLTK corpora, so you have to translate them manually (as in the answer you linked) and pass the appropriate tag as the second argument to lemmatize(). Here is the full script, using the get_wordnet_pos() method from that answer:

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun (lemmatize()'s own default) so that
        # unmapped tags don't produce an invalid POS argument
        return wordnet.NOUN

string = 'dogs runs fast'
tokens = word_tokenize(string)
tokensPOS = pos_tag(tokens)
print(tokensPOS)

lemmatizer = WordNetLemmatizer()  # instantiate once, not per token
lemmatizedWords = []
for w in tokensPOS:
    lemmatizedWords.append(lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])))
print(lemmatizedWords)
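
With the example sentence this should print something like:

[('dogs', 'NNS'), ('runs', 'VBZ'), ('fast', 'RB')]
['dog', 'run', 'fast']

since NNS maps to a WordNet noun, VBZ to a verb, and RB to an adverb. As a stylistic alternative, the loop can be collapsed into a comprehension: lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tokensPOS].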
