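I'm trying to use nltk to find noun phrases (NP) and verb phrases (VP) in some LaTeX files. My LaTeX files contain a lot of math. Since I'm new to nltk, I started by trying to get what I need from the terminal. For example, I tried this sentence:
Let the sizes be denoted by $s(n)$ and $t(n)$ respectively.
The code I tried: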
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Let the sizes be denoted by $s(n)$ and $t(n)$ respectively."
>>> sents = sent_tokenize(text)
>>> tokens = word_tokenize(text)
>>> tagged_tokens = pos_tag(tokens)
All of this works fine. But when I try the following:
>>> from nltk.chunk import *
>>> from nltk.chunk.util import *
>>> from nltk.chunk.regexp import *
>>> from nltk import Tree
>>> gold_chunked_text = tagstr2tree(tagged_tokens)
I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/nltk/chunk/util.py", line 331, in tagstr2tree
    for match in WORD_OR_BRACKET.finditer(s):
TypeError: expected string or buffer
Any idea what the problem is?
tagstr2tree() expects a string as input, but you passed it the output of pos_tag(), which is a list of (word, tag) tuples:
>>> from nltk import word_tokenize, pos_tag
>>> text = "Let the sizes be denoted by $s(n)$ and $t(n)$ respectively."
>>> tagged_text = pos_tag(word_tokenize(text))
>>> tagged_text
[('Let', 'NNP'), ('the', 'DT'), ('sizes', 'NNS'), ('be', 'VB'), ('denoted', 'VBN'), ('by', 'IN'), ('$', '$'), ('s', 'NNS'), ('(', 'CD'), ('n', 'NN'), (')', ':'), ('$', '$'), ('and', 'CC'), ('$', '$'), ('t', 'NN'), ('(', ':'), ('n', 'NN'), (')', ':'), ('$', '$'), ('respectively', 'RB'), ('.', '.')]
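As you can see, running pos_tag on the word_tokenize output may not give you what you need, because word_tokenize breaks each math expression into separate tokens. Splitting on whitespace may be a better way to tokenize here: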
>>> tagged_text = pos_tag(text.split())
>>> tagged_text
[('Let', 'NNP'), ('the', 'DT'), ('sizes', 'NNS'), ('be', 'VB'), ('denoted', 'VBN'), ('by', 'IN'), ('$s(n)$', 'NNP'), ('and', 'CC'), ('$t(n)$', 'NNP'), ('respectively.', 'NNP')]
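One caveat with text.split() is that the final period stays glued to the last word ('respectively.' above). A rough sketch of a middle ground follows. It assumes inline math is always written as $...$ with no escaped or nested dollar signs and no $$...$$ display math; the helper name tokenize_keep_math is made up for illustration and is not part of nltk.

```python
import re

# Hypothetical helper, not part of nltk.
# Assumption: inline math is delimited by single dollar signs, e.g. $s(n)$,
# with no escaped "\$" and no "$$...$$" display math.
TOKEN_PATTERN = re.compile(
    r"\$[^$]*\$"   # a whole inline-math span, kept as one token
    r"|\w+"        # a run of word characters
    r"|[^\w\s]"    # any single punctuation character
)

def tokenize_keep_math(text):
    """Split text into tokens, keeping each $...$ span as one token."""
    return TOKEN_PATTERN.findall(text)

text = "Let the sizes be denoted by $s(n)$ and $t(n)$ respectively."
tokens = tokenize_keep_math(text)
print(tokens)
```

The resulting list can be passed to pos_tag() in place of text.split(). How the tagger labels the math tokens is still up to the tagger, so check the tags before relying on them.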
Going back to tagstr2tree, the expected input looks like this:
'Let/NNP the/DT sizes/NNS be/VB denoted/VBN by/IN $s(n)$/NNP and/CC $t(n)$/NNP respectively./NNP'
To build that string from the tagged tokens, do the following:
>>> " ".join(["{}/{}".format(word,pos) for word, pos in tagged_text])
Here is the full script:
>>> from nltk.chunk.util import tagstr2tree
>>> from nltk import pos_tag
>>> text = "Let the sizes be denoted by $s(n)$ and $t(n)$ respectively."
>>> tagged_text = pos_tag(text.split())
>>> tagged_text_string = " ".join(["{}/{}".format(word,pos) for word, pos in tagged_text])
>>> tagstr2tree(tagged_text_string)
Tree('S', [('Let', 'NNP'), ('the', 'DT'), ('sizes', 'NNS'), ('be', 'VB'), ('denoted', 'VBN'), ('by', 'IN'), ('$s(n)$', 'NNP'), ('and', 'CC'), ('$t(n)$', 'NNP'), ('respectively.', 'NNP')])
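Note that the tree above is flat: the tagged string contains no bracketed chunks, so tagstr2tree has nothing to group. To actually pull out noun phrases, one option is a RegexpParser with a hand-written chunk grammar. Here is a minimal sketch, with the tags from the output above hard-coded so it runs without the tagger model. The NP pattern is only an illustrative assumption, not a tuned grammar, and a VP rule would have to be added in the same way.

```python
from nltk import RegexpParser

# Tagged tokens copied from the pos_tag(text.split()) output shown above.
tagged_text = [
    ("Let", "NNP"), ("the", "DT"), ("sizes", "NNS"), ("be", "VB"),
    ("denoted", "VBN"), ("by", "IN"), ("$s(n)$", "NNP"), ("and", "CC"),
    ("$t(n)$", "NNP"), ("respectively.", "NNP"),
]

# Illustrative grammar (an assumption, adjust to your data):
# an NP is an optional determiner, any number of adjectives,
# then one or more nouns of any kind (NN, NNS, NNP, ...).
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

chunker = RegexpParser(grammar)
tree = chunker.parse(tagged_text)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

Because the tagger labelled 'Let' and 'respectively.' as NNP, they will show up inside NP chunks too, so the quality of the chunks depends on getting the tags right first.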