NLTK: converting a subtree to a Python list when chunking an RSS feed

Using the code below, I am chunking an RSS feed that has already been tagged and tokenized. "print subtree.leaves()" outputs:

[('Prime', 'NNP'), ('Minister', 'NNP'), ('Stephen', 'NNP'), ('Harper', 'NNP')]
[('U.S.', 'NNP'), ('President', 'NNP'), ('Barack', 'NNP'), ('Obama', 'NNP')]
[('What', 'NNP')]
[('Keystone', 'NNP'), ('XL', 'NNP')]
[('CBC', 'NNP'), ('News', 'NNP')]

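Each line looks like a Python list, but I don't know how to access it directly or iterate over it. I think it is the output of a subtree.

I want to convert each subtree into a list that I can manipulate. This is the first time I have run into trees in Python and I am lost. I want to end up with this list:

docs = ["Prime Minister Stephen Harper", "U.S. President Barack Obama", "What", "Keystone XL", "CBC News"]

Is there a simple way to achieve this?

Thanks, as always, for the help!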

import nltk

# posDocuments holds the already tagged and tokenized feed text
grammar = r""" Proper: {<NNP>+} """
cp = nltk.RegexpParser(grammar)
result = cp.parse(posDocuments)
nounPhraseDocs.append(result)
for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    # print the noun phrase as a list of part-of-speech tagged words
    print(subtree.leaves())
print(" ")

node has now been replaced by label(). So, modifying Viktor's answer:

docs = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))

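This gives you a list containing only the tokens that are part of a Proper chunk. If you drop the filter argument from the subtrees() method, you get one entry for every subtree instead, each holding all the tokens under that particular parent in the tree.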
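To make the difference concrete, here is a minimal sketch assuming NLTK 3 is installed. The tagged sentence and the helper name phrases are made up for illustration; they are not from the question.

```python
from nltk import RegexpParser

# Hand-tagged tokens, invented for this example (the asker's posDocuments is not shown).
tagged = [
    ("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP"),
    ("met", "VBD"),
    ("Barack", "NNP"), ("Obama", "NNP"),
]

# Same chunk grammar as in the question: one or more NNP tokens form a Proper chunk.
result = RegexpParser(r"Proper: {<NNP>+}").parse(tagged)


def phrases(tree, label=None):
    """Join the words of each subtree; keep only subtrees with `label` if one is given."""
    keep = (lambda t: t.label() == label) if label else None
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=keep)
    ]


filtered = phrases(result, "Proper")   # only the Proper chunks
unfiltered = phrases(result)           # every subtree, including the root

print(filtered)
print(unfiltered)
```

Without the filter, the root of the tree is visited too, so the first entry is the whole sentence joined together, followed by the individual chunks.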

docs = []
# t.node is the attribute from NLTK versions before 3.0; NLTK 3 uses t.label()
for subtree in result.subtrees(filter=lambda t: t.node == 'Proper'):
    docs.append(" ".join([a for (a,b) in subtree.leaves()]))
print(docs)

That should do the trick.
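On the "how do I access it" part of the question: what leaves() returns is an ordinary Python list of (word, tag) tuples, so normal indexing, unpacking and iteration all work. A small sketch with plain lists standing in for the leaves() output (no NLTK needed; leaves_to_phrase is a hypothetical helper name):

```python
# Stand-in for what subtree.leaves() returns for three of the chunks in the question:
# each chunk is a plain list of (word, tag) tuples.
leaves_per_chunk = [
    [("Prime", "NNP"), ("Minister", "NNP"), ("Stephen", "NNP"), ("Harper", "NNP")],
    [("Keystone", "NNP"), ("XL", "NNP")],
    [("CBC", "NNP"), ("News", "NNP")],
]


def leaves_to_phrase(leaves):
    """Drop the tags and join the words of one chunk into a single string."""
    return " ".join(word for word, _tag in leaves)


# Indexing and tuple unpacking work as on any list of tuples.
first_word, first_tag = leaves_per_chunk[0][0]

docs = [leaves_to_phrase(leaves) for leaves in leaves_per_chunk]
print(docs)
```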
