Finding the head of a noun phrase in NLTK and Stanford parses according to the head-finding rules for NPs



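Generally, the head of a noun phrase is the rightmost noun in the NP. In the parse shown below, tree is the head of the parent NP.

                    ROOT
                      |
                      S
              ________|_________
             NP                |
       _______|_______         |
       |            PP        VP
       |          ___|___       __|___
      NP          |    NP    |   PRT
 ______|______    |     |    |    |
 DT  JJ  NN  NN   IN   NNP  VBD   RP
 |   |   |   |    |     |    |    |
The old oak tree from India fell down

Out[40]: Tree('S', [Tree('NP', [Tree('NP', [Tree('DT', ['The']), Tree('JJ', ['old']), Tree('NN', ['oak']), Tree('NN', ['tree'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NNP', ['India'])])])]), Tree('VP', [Tree('VBD', ['fell']), Tree('PRT', [Tree('RP', ['down'])])])])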

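The following code, based on a Java implementation, uses a simplistic rule to find the head of the NP, but I need it to be based on the head-finding rules: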

from nltk.tree import Tree

parsestr = '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'

def traverse(t):
    try:
        t.label()
    except AttributeError:
        # t is a leaf (a plain string), so there is nothing to descend into
        return
    if t.label() == 'NP':
        print('NP:' + str(t.leaves()))
        # naive rule: take the last word, whatever its tag
        print('NPhead:' + str(t.leaves()[-1]))
    for child in t:
        traverse(child)

tree = Tree.fromstring(parsestr)
traverse(tree)
The above code gives the output:

NP:['The', 'old', 'oak', 'tree', 'from', 'India']
NPhead:India
NP:['The', 'old', 'oak', 'tree']
NPhead:tree
NP:['India']
NPhead:India

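Although it now gives the correct output for the given sentence, I need to incorporate a condition so that only the rightmost noun is extracted as the head. Currently it does not check whether the word is a noun (NN) at all:

print('NPhead:' + str(t.leaves()[-1]))

So I want something like the following in the NP head condition of the code above:

t.leaves().getrightmostnoun() 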
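
One way to read that pseudo-call: `leaves()` returns bare words without tags, so a noun check has to go through `pos()`, which yields (word, tag) pairs. A minimal sketch, assuming the argument behaves like an NLTK `Tree`; `rightmost_noun` and `NOUN_TAGS` are hypothetical names, not part of NLTK:

```python
# Hypothetical helper, not an NLTK API: find the rightmost noun under an NP.
NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}


def rightmost_noun(np):
    """Return the rightmost noun-tagged word under `np`, or None if there is none.

    Assumes `np` offers pos() like nltk.tree.Tree, returning (word, tag) pairs.
    """
    for word, tag in reversed(np.pos()):
        if tag in NOUN_TAGS:
            return word
    return None
```

For the outer NP "The old oak tree from India" this returns India rather than tree, because it looks at every word under the NP, including the embedded PP. A rightmost-noun check on its own is therefore not enough.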

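Michael Collins' dissertation (Appendix A) includes head-finding rules for the Penn Treebank, so it is not necessarily true that only the rightmost noun is the head. The condition above should therefore cover such cases as well.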

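Take the following example, given in one of the answers:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.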

NLTK has a built-in way to turn a string into a Tree object (http://www.nltk.org/_modules/nltk/tree.html), see https://github.com/nltk/nltk/blob/develop/nltk/tree.py#L541.

>>> from nltk.tree import Tree
>>> parsestr='(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))'
>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i)
... 
(NP
  (NP (DT The) (JJ old) (NN oak) (NN tree))
  (PP (IN from) (NP (NNP India))))
(NP (DT The) (JJ old) (NN oak) (NN tree))
(NP (NNP India))

>>> for i in Tree.fromstring(parsestr).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves())
... 
['The', 'old', 'oak', 'tree', 'from', 'India']
['The', 'old', 'oak', 'tree']
['India']

Note that the rightmost noun is not always the head noun of an NP, e.g.

>>> s = '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Carnac']), Tree('DT', ['the']), Tree('NN', ['Magnificent'])]), Tree('VP', [Tree('VBD', ['gave']), Tree('NP', [Tree('', [Tree('DT', ['a']), Tree('NN', ['talk'])])])])])])
>>> for i in Tree.fromstring(s).subtrees():
...     if i.label() == 'NP':
...         print(i.leaves()[-1])
... 
Magnificent
talk

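Arguably, Magnificent can still be the head noun. Another example is when the NP includes a relative clause:

(NP (NP the person) that gave (NP the talk)) went home

The head noun of the subject is person, but the last leaf node of the NP the person that gave the talk is talk.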
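
The head rules from Collins' thesis that the question mentions cope with this case, because they prefer a child NP over nouns buried inside a modifier. Below is a rough sketch of the NP rules only. The rule table is a from-memory summary and should be checked against Appendix A before relying on it. The names `NP_RULES`, `head_child` and `collins_np_head` are hypothetical, and nodes are assumed to behave like NLTK `Tree` objects whose leaves are plain strings:

```python
# Sketch of NP head finding in the style of Collins' head rules.
# Assumptions: nodes behave like nltk.tree.Tree (label(), iteration over
# children, leaves()), words are plain strings, and the rule table below is
# a from-memory summary that should be verified against the thesis.

# (search direction, labels) pairs, tried in order.
NP_RULES = [
    ('right-to-left', {'NN', 'NNP', 'NNPS', 'NNS', 'NX', 'POS', 'JJR'}),
    ('left-to-right', {'NP'}),
    ('right-to-left', {'$', 'ADJP', 'PRN'}),
    ('right-to-left', {'CD'}),
    ('right-to-left', {'JJ', 'JJS', 'RB', 'QP'}),
]


def head_child(np):
    """Return the child subtree of `np` chosen as head, or None if it has no subtrees."""
    children = [c for c in np if not isinstance(c, str)]
    if not children:
        return None
    # A trailing possessive marker is taken as the head.
    if children[-1].label() == 'POS':
        return children[-1]
    for direction, labels in NP_RULES:
        ordered = children[::-1] if direction == 'right-to-left' else children
        for child in ordered:
            if child.label() in labels:
                return child
    # Nothing matched: fall back to the last child.
    return children[-1]


def collins_np_head(np):
    """Return the head word of an NP by following head children downwards."""
    child = head_child(np)
    if child is None:
        return np.leaves()[-1]
    if child.label() == 'NP':
        return collins_np_head(child)
    # Preterminals yield their word. For other phrase types this sketch
    # simply takes the last word, since only the NP rules are implemented.
    return child.leaves()[-1]
```

For the relative-clause NP above, parsed with `Tree.fromstring`, no direct child is a noun, so the second rule selects the child NP the person and the head comes out as person rather than talk. For Carnac the Magnificent the first rule still picks Magnificent.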

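I was looking for a Python script using NLTK that does this task and stumbled across this post. Here's the solution I came up with. It's a bit noisy and arbitrary, and it definitely won't always pick the right answer (e.g. for compound nouns). But I wanted to post it in case a solution that mostly works is helpful to others.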

#!/usr/bin/env python
from nltk.tree import Tree
examples = [
    '(ROOT (S (NP (NP (DT The) (JJ old) (NN oak) (NN tree)) (PP (IN from) (NP (NNP India)))) (VP (VBD fell) (PRT (RP down)))))',
    "(ROOT\n  (S\n    (NP\n      (NP (DT the) (NN person))\n      (SBAR\n        (WHNP (WDT that))\n        (S\n          (VP (VBD gave)\n            (NP (DT the) (NN talk))))))\n    (VP (VBD went)\n      (NP (NN home)))))",
    '(ROOT (S (NP (NN Carnac) (DT the) (NN Magnificent)) (VP (VBD gave) (NP ((DT a) (NN talk))))))'
]
def find_noun_phrases(tree):
    return [subtree for subtree in tree.subtrees(lambda t: t.label()=='NP')]
def find_head_of_np(np):
    noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']
    top_level_trees = [np[i] for i in range(len(np)) if type(np[i]) is Tree]
    ## search for a top-level noun
    top_level_nouns = [t for t in top_level_trees if t.label() in noun_tags]
    if len(top_level_nouns) > 0:
        ## if you find some, pick the rightmost one, just 'cause
        return top_level_nouns[-1][0]
    else:
        ## search for a top-level np
        top_level_nps = [t for t in top_level_trees if t.label()=='NP']
        if len(top_level_nps) > 0:
            ## if you find some, pick the head of the rightmost one, just 'cause
            return find_head_of_np(top_level_nps[-1])
        else:
            ## search for any noun
            nouns = [p[0] for p in np.pos() if p[1] in noun_tags]
            if len(nouns) > 0:
                ## if you find some, pick the rightmost one, just 'cause
                return nouns[-1]
            else:
                ## return the rightmost word, just 'cause
                return np.leaves()[-1]
for example in examples:
    tree = Tree.fromstring(example)
    for np in find_noun_phrases(tree):
        print("noun phrase:", " ".join(np.leaves()))
        head = find_head_of_np(np)
        print("head:", head)

For the examples discussed in the question and in the other answers, this is the output:

noun phrase: The old oak tree from India
head: tree
noun phrase: The old oak tree
head: tree
noun phrase: India
head: India
noun phrase: the person that gave the talk
head: person
noun phrase: the person
head: person
noun phrase: the talk
head: talk
noun phrase: home
head: home
noun phrase: Carnac the Magnificent
head: Magnificent
noun phrase: a talk
head: talk
