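I have this code, which is supposed to display the syntactic structure of a sentence according to the grammar I defined. However, it returns an empty []. What am I missing or doing wrong?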
import nltk
grammar = nltk.parse_cfg("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
""")
def main():
    sent = "Kim arrived or Dana left and everyone cheered".split()
    parser = nltk.ChartParser(grammar)
    trees = parser.nbest_parse(sent)
    for tree in trees:
        print tree

if __name__ == '__main__':
    main()
Let's do some reverse engineering:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
It seems the rules cannot even recognize the first word as an NP. So let's try injecting NP -> N:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]
Now that this works, let's continue with the rest of "Kim arrived or Dana and":
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
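It seems there is no way to get a VP, with or without the P: a V needs an NP before it can take a P, or it must already be a VP higher up the tree. So let's relax the rules and use VP -> V PP instead of VP -> VP PP: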
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]
Okay, we are getting closer, but it seems the next word breaks the CFG rules again:
>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>>
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
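The word-by-word probing above can also be automated. The following is a minimal sketch, not part of the original answer: it assumes NLTK 3 or later, where nltk.CFG.fromstring and parser.parse take the place of nltk.parse_cfg and nbest_parse, and the helper name count_parses_per_prefix is made up for illustration. It feeds every prefix of the sentence to the relaxed grammar and counts the parses.

```python
import nltk

# Relaxed grammar from the transcript above (Det is still undefined on purpose).
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | N
VP -> V NP | V PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' | 'cheered'
P -> 'or' | 'and'
""")
parser = nltk.ChartParser(grammar)


def count_parses_per_prefix(tokens):
    """Hypothetical helper: number of parses for each prefix of tokens."""
    return [len(list(parser.parse(tokens[:i])))
            for i in range(1, len(tokens) + 1)]


tokens = "Kim arrived or Dana left and everyone cheered".split()
counts = count_parses_per_prefix(tokens)
for i, n in enumerate(counts, start=1):
    print("%-50s %d parse(s)" % (" ".join(tokens[:i]), n))
```

Only the prefix "Kim arrived or Dana" gets a parse, which matches the transcripts above.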
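So I hope the examples above show you that it is hard to keep changing the rules to absorb linguistic phenomena from left to right.
Instead of going left to right to achieve
[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]
why not try writing rules that follow the language more closely, to achieve: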
[[[Kim arrived] or [Dana left]] and [everyone cheered]]
[[Kim arrived] or [[Dana left] and [everyone cheered]]]
Try this:
import nltk
grammar = nltk.parse_cfg("""
S -> CP | VP
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")
print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t
[out]:
======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))
======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))
=== Kim arrived or Dana left and everyone cheered ====
(S
  (CP
    (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left)))
    (C and)
    (VP (NP everyone) (V cheered))))
(S
  (CP
    (VP (NP Kim) (V arrived))
    (C or)
    (CP
      (VP (NP Dana) (V left))
      (C and)
      (VP (NP everyone) (V cheered)))))
The solution above shows that the CFG rules need to be robust enough to capture not only the full sentence but also parts of it.
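The snippets in this thread use NLTK 2 names (nltk.parse_cfg, nbest_parse) and the Python 2 print statement. As a rough sketch of the same grammar on newer versions, assuming NLTK 3 or later is installed, nltk.CFG.fromstring builds the grammar and parser.parse yields the trees. The helper name parses is only for illustration.

```python
import nltk

# Same grammar as above; nltk.CFG.fromstring replaces nltk.parse_cfg in NLTK 3.
grammar = nltk.CFG.fromstring("""
S -> CP | VP
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' | 'cheered'
C -> 'or' | 'and'
""")
parser = nltk.ChartParser(grammar)


def parses(sentence):
    """Illustrative helper: all parse trees the grammar licenses."""
    # parser.parse returns an iterator, so materialize it into a list.
    return list(parser.parse(sentence.split()))


for sentence in ["Kim arrived",
                 "Kim arrived or Dana left",
                 "Kim arrived or Dana left and everyone cheered"]:
    trees = parses(sentence)
    print("=== %s: %d parse(s) ===" % (sentence, len(trees)))
    for tree in trees:
        print(tree)
```

The full sentence gets two trees, one per attachment of "and", as in the output shown above.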
Your grammar has no definition for Det, yet by the grammar's own rules every NP (and consequently every S) must contain one.
Compare with:
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... NP -> Det N | Det N PP
... VP -> V NP | VP PP
... Det -> 'a' | 'the'
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... """)
>>>
>>> parser = nltk.ChartParser(grammar)
>>> parser.nbest_parse('the Kim left a Dana'.split())
[Tree('S', [Tree('NP', [Tree('Det', ['the']), Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['left']), Tree('NP', [Tree('Det', ['a']), Tree('N', ['Dana'])])])])]
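A missing definition like Det can also be caught mechanically before parsing. The following is a small sketch in plain Python, not an NLTK feature: the function name undefined_nonterminals is made up, and it assumes the grammar is written one rule per line with single-quoted terminals, as in this thread.

```python
import re


def undefined_nonterminals(grammar_text):
    """Return symbols used on a right-hand side that have no rule of their own."""
    defined, used = set(), set()
    for line in grammar_text.strip().splitlines():
        if "->" not in line:
            continue
        lhs, rhs = line.split("->", 1)
        defined.add(lhs.strip())
        # Blank out quoted terminals, then collect the remaining bare symbols.
        rhs_without_terminals = re.sub(r"'[^']*'", " ", rhs)
        used.update(re.findall(r"[A-Za-z_]\w*", rhs_without_terminals))
    return used - defined


question_grammar = """
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
"""

missing = undefined_nonterminals(question_grammar)
print(sorted(missing))  # ['Det']
```

Since every NP rule in the question's grammar starts with Det, and Det can never be produced, no NP and therefore no S can be built.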