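NLTK RegEx Chunker is not capturing defined grammar patterns with wildcards

I am trying to chunk a sentence using NLTK's POS tags as regular expressions. Two rules are defined to identify phrases, based on the tags of the words in the sentence.

Mainly, I want to capture a chunk of one or more verbs followed by an optional determiner and then one or more nouns at the end. This is the first rule in the grammar, but it is not being captured as a phrase chunk.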

import nltk
## Defining the POS tagger 
tagger = nltk.data.load(nltk.tag._POS_TAGGER)

## A Single sentence - input text value
textv="This has allowed the device to start, and I then see glitches which is not nice."
tagged_text = tagger.tag(textv.split())
## Defining Grammar rules for  Phrases
actphgrammar = r"""
     Ph: {<VB*>+<DT>?<NN*>+}  # verbal phrase - one or more verbs followed by optional determiner, and one or more nouns at the end
     {<RB*><VB*|JJ*|NN*$>} # Adverbial phrase - Adverb followed by adjective / Noun or Verb
     """
### Parsing the defined grammar for  phrases
actp = nltk.RegexpParser(actphgrammar)
actphrases = actp.parse(tagged_text)

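The input to the chunker, tagged_text, is as below.

tagged_text
Out[7]: [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VB'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'NNP')]

In the final output, only the adverbial phrase ('then see') matching the second rule is captured. I expected the verbal phrase ('allowed the device') to match the first rule and be captured as well, but it is not.

actphrases
Out[8]: Tree('S', [('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN'), ('to', 'TO'), ('start,', 'NNP'), ('and', 'CC'), ('I', 'PRP'), Tree('Ph', [('then', 'RB'), ('see', 'VB')]), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'NNP')])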

The NLTK version used is 2.0.5 (Python 2.7). Any help or suggestions would be appreciated.

Thanks in advance,

Bala.

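You are close, but a minor change to your regex will get you the desired output. When you want a wildcard in the RegexpParser grammar, you should use .* instead of *, e.g. VB.* instead of VB*: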

>>> from nltk import word_tokenize, pos_tag, RegexpParser
>>> text = "This has allowed the device to start, and I then see glitches which is not nice."
>>> tagged_text = pos_tag(word_tokenize(text))    
>>> g = r"""
... VP: {<VB.*><DT><NN.*>}
... """
>>> p = RegexpParser(g); p.parse(tagged_text)
Tree('S', [('This', 'DT'), ('has', 'VBZ'), Tree('VP', [('allowed', 'VBN'), ('the', 'DT'), ('device', 'NN')]), ('to', 'TO'), ('start', 'VB'), (',', ','), ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VBP'), ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('nice', 'JJ'), ('.', '.')])

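Note that you are catching Tree(AdvP, [('then', 'RB'), ('see', 'VB')]) because those tags are exactly RB and VB. So the wildcard in your grammar (i.e. """AdvP: {<RB*><VB*|JJ*|NN*$>}""") is effectively ignored in this scenario.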
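To see why VB* and VB.* behave differently, it can help to test the tag patterns as ordinary regular expressions. The sketch below uses only Python's re module and a hypothetical helper, tag_matches, that is not part of NLTK. It assumes each pattern must match the whole tag, which approximates how the chunker treats the text inside <...> (NLTK applies some extra translation of its own, so this is an illustration rather than the library's actual code).

```python
import re

def tag_matches(tag_pattern, tag):
    """Hypothetical helper: True if the whole tag matches the pattern."""
    return re.fullmatch(tag_pattern, tag) is not None

# In "VB*" the * quantifies the preceding "B" (zero or more Bs),
# so it matches "V", "VB", "VBB", ... but not "VBN" or "VBZ".
# In "VB.*" the ".*" matches any trailing characters, so "VBN" is accepted.
for pattern in ("VB*", "VB.*"):
    for tag in ("VB", "VBN", "VBZ"):
        print(pattern, tag, tag_matches(pattern, tag))
```

Under this reading, the second rule only fired on ('then', 'RB'), ('see', 'VB') because RB and VB happen to be matched by RB* and VB* as they stand, while VBN and NNS are not matched by VB* and NN*.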

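Also, it is better to use two labels rather than one if these are two different types of phrases. And (I think) the end-of-string anchor after the wildcard is somewhat redundant, so it would be better as: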

g = r"""
VP:{<VB.*><DT><NN.*>} 
AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
"""
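As a rough end-to-end check of the two-label grammar, here is a minimal sketch that feeds it hand-written tags, so no tagger model has to be loaded. The tags are copied from the parse output shown earlier in this answer. The names tagged_text, grammar and chunks are just illustrative, and the sketch assumes an NLTK 3 release where subtrees expose label() (in NLTK 2.x the attribute was node).

```python
from nltk import RegexpParser

# Tags copied from the pos_tag output shown above (an assumption:
# your tagger may assign different tags to the same words).
tagged_text = [
    ('This', 'DT'), ('has', 'VBZ'), ('allowed', 'VBN'), ('the', 'DT'),
    ('device', 'NN'), ('to', 'TO'), ('start', 'VB'), (',', ','),
    ('and', 'CC'), ('I', 'PRP'), ('then', 'RB'), ('see', 'VBP'),
    ('glitches', 'NNS'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'),
    ('nice', 'JJ'), ('.', '.'),
]

# One label per phrase type, with .* as the wildcard inside each tag.
grammar = r"""
VP: {<VB.*><DT><NN.*>}
AdvP: {<RB.*><VB.*|JJ.*|NN.*>}
"""

tree = RegexpParser(grammar).parse(tagged_text)

# Collect (label, words) for every chunk below the root 'S' node.
chunks = [
    (subtree.label(), [word for word, tag in subtree.leaves()])
    for subtree in tree.subtrees()
    if subtree.label() != 'S'
]
print(chunks)
```

With these tags the VP rule should pick up "allowed the device", and the AdvP rule should pick up "then see" as well as "not nice", since RB followed by JJ also satisfies it.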
