How to write efficient code for extracting noun phrases

I am trying to extract phrases from POS-tagged text using rules such as the ones below:

1) NNP -> NNP (-> means "followed by")
2) NNP -> CC -> NNP
3) VP -> NP
etc.

I have written the code this way. Can someone tell me how to do it in a better way?

    List<String> nounPhrases = new ArrayList<String>();
    for (List<HasWord> sentence : documentPreprocessor) {
        //System.out.println(sentence.toString());
        System.out.println(Sentence.listToString(sentence, false));
        List<TaggedWord> tSentence = tagger.tagSentence(sentence);

        String lastTag = null, lastWord = null;
        for (TaggedWord taggedWord : tSentence) {
            if (lastTag != null && taggedWord.tag().equalsIgnoreCase("NNP") && lastTag.equalsIgnoreCase("NNP")) {
                // The previous word goes first so the phrase keeps its original word order.
                nounPhrases.add(lastWord + " " + taggedWord.word());
                //System.out.println(lastWord + " " + taggedWord.word());
            }
            lastTag = taggedWord.tag();
            lastWord = taggedWord.word();
        }
    }

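In the code above I have only handled the NNP followed by NNP case. How can I generalize it so that I can add other rules as well? I know there are libraries available that do this, but I want to do it manually.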

Maybe you should try using a chunker. You could try the OpenNLP Chunker. It looks like you are using the same tag set for POS. You can find the usage here:

http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.chunker

Sample input:

Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.

Output:

[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
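If the chunker output is available as a bracketed string in the format shown above, the noun phrases can be pulled out with a small amount of post-processing. The following is only a minimal sketch under that assumption: the class name `ChunkOutputReader` and the method `nounPhrases` are hypothetical and not part of OpenNLP, and the code assumes that no word contains a `]` character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads bracketed chunker output such as
// "[NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ]" and keeps the NP chunks.
class ChunkOutputReader {
    // Captures everything between "[NP " and the closing bracket.
    private static final Pattern NP_CHUNK = Pattern.compile("\\[NP ([^\\]]*)\\]");

    static List<String> nounPhrases(String chunked) {
        List<String> phrases = new ArrayList<>();
        Matcher m = NP_CHUNK.matcher(chunked);
        while (m.find()) {
            StringBuilder phrase = new StringBuilder();
            for (String token : m.group(1).trim().split("\\s+")) {
                // Each token looks like word_TAG; drop the tag after the last underscore.
                int sep = token.lastIndexOf('_');
                String word = sep < 0 ? token : token.substring(0, sep);
                if (phrase.length() > 0) {
                    phrase.append(' ');
                }
                phrase.append(word);
            }
            phrases.add(phrase.toString());
        }
        return phrases;
    }

    public static void main(String[] args) {
        String chunked = "[NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] [NP structural_JJ parts_NNS ]";
        System.out.println(nounPhrases(chunked)); // prints [Boeing Co., structural parts]
    }
}
```

Splitting on the last underscore rather than the first keeps words that themselves contain an underscore intact.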

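Most existing library implementations do build a finite state machine to implement this. They are reliable, efficient and open. However, a very naive implementation idea would be to formulate regular expressions over the array of POS tags and then use the match offsets to mark the phrases. It sounds logical and simple, but it may not be correct.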
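The naive regular-expression idea can be sketched roughly as follows. This is an illustration under stated assumptions, not a tested chunker: the class name `TagRegexChunker`, the method `extract`, and the two rules are hypothetical, and the words and tags are assumed to arrive as parallel arrays. The tags are joined into one string with a single space after each tag, the rules are matched against that string, and the character offsets of each match are mapped back to token indexes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagRegexChunker {
    // Hypothetical rules, written against a string in which every tag is
    // followed by exactly one space. The longer rule is listed first so the
    // alternation prefers it.
    static final Pattern RULES = Pattern.compile(
            "(?<![^ ])(?:"               // a match may only start at a tag boundary
            + "(?:NNP )+CC (?:NNP )+"    // NNP -> CC -> NNP
            + "|(?:NNP ){2,}"            // NNP -> NNP
            + ")");

    static List<String> extract(String[] words, String[] tags) {
        // starts[i] is the character offset of tag i; starts[n] is the total length.
        int[] starts = new int[tags.length + 1];
        StringBuilder tagString = new StringBuilder();
        for (int i = 0; i < tags.length; i++) {
            starts[i] = tagString.length();
            tagString.append(tags[i]).append(' ');
        }
        starts[tags.length] = tagString.length();

        List<String> phrases = new ArrayList<>();
        Matcher m = RULES.matcher(tagString);
        while (m.find()) {
            // Every match begins and ends on a tag boundary, so both offsets
            // are present in starts and the search returns an exact index.
            int from = Arrays.binarySearch(starts, m.start());
            int to = Arrays.binarySearch(starts, m.end());
            phrases.add(String.join(" ", Arrays.copyOfRange(words, from, to)));
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = {"Boeing", "Co.", "and", "Rockwell", "signed", "a", "contract"};
        String[] tags = {"NNP", "NNP", "CC", "NNP", "VBD", "DT", "NN"};
        System.out.println(extract(words, tags)); // prints [Boeing Co. and Rockwell]
    }
}
```

New rules are added as further alternatives in `RULES`. Tags that contain regex metacharacters, such as `PRP$`, would need escaping, and overlapping or nested rules are resolved only by alternation order, which is one reason this approach may give wrong results where a proper chunker would not.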
