Get consecutive word combinations of up to five words



I am trying to get every sequence of up to five consecutive words from a sentence. I have this input:

Pacific Ocean is the largest of the Earth's oceanic divisions

The output should look like this:

 Pacific
 Pacific Ocean
 Pacific Ocean is
 Pacific Ocean is the
 Pacific Ocean is the largest
 Ocean
 Ocean is
 Ocean is the
 Ocean is the largest
 Ocean is the largest of
 is
 is the
 is the largest
 is the largest of
 is the largest of the
 the
 the largest
 the largest of
 the largest of the
 the largest of the Earth's
 largest
 largest of
 largest of the
 largest of the Earth's
 largest of the Earth's oceanic
 of
 of the
 of the Earth's
 of the Earth's oceanic
 of the Earth's oceanic divisions
 the
 the Earth's
 the Earth's oceanic
 the Earth's oceanic divisions
 Earth's
 Earth's oceanic
 Earth's oceanic divisions
 oceanic
 oceanic divisions
 divisions

My attempt:

public void getComb(String line) {
    String words[] = line.split(" ");
    int count = 0;
    for (int i = 0; i < words.length; i++) {
        String word = "";
        int m = i;
        while (count < 5) {
            count++;
            word += " " + words[m];
            System.out.println(word);
            m++;
        }
    }
}

But the output is wrong! Output:

 Pacific
 Pacific Ocean
 Pacific Ocean is
 Pacific Ocean is the
 Pacific Ocean is the largest

How can I fix this?

Use a nested for loop instead of a while loop, and advance the starting word in the outer loop:

public static void getComb(String line) {
    String words[] = line.split(" ");
    for (int i = 0; i < words.length; i++) {
        String word = "";
        for (int w = i; w < ((i + 5 < words.length) ? (i + 5) : words.length); w++) {
            word += " " + words[w];
            System.out.println(word);
        }
    }
}

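Note the ((i + 5 < words.length) ? (i + 5) : words.length) in the condition of the inner for loop. It is needed so that the loop does not access elements outside the array when fewer than five words remain. Without it, you would get an ArrayIndexOutOfBoundsException.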
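The same bound can be written with Math.min, which some readers find easier to follow than the ternary. The following is only a sketch under a few assumptions: the class name WordCombinations, the method name combinations and the maxLen parameter are made up for illustration, the result is collected into a list instead of being printed, and the combinations carry no leading space (unlike the output shown in the question).

```java
import java.util.ArrayList;
import java.util.List;

class WordCombinations {
    // Hypothetical helper: returns every run of 1..maxLen consecutive words.
    static List<String> combinations(String line, int maxLen) {
        String[] words = line.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            StringBuilder sb = new StringBuilder();
            // Stop at the end of the array when fewer than maxLen words remain.
            int end = Math.min(i + maxLen, words.length);
            for (int w = i; w < end; w++) {
                if (w > i) {
                    sb.append(' ');
                }
                sb.append(words[w]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "Pacific Ocean is the largest of the Earth's oceanic divisions";
        for (String s : combinations(input, 5)) {
            System.out.println(s);
        }
    }
}
```

Returning a list keeps the logic testable and lets the caller decide how to print it. StringBuilder avoids creating a new String on every concatenation.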

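Move the count = 0 statement inside the outer loop. In your version count is never reset, so once it reaches 5 the while loop never runs again and only the combinations for the first word are printed: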

public void getComb(String line) {
    String words[] = line.split(" ");
    for (int i = 0; i < words.length; i++) {
        int count = 0;   // RESET COUNT
        String word = "";
        int m = i;
        while (count < 5 && m < words.length) { // NO EXCEPTION with 'm' limit
            count++;
            word += " " + words[m];
            System.out.println(word);
            m++;
        }
    }
}

Formally, you want to find the n-grams of size 1, 2, 3, 4 and 5 in a string. The ShingleFilter class in the Apache Lucene library can be used for this. From the JavaDoc:

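A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token. For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".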
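To make the JavaDoc example concrete without pulling in Lucene, here is a minimal plain-Java sketch that produces shingles of one fixed size. It is not the ShingleFilter API: the class name FixedSizeShingles and the method shingles are hypothetical, tokens are simply split on whitespace, and n is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixedSizeShingles {
    // Hypothetical helper, not part of Lucene: returns all n-grams of exactly n tokens.
    static List<String> shingles(String sentence, int n) {
        String[] tokens = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // The last valid start index is tokens.length - n.
        for (int i = 0; i + n <= tokens.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // With n = 2 this yields the five shingles listed in the JavaDoc excerpt.
        System.out.println(shingles("please divide this sentence into shingles", 2));
    }
}
```

Calling shingles once for each n from 1 to 5 gives the same set of strings the question asks for, although grouped by size rather than by starting word.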

Try the following method. It is a modified version of Andy's answer:

public void getComb(String line)
{
    String words[] = line.split(" ");
    for(int i=0;i<words.length;i++)
    {
        int count=0;   //******* RESET COUNT *****//
        String word = "";
        int m=i;
        while(count<5 && m < words.length) // bound by the array length, not a hard-coded 10
        {
            count++;
            word += " "+words[m];
            System.out.println(word);
            m++;
        }
    }
}
