How to split a text file into an ArrayList on all types of punctuation



This is the code I have so far:

import java.util.*;
import java.io.*;
public class Alice {
    public static void main(String[] args) throws IOException {
        /*
         * To put the text document into an ArrayList
         */
        Scanner newScanner = new Scanner(new File("ALICES ADVENTURES IN WONDERLAND.txt"));
        ArrayList<String> list = new ArrayList<String>();
        while (newScanner.hasNext()) {
            list.add(newScanner.next());
        }
        newScanner.close();
    }
}

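I now need to split the document on all punctuation marks, while still being able to perform string operations on the words in the text. Any help would be appreciated.

The input is the entire book Alice's Adventures in Wonderland, and I need the output to look like this:

"This book is for use, etc."

Basically, all the words are separated and all punctuation is removed from the document.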

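import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> list = new ArrayList<>();
// \w+ matches each run of word characters, so punctuation and whitespace are skipped
Pattern wordPattern = Pattern.compile("\\w+");
try (BufferedReader reader = new BufferedReader(new FileReader("ALICES ADVENTURES IN WONDERLAND.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        Matcher matcher = wordPattern.matcher(line);
        while (matcher.find()) {
            list.add(matcher.group());
        }
    }
}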
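
For anyone who wants to try the matching approach above without the book file, here is a minimal self-contained sketch. The class name WordExtractor, the method extractWords, and the sample sentence are made up for illustration; only the \w+ pattern comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // \w+ matches runs of letters, digits, and underscores (ASCII only by default).
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Hypothetical helper: returns every word-character run found in the text.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample sentence standing in for one line of the book.
        List<String> words = extractWords("Down, down, down. Would the fall never end?");
        System.out.println(words); // [Down, down, down, Would, the, fall, never, end]

        // Each element is an ordinary String, so string operations still work.
        System.out.println(words.get(3).toUpperCase()); // WOULD
    }
}
```

One thing to be aware of: the apostrophe is not a word character, so a token like Alice's comes out as two entries, Alice and s. If that matters, the pattern would need to be adjusted.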

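You can use the regex character class \p{Punct} as the delimiter. The code below uses the pattern \p{Punct}. (a punctuation character followed by any one character) and gives the following output.

Code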

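// Imports needed: java.util.ArrayList, java.util.List, java.util.Scanner, java.util.regex.Pattern
// The backslash must be escaped in a Java string literal. Note that the trailing '.'
// also consumes the one character that follows each punctuation mark.
String regex = "\\p{Punct}.";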
String phrase = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";
Scanner scanner = new Scanner(phrase);
scanner.useDelimiter(Pattern.compile(regex));
List<String> list = new ArrayList<>(); // declare the variable by its interface type (List) where possible
while (scanner.hasNext()) {
    list.add(scanner.next());
}
list.forEach(System.out::println);
scanner.close();

Result

Lorem Ipsum is simply dummy text of the printing and typesetting industry
Lorem Ipsum has been the industry
 standard dummy text ever since the 1500s
when an unknown printer took a galley of type and scrambled it to make a type specimen book
It has survived not only five centuries
but also the leap into electronic typesetting
remaining essentially unchanged
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
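
The result above is split into phrases rather than single words. If the goal is one word per list entry, as the question asks, one possible variation is to treat any run of punctuation or whitespace as the delimiter. The sketch below assumes that reading of the question; the class name PunctuationSplitter, the method splitIntoWords, and the pattern [\p{Punct}\s]+ are my own choices, not part of the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class PunctuationSplitter {
    // Hypothetical helper: splits text wherever one or more punctuation
    // or whitespace characters occur, so each token is a single word.
    static List<String> splitIntoWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("[\\p{Punct}\\s]+");
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String phrase = "the industry's standard, since the 1500s.";
        // Prints one word per line: the, industry, s, standard, since, the, 1500s
        splitIntoWords(phrase).forEach(System.out::println);
    }
}
```

The same delimiter can be set on a Scanner that wraps the File from the question. As with \w+, the apostrophe counts as punctuation here, so industry's becomes industry and s.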
