How can I use a regular expression to remove duplicate words from a file when the duplicates are not adjacent within a line?



I want to use a regular expression to remove all duplicate words from a file.

For example:

The university of Hawaii university began using began radio. 

Output:

The university of Hawaii began using radio. 

I wrote this regex:

String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";

It only removes duplicates that directly follow one another.

For example: The university university of Hawaii Hawaii began using radio.

Output: The university of Hawaii began using radio.
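To make the limitation concrete, here is a minimal sketch that applies the question's pattern to both sample sentences. The class and method names (AdjacentDuplicates, collapseAdjacent) are made up for illustration, and the backslashes are doubled as a Java string literal requires.

```java
import java.util.regex.Pattern;

final class AdjacentDuplicates {
    // The pattern from the question: a word followed by one or more repeats of itself.
    private static final Pattern ADJACENT = Pattern.compile(
            "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces each run of adjacent repeats with the first word of the run.
    static String collapseAdjacent(String line) {
        return ADJACENT.matcher(line).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Adjacent repeats are collapsed.
        System.out.println(collapseAdjacent(
                "The university university of Hawaii Hawaii began using radio."));
        // Repeats separated by other words are left untouched.
        System.out.println(collapseAdjacent(
                "The university of Hawaii university began using began radio."));
    }
}
```

The first call prints the cleaned sentence, while the second returns its input unchanged, because the backreference \1 must come right after the whitespace that follows the first occurrence.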

My regex code:

File dir = new File("C:/Users/Arnoldas/workspace/uplo/");

String source = dir.getCanonicalPath() + File.separator + "Output.txt";
String dest = dir.getCanonicalPath() + File.separator + "Final.txt";
File fin = new File(source);
FileInputStream fis = new FileInputStream(fin);
BufferedReader in = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
//FileWriter fstream = new FileWriter(dest, true);
OutputStreamWriter fstream = new OutputStreamWriter(new FileOutputStream(dest, true), "UTF-8");
BufferedWriter out = new BufferedWriter(fstream);
String regex = "\\b(\\p{IsAlphabetic}+)(\\s+\\1\\b)+";
//String regex = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
String aLine;
while ((aLine = in.readLine()) != null) {
Matcher m = p.matcher(aLine);
while (m.find()) {
aLine = aLine.replaceAll(m.group(), m.group(1));
}
//Process each line and add output to *.txt file
out.write(aLine);
out.newLine();
out.flush();
}

You can use Streams instead:

String s = "The university university of Hawaii Hawaii began using radio.";
System.out.println(Arrays.asList(s.split(" ")).stream().distinct().collect(Collectors.joining(" ")));

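In this example, the string is split on spaces and converted to a stream. Duplicates are removed with distinct(), and finally all the elements are joined back together with spaces.

However, this approach has a problem with the period at the end: "radio." and "radio" are different words.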
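The punctuation issue can be reproduced with a short sketch. The class and method names (DistinctWords, distinctTokens) and the second input string are invented for illustration; the stream pipeline is the one from the answer, written with Arrays.stream.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

final class DistinctWords {
    // distinct() compares tokens verbatim, so attached punctuation is part of the token.
    static String distinctTokens(String s) {
        return Arrays.stream(s.split(" "))
                .distinct()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Non-adjacent repeats are removed, since distinct() looks at the whole stream.
        System.out.println(distinctTokens(
                "The university of Hawaii university began using began radio."));
        // "radio" and "radio." differ by the trailing period, so both survive.
        System.out.println(distinctTokens("began using radio radio."));
    }
}
```

Note that distinct() is case-sensitive as well, so "The" and "the" would also both be kept.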

Try this regex:

\b(\w+)\s+\1\b
Here \b is a word boundary and \1 references the captured match of the first group.
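As a rough sketch of how this pattern could be used from Java (the class and method names ConsecutiveRepeat and dropConsecutive are hypothetical), each backslash is doubled in the string literal and the replacement "$1" keeps only the first word of each pair:

```java
final class ConsecutiveRepeat {
    // Replaces "word word" with "word"; only directly consecutive repeats are matched.
    static String dropConsecutive(String s) {
        return s.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dropConsecutive(
                "The university university of Hawaii Hawaii began using radio."));
    }
}
```

Because the pattern matches exactly one repeat, a single pass over a triple such as "began began began" still leaves "began began"; the question's pattern avoids this by wrapping the repeated part in a (...)+ group.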

Source: Regular expression for consecutive duplicate words

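You are on the right track, but if there can be text between the repetitions, the replacement has to be done in a loop (to handle "began ... began ... began").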

String s = "The university of Hawaii university began using began radio.";
for (;;) {
    String t = s.replaceAll("(?i)\\b(\\p{IsAlphabetic}+)\\b(.*?)\\s*\\b\\1\\b",
            "$1$2");
if (t.equals(s)) {
break;
}
s = t;
}

For case-insensitive replacement, use (?i).

This is very inefficient, because the regex has to backtrack.

Simply throw all the words into a Set.

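// Java 10+ (immutable result; Set.of(array) would throw on duplicate elements)
Set<String> corpus = Set.copyOf(Arrays.asList(s.split("\\P{IsAlphabetic}+")));
// Older Java (alternative to the line above):
Set<String> corpus = new TreeSet<>();
Collections.addAll(corpus, s.split("\\P{IsAlphabetic}+"));
corpus.remove("");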
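A TreeSet sorts the words and Set.of/Set.copyOf give no ordering guarantee, so neither can be joined back into the original sentence. If first-occurrence order matters, one possible variation (not part of the answer above; the names UniqueWordSet and uniqueWords are invented here) is a LinkedHashSet:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class UniqueWordSet {
    // Splits on runs of non-alphabetic characters and keeps each word once,
    // in order of first appearance. Comparison is case-sensitive.
    static Set<String> uniqueWords(String s) {
        Set<String> words = new LinkedHashSet<>(
                Arrays.asList(s.split("\\P{IsAlphabetic}+")));
        words.remove(""); // a leading separator yields one empty token
        return words;
    }

    public static void main(String[] args) {
        Set<String> words = uniqueWords(
                "The university of Hawaii university began using began radio.");
        System.out.println(String.join(" ", words));
    }
}
```

Splitting this way discards punctuation and the original spacing, so the joined result has no trailing period; it is better suited to collecting a vocabulary than to rewriting a line in place.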

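After the comments:

  • Corrected the original code
  • Uses the newer Files and Paths style of I/O, but still without Streams
  • Try-with-resources, which closes in and out automatically
  • The regex is only used to find words with optional leading whitespace. A Set is used to check for duplicates.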

    Path dir = Paths.get("C:/Users/Arnoldas/workspace/uplo");
    Path source = dir.resolve("Output.txt");
    Path dest = dir.resolve("Final.txt");
    // Group 1: optional leading whitespace; group 2: the word itself.
    String regex = "(\\s*)\\b(\\p{IsAlphabetic}+)\\b";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    try (BufferedReader in = Files.newBufferedReader(source);
            BufferedWriter out = Files.newBufferedWriter(dest)) {
        String line;
        while ((line = in.readLine()) != null) {
            // Words already seen on this line, lower-cased for comparison.
            Set<String> words = new HashSet<>();
            Matcher m = p.matcher(line);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                boolean added = words.add(m.group(2).toLowerCase());
                // Keep the first occurrence; drop repeats together with their leading whitespace.
                m.appendReplacement(sb, added ? m.group() : "");
            }
            m.appendTail(sb);
            out.write(sb.toString());
            out.newLine();
        }
    }
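To try the per-line logic without touching any files, the loop body can be pulled into a helper that works on a single string. This is only a sketch; the names LineDeduplicator and dedupe are hypothetical, and Matcher.quoteReplacement is added as a precaution even though the matched text here cannot contain $ or \.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineDeduplicator {
    // Group 1: optional leading whitespace; group 2: the word.
    private static final Pattern WORD =
            Pattern.compile("(\\s*)\\b(\\p{IsAlphabetic}+)\\b");

    // Keeps the first occurrence of each word (compared case-insensitively)
    // and removes later occurrences together with their leading whitespace.
    static String dedupe(String line) {
        Set<String> seen = new HashSet<>();
        Matcher m = WORD.matcher(line);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            boolean firstTime = seen.add(m.group(2).toLowerCase());
            m.appendReplacement(sb,
                    firstTime ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(sb); // copies the remaining text, e.g. the final period
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dedupe(
                "The university of Hawaii university began using began radio."));
    }
}
```

Since the set is created per call, duplicates are only detected within one line; sharing one set across all lines would extend the check to the whole file.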
    
