Duplicate word frequencies problem in Java

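I am new to Java and to Stack Overflow. My last question was closed, so this time I have included the complete code. I have a large 4 GB text file (vocab.txt) that contains only Bangla (Unicode) words. Each word is on its own line together with its frequency, separated by an equals sign. For example: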

আমার=5 
তুমি=3
সে=4 
আমার=3 //duplicate of the 1st word, with a different frequency
করিম=8
সে=7    //duplicate of the 3rd word, with a different frequency

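As you can see, the same word appears several times with different frequencies. How can I keep only one entry per word (instead of several duplicates) and add up all the frequencies of the duplicates? For example, the file above should become (output.txt):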

আমার=8   //5+3
তুমি=3
সে=11      //4+7
করিম=8

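I have used a HashMap to solve this, but I think I made a mistake somewhere. The program runs, but the output file contains exactly the same data as the input, with nothing changed.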

package data_correction;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.OutputStreamWriter;
import java.util.*;
import java.awt.Toolkit;

public class Main {
    public static void main(String args[]) throws Exception {
        FileInputStream inputStream = null;
        Scanner sc = null;
        String path = "C:\\DATA\\vocab.txt";
        FileOutputStream fos = new FileOutputStream("C:\\DATA\\output.txt", true);

        BufferedWriter bufferedWriter = new BufferedWriter(
                new OutputStreamWriter(fos, "UTF-8"));
        try {
            System.out.println("Started!!");
            inputStream = new FileInputStream(path);
            sc = new Scanner(inputStream, "UTF-8");
            while (sc.hasNextLine()) {
                String line = sc.nextLine();
                line = line.trim();
                String[] arr = line.split("=");
                Map<String, Integer> map = new HashMap<>();
                if (!map.containsKey(arr[0])) {
                    map.put(arr[0], Integer.parseInt(arr[1]));
                } else {
                    map.put(arr[0], map.get(arr[0]) + Integer.parseInt(arr[1]));
                }
                for (Map.Entry<String, Integer> each : map.entrySet()) {
                    bufferedWriter.write(each.getKey() + "=" + each.getValue() + "\n");
                }
            }
            bufferedWriter.close();
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (sc != null) {
                sc.close();
            }
        }
        System.out.print("FINISH");
        Toolkit.getDefaultToolkit().beep();
    }
}

Thank you for your time.

This should do what you want, with a bit more Java magic:

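import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class Main {
    public static void main(String[] args) throws Exception {
        String separator = "=";
        // One map for the whole file, so counts accumulate across lines.
        Map<String, Integer> map = new HashMap<>();
        // Files.lines reads lazily; try-with-resources closes the file.
        try (Stream<String> vocabs = Files.lines(new File("test.txt").toPath(), StandardCharsets.UTF_8)) {
            vocabs.forEach(vocab -> {
                // trim() drops trailing spaces that would break parseInt
                String[] pair = vocab.trim().split(separator);
                String key = pair[0];
                int value = Integer.parseInt(pair[1]);
                if (map.containsKey(key)) {
                    map.put(key, map.get(key) + value);
                } else {
                    map.put(key, value);
                }
            });
        }
        System.out.println(map);
    }
}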

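Replace test.txt with the correct file path. Note that the map is held in memory, so this may not be the best approach. If necessary, replace the map with a database-backed approach.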
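For completeness: the loop in the question creates a new HashMap for every line and writes it out immediately, which is why output.txt mirrors the input. Below is a minimal sketch of one way to produce output.txt, not a definitive implementation. It keeps a single map for the whole file and writes once at the end. The class name VocabMerger and the method names are hypothetical, chosen only for illustration. I am also assuming that first-seen word order matters (hence LinkedHashMap), that totals might exceed the int range on a 4 GB file (hence Long), and that malformed lines can simply be skipped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class; the name is an assumption for this sketch.
class VocabMerger {

    /** Sums the frequencies of repeated words, keeping first-seen order. */
    public static Map<String, Long> sumFrequencies(BufferedReader reader) throws IOException {
        // Created once, outside the loop, so totals survive across lines.
        Map<String, Long> totals = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            int eq = line.lastIndexOf('=');
            if (eq <= 0) {
                continue; // blank line, no '=', or empty word: skip (assumption)
            }
            String word = line.substring(0, eq).trim();
            try {
                long count = Long.parseLong(line.substring(eq + 1).trim());
                // merge() inserts the count, or adds it to the existing total.
                totals.merge(word, count, Long::sum);
            } catch (NumberFormatException e) {
                // Frequency is not a number: skip the line (assumption).
            }
        }
        return totals;
    }

    /** Reads the input file, sums duplicates, and overwrites the output file. */
    public static void mergeFile(Path input, Path output) throws IOException {
        Map<String, Long> totals;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            totals = sumFrequencies(reader);
        }
        // Written only after the whole input has been read.
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Long> entry : totals.entrySet()) {
                writer.write(entry.getKey() + "=" + entry.getValue());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java VocabMerger <input> <output>");
            return;
        }
        mergeFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The map holds one entry per distinct word, so memory use grows with the size of the vocabulary rather than with the size of the file. Whether that fits in the heap for your data is something you would have to check. The sketch also assumes the // annotations in the question's sample are not part of the real file, since a line carrying one would fail to parse and be skipped.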
