Are a bag of words and a document-term matrix the same thing?
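I have a training dataset made up of many files. I want to read all of them into one data structure (a hash map?) and build a bag-of-words model for documents in a particular category, such as science, religion, sports, or gender, in preparation for implementing a perceptron.
Right now I have the simplest Java I/O construct, i.e.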
String text;
BufferedReader br = new BufferedReader(new FileReader("file"));
while ((text = br.readLine()) != null)
{
    //read in multiple files
    //generate a hash map with each unique word
    //as a key and the frequency with which that
    //word appears as the value
}
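So what I want to do is read input from multiple files in a directory and store all the data in one underlying structure. How can I do that? Should I write it out to a file somewhere?
Based on my understanding of a bag of words, I think a hash map like the one described in the comments in the code above will work. Is that right? How could I implement such a thing that reads input from multiple files? And how should I store it so that I can later feed it into the perceptron algorithm?
I have seen it done like this: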
String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
StringBuffer strContent = new StringBuffer("");
for (String name : names) {
    File file = new File(name);
    int ch;
    FileInputStream stream = null;
    try {
        stream = new FileInputStream(file);
        while( (ch = stream.read()) != -1) {
            strContent.append((char) ch);
        }
    } finally {
        // stream is still null if the constructor threw
        if (stream != null) {
            stream.close();
        }
    }
}
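But this is a lame solution, because you have to specify all the files up front, and I think this should be more dynamic, if possible.
You can try the program below. It is dynamic: you only need to supply your directory path.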
import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BagOfWords {
    ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

    public static void main(String[] args) throws IOException {
        File file = new File("F:/Downloads/Build/");
        new BagOfWords().iterateDirectory(file);
    }

    private void iterateDirectory(File file) throws IOException {
        for (File f : file.listFiles()) {
            if (f.isDirectory()) {
                // Recurse into the subdirectory f, not the parent again
                iterateDirectory(f);
            } else {
                // Read File
                // Split and put it in a set
                // add to map
            }
        }
    }
}
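The three placeholder comments in the else branch could be filled in roughly as follows. This is only a sketch under assumptions: the class name `WordSetsByFile` and the helpers `collect` and `readWords` are hypothetical, the map is keyed by file path (the answer does not say what the key should be), and words are split on whitespace only.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class WordSetsByFile {

    // Walks dir recursively and maps each file path to its set of distinct words.
    // Keying by path is an assumption made for this sketch.
    public static Map<String, Set<String>> collect(File dir) throws IOException {
        Map<String, Set<String>> map = new ConcurrentHashMap<>();
        iterateDirectory(dir, map);
        return map;
    }

    private static void iterateDirectory(File dir, Map<String, Set<String>> map) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        for (File f : entries) {
            if (f.isDirectory()) {
                iterateDirectory(f, map);
            } else {
                map.put(f.getPath(), readWords(f));
            }
        }
    }

    // Reads one file line by line and collects its whitespace-separated tokens.
    public static Set<String> readWords(File f) throws IOException {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = br.readLine()) != null) {
                for (String w : line.trim().split("\\s+")) {
                    if (!w.isEmpty()) {
                        words.add(w);
                    }
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordSetsByFile <directory>");
            return;
        }
        collect(new File(args[0])).forEach((path, words) ->
                System.out.println(path + " -> " + words.size() + " distinct words"));
    }
}
```

A set per file records only whether a word occurs. If the perceptron needs frequencies, the value type would have to be a count map instead.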
I think this is close, but there is some discrepancy between int and Integer. How do I reconcile it?
// The value type must be Integer (the boxed type); generics cannot take the primitive int
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();

public static void main(String[] args) throws IOException
{
    String path = "path";
    File file = new File( path );
    new BagOfWords().iterateDirectory(file);
}

private void iterateDirectory(File file) throws IOException
{
    for (File f : file.listFiles())
    {
        if (f.isDirectory())
        {
            // Recurse into the subdirectory f, not the parent again
            iterateDirectory(f);
        }
        else
        {
            String line;
            // Open the current file f rather than the literal name "file"
            BufferedReader br = new BufferedReader(new FileReader(f));
            try
            {
                while ((line = br.readLine()) != null)
                {
                    String[] words = line.split(" ");//those are your words
                    String word;
                    for (int i = 0; i < words.length; i++)
                    {
                        word = words[i];
                        if (!map.containsKey(word))
                        {
                            map.put(word, 0);
                        }
                        // Auto-unboxing and boxing handle the Integer arithmetic
                        map.put(word, map.get(word) + 1);
                    }
                }
            }
            finally
            {
                br.close();
            }
        }
    }
}
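On the question in the title: a word-count map like the one above is a bag of words for whatever text was counted into it. A document-term matrix is what you get when you keep one bag per document and line them up over a shared vocabulary, one row per document and one column per word. Each row can then serve as a fixed-length feature vector for a perceptron. The following is a minimal sketch of that conversion, not a definitive implementation: the class `DocumentTermMatrix`, its method names, and the two toy documents are all hypothetical, and tokens are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class DocumentTermMatrix {

    // Bag of words for one document: token -> number of occurrences.
    public static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Shared vocabulary in sorted order; a word's index is its column.
    public static List<String> vocabulary(List<Map<String, Integer>> bags) {
        TreeSet<String> vocab = new TreeSet<>();
        for (Map<String, Integer> bag : bags) {
            vocab.addAll(bag.keySet());
        }
        return new ArrayList<>(vocab);
    }

    // Document-term matrix: one row per document, one column per word.
    public static int[][] toMatrix(List<Map<String, Integer>> bags, List<String> vocab) {
        int[][] matrix = new int[bags.size()][vocab.size()];
        for (int d = 0; d < bags.size(); d++) {
            for (int t = 0; t < vocab.size(); t++) {
                matrix[d][t] = bags.get(d).getOrDefault(vocab.get(t), 0);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Hypothetical toy documents standing in for the contents of two files.
        List<Map<String, Integer>> bags = new ArrayList<>();
        bags.add(bagOfWords("the cat sat"));
        bags.add(bagOfWords("the dog sat on the mat"));

        List<String> vocab = vocabulary(bags);
        System.out.println(vocab);
        for (int[] row : toMatrix(bags, vocab)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Sorting the vocabulary keeps the column order stable between runs, which matters because the perceptron's weights are tied to column positions.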