Iterating over the values to compute tf-idf in MapReduce



I am working on a reducer whose input (key, value) pairs have the following format:

  • key: word
  • value: file=frequency, where "file" is a file containing the word and "frequency" is the number of times the word occurs in that file

The reducer's output (key, value) pairs are:

  • key: word=file
  • value: the tf-idf of that word in that file

Before computing the tf-idf, the formula requires me to know two things (see the formula below the list):

  • the number of files containing the word (i.e., the key)
  • the word's frequency in each individual file
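
For reference, the standard formula being computed (with log base 10, matching the code below; N and df are just labels for the two quantities above) is:

tfidf(word, file) = tf(word, file) * log10(N / df(word))

where N is the total number of documents and df(word) is the number of documents containing the word.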

Somehow it seems I have to loop over the values twice: once to get the number of files containing the word, and a second time to compute the tf-idf.

Pseudocode below:

//calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of (filename=frequency)
        // sum up the number of files containing a particular word
        // for every filename=frequency in the values, compute the tf-idf of this
        // word in filename and output (word@filename, tfidf)
    }
}

I have read that it is not possible to iterate over the values twice. One option could be to "cache" the values; I tried that, but got erratic results.

//calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Note: key is a word, values are in the form of (filename=frequency)
        // cache each file's frequency for this word
        Map<String, Integer> tfs = new HashMap<>();
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            tfs.put(valueParts[0], Integer.parseInt(valueParts[1])); // do the necessary checks here
        }
        // set "noOfDocuments" in the Driver if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", -1);
        double IDF = Math.log10((double) numDocs / tfs.keySet().size());
        // for every filename=frequency in the values, compute the tf-idf of this
        // word in filename and output (word@filename, tfidf)
        for (String file : tfs.keySet()) {
            outputKey.set(key.toString() + "@" + file);
            outputValue.set(String.valueOf(tfs.get(file) * IDF)); // you could also use a DoubleWritable
            context.write(outputKey, outputValue);
        }
    }
}
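
For getInt() to see "noOfDocuments", it has to be set on the job's Configuration before the job is submitted. A minimal driver-side sketch, assuming the document count is known up front (the property name matches the one read in the reducer; the count itself is a placeholder):

// In the Driver, before submitting the job
Configuration conf = new Configuration();
conf.setInt("noOfDocuments", 20); // placeholder: total number of input documents
Job job = Job.getInstance(conf, "tf-idf");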

If you define tf as frequency / maxFrequency, you can find maxFrequency in the first loop and change outputValue accordingly, as sketched below.
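
A sketch of that variant, assuming only these lines change from the code above (maxFrequency is tracked while the map is filled):

// first loop: also track the highest frequency of this word in any file
int maxFrequency = 0;
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    int freq = Integer.parseInt(valueParts[1]);
    maxFrequency = Math.max(maxFrequency, freq);
    tfs.put(valueParts[0], freq);
}
// output loop: use the normalized tf instead of the raw frequency
double tf = (double) tfs.get(file) / maxFrequency;
outputValue.set(String.valueOf(tf * IDF));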

If you want to try a single-loop solution, you need the IDF up front, which means you first need the number of input values. In Java 8 you can do that with the trick:

long DF = values.spliterator().getExactSizeIfKnown();
double IDF = Math.log10((double)numDocs/DF);

as suggested in this post, or with one of the other suggestions in the same post that do not use a loop (otherwise, you can stick with the previous answer).

In that case, your code would be the following (I have not tried it):

//calculate tf-idf of every word in every document
public static class CalReducer extends Reducer<Text, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // set "noOfDocuments" in the Driver if you know it already, or set a
        // counter in the mapper and read it here using getCounter()
        int numDocs = context.getConfiguration().getInt("noOfDocuments", -1);
        // note: getExactSizeIfKnown() returns -1 if the spliterator
        // cannot report an exact size
        long DF = values.spliterator().getExactSizeIfKnown();
        double IDF = Math.log10((double) numDocs / DF);
        // Note: key is a word, values are in the form of (filename=frequency)
        for (Text value : values) {
            String[] valueParts = value.toString().split("=");
            outputKey.set(key.toString() + "@" + valueParts[0]);
            outputValue.set(String.valueOf(Integer.parseInt(valueParts[1]) * IDF));
            context.write(outputKey, outputValue);
        }
    }
}

This would also save some memory, since you would not need the extra Map (if it works).

EDIT: The code above assumes that you already have the total frequency of each word per filename, i.e., that the same filename never appears more than once among the values; you may want to verify that this holds. Otherwise, the second solution will not work, since you would have to sum up each file's total frequency in a first loop, as sketched below.
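
If duplicate filenames can occur, a minimal sketch of that summing step for the map-based solution, using Map.merge to accumulate instead of overwrite:

// accumulate frequencies so that repeated filenames are summed, not overwritten
for (Text value : values) {
    String[] valueParts = value.toString().split("=");
    tfs.merge(valueParts[0], Integer.parseInt(valueParts[1]), Integer::sum);
}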
