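Using MapReduce, how do I modify the following word count code so that it only outputs words whose count is at or above a certain threshold? (In other words, I want to add some filtering of the key-value pairs.)
Input: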
ant bee cat
bee cat dog
cat dog
Output (assuming a count threshold of 2 or more):
bee 2
cat 3
dog 2
The following code is from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code
public static class Map1 extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
public static class Reduce1 extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
EDIT: Re: the input/test case
The input file ("example.dat") and a simple test case ("testcase") can be found here: https://github.com/csiu/tokens/tree/master/other/SO-26695749
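EDIT:
The problem was not in the code. It was caused by some odd behavior in the org.apache.hadoop.mapred
package. (See: Is it better to use the mapred or the mapreduce package to create a Hadoop Job?)
Takeaway: use org.apache.hadoop.mapreduce instead.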
Try adding an if statement before collecting the output in reduce:
if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}
You can do the filtering in the Reduce1 class:
if (sum >= 2) {
    output.collect(key, new IntWritable(sum));
}
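To sanity-check the filter without a cluster, here is a minimal plain-Java sketch of the same logic. It is only a stand-in for the job, not Hadoop code: the class name ThresholdWordCount and the method countAtLeast are made up for illustration, and a TreeMap plays the role of the shuffle (grouping by key, sorted).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-in for the word count job; no Hadoop classes are used.
class ThresholdWordCount {

    // Counts every token, then keeps only words whose count is >= threshold.
    static Map<String, Integer> countAtLeast(List<String> lines, int threshold) {
        // "map" + "shuffle": add 1 per token, grouped and sorted by word
        Map<String, Integer> sums = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                sums.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        // "reduce" filter: the same test as `if (sum >= 2)` in Reduce1
        sums.values().removeIf(sum -> sum < threshold);
        return sums;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("ant bee cat", "bee cat dog", "cat dog");
        countAtLeast(input, 2).forEach((word, sum) -> System.out.println(word + " " + sum));
    }
}
```

For the sample input with a threshold of 2, this prints bee 2, cat 3 and dog 2. Note that bee also occurs twice, so it passes the filter along with cat and dog; only ant is dropped.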