WordCount MapReduce gives an unexpected result



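I am experimenting with the WordCount Java code in MapReduce. After the reduce method has finished, I want to output only the single word that occurs most often.

To do this, I created some class-level variables named myoutput, mykey and completeSum.

I write this data out in the close method, but in the end I get an unexpected result.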

public class WordCount {
public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
static int completeSum = -1;
static OutputCollector<Text, IntWritable> myoutput;
static Text mykey = new Text();
public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        if (completeSum < sum) {
            completeSum = sum;
            myoutput = output;
            mykey = key;
        }

    }
    @Override
    public void close() throws IOException {
        // TODO Auto-generated method stub
        super.close();
        myoutput.collect(mykey, new IntWritable(completeSum));
    }
}
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
}

Input file data

one 
three three three
four four four four 
 six six six six six six six six six six six six six six six six six six 
five five five five five 
seven seven seven seven seven seven seven seven seven seven seven seven seven 

The result should be

six 18

However, I get this result

three 18

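From the result I can see that the sum is correct, but the key is wrong.

If anyone can point me to a good reference on these map and reduce methods, that would be very helpful.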

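The problem you are observing is caused by reference aliasing. The framework reuses the object referenced by key, filling it with new contents on each call to reduce. Because mykey references that same object, its contents change as well, and you end up with the last key that was reduced ("three", since keys arrive in sorted order). You can avoid this by copying the object, for example: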

mykey = new Text(key);
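
The effect can be reproduced without Hadoop. The following is a minimal sketch, not Hadoop code: MutableKey is a hypothetical stand-in for Text (a mutable holder with a copy constructor), and maxKey is a hypothetical helper that feeds the question's word counts, in sorted key order, through one reused key object the way the framework does.

```java
// Hypothetical stand-in for Hadoop's Text: a mutable holder whose
// contents can be overwritten in place.
final class MutableKey {
    private String value = "";

    MutableKey() {}

    // Copy constructor, playing the role of new Text(key).
    MutableKey(MutableKey other) { this.value = other.value; }

    void set(String v) { value = v; }

    @Override
    public String toString() { return value; }
}

class AliasingDemo {
    // Passes every (word, count) pair through ONE reused key object and
    // remembers the key that had the largest count, either by storing the
    // reference (copy == false) or by storing a copy (copy == true).
    static String maxKey(String[] words, int[] counts, boolean copy) {
        MutableKey reused = new MutableKey();
        MutableKey best = new MutableKey();
        int bestCount = -1;
        for (int i = 0; i < words.length; i++) {
            reused.set(words[i]); // overwrites the contents, same object
            if (counts[i] > bestCount) {
                bestCount = counts[i];
                best = copy ? new MutableKey(reused) : reused;
            }
        }
        return best + " " + bestCount;
    }

    public static void main(String[] args) {
        // Counts from the question's input, in sorted key order.
        String[] words = {"five", "four", "one", "seven", "six", "three"};
        int[] counts = {5, 4, 1, 13, 18, 3};
        System.out.println(maxKey(words, counts, false)); // three 18
        System.out.println(maxKey(words, counts, true));  // six 18
    }
}
```

Without the copy, the remembered reference points at the reused object, which holds the last key by the time it is read. With the copy, the contents are frozen at the moment the maximum is found.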

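However, static variables are not shared between the nodes of a distributed cluster, so results can only be obtained from the output files. This approach works only in standalone mode, which defeats the purpose of map-reduce.

Finally, using global variables is likely to cause races even in standalone mode if parallel local tasks are used (see MAPREDUCE-1367 and MAPREDUCE-434).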
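
One way to avoid shared statics is to keep the running maximum in per-task instance state and combine the per-task results afterwards. The following is a plain-Java sketch of that idea only, not Hadoop code: LocalMax, MaxWordSketch, and the split of words between taskA and taskB are all hypothetical, chosen to illustrate the design.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the state of ONE reducer task. The running
// maximum lives in instance fields, so tasks never share it.
final class LocalMax {
    private String word = null;
    private int count = -1;

    // Called once per (word, total) pair seen by this task.
    void accept(String candidate, int sum) {
        if (sum > count) {
            count = sum;
            word = candidate; // String is immutable, so no aliasing here
        }
    }

    String word() { return word; }
    int count() { return count; }
}

class MaxWordSketch {
    // Combines the per-task maxima into a single global maximum.
    static LocalMax merge(List<LocalMax> partials) {
        LocalMax global = new LocalMax();
        for (LocalMax p : partials) {
            if (p.word() != null) {
                global.accept(p.word(), p.count());
            }
        }
        return global;
    }

    static String run() {
        // Two independent tasks, each seeing a different subset of the keys.
        LocalMax taskA = new LocalMax();
        taskA.accept("five", 5);
        taskA.accept("one", 1);
        taskA.accept("six", 18);

        LocalMax taskB = new LocalMax();
        taskB.accept("four", 4);
        taskB.accept("seven", 13);
        taskB.accept("three", 3);

        LocalMax global = merge(Arrays.asList(taskA, taskB));
        return global.word() + " " + global.count();
    }

    public static void main(String[] args) {
        System.out.println(run()); // six 18
    }
}
```

In a real job the merge step would have to happen somewhere all partial results meet, for example in a single reducer or in a follow-up pass over the output files. Which of these fits is a design decision the sketch leaves open.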
