MapReduce output with ArrayWritable



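I'm trying to get output from an ArrayWritable in a simple MapReduce job. I found a few similar questions, but I couldn't solve the problem in my own code, so I'm looking forward to your help. Thanks :)!

Input: a text file containing some sentences.

The output should be: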

<Word, <length, number of same words in Textfile>>
 Example: Hello  5  2 

The output I get from my job is:

hello WordLength_V01$IntArrayWritable@221cf05
test WordLength_V01$IntArrayWritable@799e525a

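I think the problem is in the IntArrayWritable subclass, but I can't work out the right correction. By the way, we are running Hadoop 2.5, and I use the following code to get this result:

Main method: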

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word length V1");
    // Set Classes
    job.setJarByClass(WordLength_V01.class);
    job.setMapperClass(MyMapper.class);
    // job.setCombinerClass(MyReducer.class);
    job.setReducerClass(MyReducer.class);
    // Set Output and Input Parameters
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntArrayWritable.class);
    // Number of Reducers
    job.setNumReduceTasks(1);
    // Set FileDestination
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Mapper:

public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    // Initialize Variables
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Map Method
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Use Tokenizer
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Select each word
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            // Output Pair
            context.write(word, one);
        }
    }
}

Reducer:

public static class MyReducer extends Reducer<Text, IntWritable, Text, IntArrayWritable> {
    // Initialize Variables
    private IntWritable count = new IntWritable();
    private IntWritable length = new IntWritable();
    // Reduce Method
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Count Words
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        count.set(sum);
        // Wordlength
        length.set(key.getLength());
        // Define Output
        IntWritable[] temp = new IntWritable[2];
        IntArrayWritable output = new IntArrayWritable(temp);
        temp[0] = count;
        temp[1] = length;
        // Output
        output.set(temp);
        context.write(key, new IntArrayWritable(output.get()));
    }
}

Subclass:

public static class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(IntWritable[] intWritables) {
        super(IntWritable.class);
    }
    @Override
    public IntWritable[] get() {
        return (IntWritable[]) super.get();
    }
    @Override
    public void write(DataOutput arg0) throws IOException {
        for(IntWritable data : get()){
            data.write(arg0);
        }
    }
}   

I consulted the following links while looking for a solution:

  • Writable interface (hadoop.apache.org)
  • Class ArrayWritable (hadoop.apache.org)
  • stackoverflow.com (1)
  • stackoverflow.com (2)

I'd really appreciate your ideas!

--------Solution--------

New subclass:

public static class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable(IntWritable[] values) {
        super(IntWritable.class, values);
    }
    @Override
    public IntWritable[] get() {
        return (IntWritable[]) super.get();
    }
    @Override
    public String toString() {
        IntWritable[] values = get();
        return values[0].toString() + ", " + values[1].toString();
    }
}
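
A side note on why dropping the custom write() helps: as far as I know, ArrayWritable's own write() emits the element count before the elements, and readFields() reads that count first. An override that skips the count produces bytes that readFields() cannot decode. The sketch below mimics that layout with plain java.io streams. LengthPrefixedInts and its methods are hypothetical names for illustration, not Hadoop code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a length-prefixed array layout; not Hadoop code.
class LengthPrefixedInts {
    // Write the element count first, then each element.
    static byte[] write(int[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the count back, then exactly that many elements.
    static int[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] restored = read(write(new int[] {2, 5}));
        System.out.println(restored[0] + ", " + restored[1]); // prints "2, 5"
    }
}
```

If the type is ever read back by Hadoop (for example as a map output value or from a SequenceFile), the subclass would presumably also need a no-argument constructor that calls super(IntWritable.class), since Writables are created by reflection before readFields() is called.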

New reduce method:

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    // Count Words
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    count.set(sum);
    // Wordlength
    length.set(key.getLength());
    // Define Output
    IntWritable[] temp = new IntWritable[2];
    temp[0] = count;
    temp[1] = length;
    context.write(key, new IntArrayWritable(temp));
}
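
To check what the job should produce without a cluster, the same tokenizing and summing can be sketched in plain Java. WordLengthLocal and summarize are hypothetical names, and the sketch assumes the default tab separator between key and value. It uses String.length(), which matches Text.getLength() only for ASCII words, because getLength() counts UTF-8 bytes.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local check; mirrors the mapper's tokenizing and the reducer's sum.
class WordLengthLocal {
    // Returns one line per distinct word: the word, a tab, then "count, length".
    static String summarize(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String word = e.getKey();
            out.append(word).append('\t')
               .append(e.getValue()).append(", ").append(word.length())
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints "hello<TAB>2, 5" and then "test<TAB>1, 4".
        System.out.print(summarize("hello test hello"));
    }
}
```

Note that this prints the count before the length, because the reduce method stores count in temp[0] and length in temp[1]. The question asked for the length first, so swapping those two assignments would give that order.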

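Everything else looks fine. You only need one more method in the subclass, printStrings(), which returns a single string instead of an array. The built-in toStrings() returns an array of strings, and the toString() inherited from Object prints the class name and a hash code, which is why the output shows an address instead of the values. For the job output to pick it up, the method has to be toString() itself or be called from it.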

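public String printStrings() {
    // Use get(): the values field of ArrayWritable is private
    // and cannot be accessed from the subclass.
    StringBuilder strings = new StringBuilder();
    for (IntWritable value : get()) {
        strings.append(" ").append(value.toString());
    }
    return strings.toString();
}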
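
To see the address-versus-values difference in isolation, here is a minimal sketch without Hadoop. PlainPair, PrintablePair and ToStringDemo are hypothetical classes made up for this illustration. As far as I know, the text output format writes a non-Text value by calling toString() on it, so the same effect applies to IntArrayWritable.

```java
// Hypothetical classes for illustration; none of them is part of Hadoop.
class PlainPair {
    final int count;
    final int length;

    PlainPair(int count, int length) {
        this.count = count;
        this.length = length;
    }
    // No toString() override here, so Object.toString() is inherited.
}

class PrintablePair extends PlainPair {
    PrintablePair(int count, int length) {
        super(count, length);
    }

    @Override
    public String toString() {
        return count + ", " + length;
    }
}

class ToStringDemo {
    public static void main(String[] args) {
        // Prints the class name, '@' and a hex hash code that varies per run.
        System.out.println(new PlainPair(2, 5));
        // Prints "2, 5".
        System.out.println(new PrintablePair(2, 5));
    }
}
```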
