MapReduce program to show the intersection of two files

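I need a MapReduce program that takes two files as input and outputs the set of words that appear in both files (the intersection of the two files).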

Here is what I have tried:

Map function: takes a file as input and emits (word, 1) as output. I get this output in a file named part-r-00000. I ran this step on both files, so I now have two files (two part-r-00000 files).

How can I provide these files as input to the Reduce function?

Please also give me some suggestions for writing a reduce function that computes the intersection of the two files.

This is the word count example program:

    package org.apache.hadoop.examples;

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    //import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;

    public class WordCountMap {

      // Emits (word, 1) for every whitespace-separated token in the input line.
      public static class TokenizerMapper
           extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // The reducer is intentionally commented out.
      /* public static class IntSumReducer
           extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      } */

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: wordcount <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "word count");
        // Refer to this class (the original pointed at WordCount.class).
        job.setJarByClass(WordCountMap.class);
        job.setMapperClass(TokenizerMapper.class);
        // job.setCombinerClass(IntSumReducer.class);
        // job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The Reducer class is commented out, and so is every line that refers to it. But I still get a part-r-00000 file. The output is:

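sea 1 this 1 a 1 is 1 is 1 check 1 example 1 example 1 example 1 fair 1 file 1 ganesh 1 hadoop 1 how 1 hpw 1 for 1 for 1 for 1 map 1 not 1 only 1 program 1 . reduce 1 so 1 this 1 this 1 to 1 you 1 you 1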

You should call job.setNumReduceTasks(0); in your driver code. With that setting, part-r-00000 is not created.

I have tested it this way. With job.setNumReduceTasks(0); and no Reducer logic, part-m-00000 is generated. Without job.setNumReduceTasks(0); and with no Reducer logic, part-r-00000 is generated.
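The part-r-00000 file still appears because, when no reducer class is set, Hadoop falls back to the base Reducer, whose default reduce() writes every value through unchanged, and a job runs with one reduce task unless told otherwise. That is also why the output shows repeated keys such as "example 1" three times instead of a sum. The plain-Java sketch below only models that behaviour, without any Hadoop classes; IdentityReduceDemo and its method names are hypothetical, and the TreeMap is a stand-in for the shuffle's sort-by-key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IdentityReduceDemo {
    // Stand-in for the shuffle: group the mapper's (word, 1) pairs by key, sorted by key.
    static Map<String, List<Integer>> shuffle(List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Models the default (identity) reduce: each value is written out as-is.
    static List<String> identityReduce(Map<String, List<Integer>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            for (int v : e.getValue()) {
                out.add(e.getKey() + "\t" + v);
            }
        }
        return out;
    }

    // Models the commented-out IntSumReducer: values are summed per key.
    static List<String> sumReduce(Map<String, List<Integer>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            out.add(e.getKey() + "\t" + sum);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped =
            shuffle(Arrays.asList("example", "a", "example"));
        System.out.println(identityReduce(grouped)); // repeated keys, each with 1
        System.out.println(sumReduce(grouped));      // one line per key with its count
    }
}
```

With job.setNumReduceTasks(0) neither of these reduce steps runs at all, and the mapper output goes straight to part-m-00000.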

Add that line to your driver and run the job again to confirm.
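As for the intersection itself, one possible approach (an assumption on my part, not something tested above) is to run a single job over both inputs instead of two separate map-only jobs. FileInputFormat.addInputPath can be called once per file, the mapper can emit (word, source file name) instead of (word, 1), and the reducer can keep only the words whose values include both file names. The sketch below models just that map and reduce logic in plain Java so it runs without a cluster; IntersectionDemo, mapPhase, reducePhase, and the "fileA"/"fileB" tags are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;

class IntersectionDemo {
    // Map step: emit (word, fileName) for each token. In a real mapper the file
    // name could come from ((FileSplit) context.getInputSplit()).getPath().getName().
    // The map of sets mimics the shuffle grouping values by key.
    static void mapPhase(String fileName, String contents,
                         Map<String, Set<String>> grouped) {
        StringTokenizer itr = new StringTokenizer(contents);
        while (itr.hasMoreTokens()) {
            grouped.computeIfAbsent(itr.nextToken(), k -> new TreeSet<>())
                   .add(fileName);
        }
    }

    // Reduce step: a word belongs to the intersection when the distinct
    // file names seen for it cover every input file.
    static List<String> reducePhase(Map<String, Set<String>> grouped, int fileCount) {
        List<String> common = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            if (e.getValue().size() == fileCount) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    static List<String> intersect(String contentsA, String contentsB) {
        Map<String, Set<String>> grouped = new TreeMap<>();
        mapPhase("fileA", contentsA, grouped);
        mapPhase("fileB", contentsB, grouped);
        return reducePhase(grouped, 2);
    }

    public static void main(String[] args) {
        // prints [example, hadoop, reduce]
        System.out.println(intersect("hadoop map reduce example",
                                     "reduce example program hadoop"));
    }
}
```

In an actual Reducer the values arrive as an Iterable, possibly with the same file name repeated, so collecting them into a set before checking its size plays the role of the TreeSet here.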
