Hadoop MapReduce not processing / output is wrong

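I keep getting odd output from Hadoop, or it does not process the MapReduce job at all. Even when the job succeeds, the output is wrong, although the code looks correct to me. What I am trying to do is parse a string and count its length: the data is joined together into one big string (like customerID;date;jobdescription;associations and so on), and I want to parse out every fourth ";"-delimited field.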

Here is my code:

Mapper:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TwitterMapper extends Mapper<Object, Text, IntWritable, IntWritable> { 
    //private final IntWritable one = new IntWritable(1);
   // private Text  = new Text();
    private final IntWritable one = new IntWritable(1);
    private final IntWritable length = new IntWritable();

    public void map(Object key, Text value, Context context) 
                     throws IOException, InterruptedException {
      // Format per tweet is id;date;hashtags;tweet;
      String dump = value.toString();
      int startIndex = 1;
      if(StringUtils.ordinalIndexOf(dump, ";", 4) > -1){
          startIndex = StringUtils.ordinalIndexOf(dump,";",3) + 1;
          String tweet = dump.substring(startIndex,dump.lastIndexOf(';'));
          //data.set(tweet.length());
          one.set(tweet.length());
          context.write(one,length);
          //context.write(dump,length);
          //length.set(); 
      }
   }
}

Reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TwitterReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum = sum + value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

The output I get looks like this:

4 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 0

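The output I expect is the number of characters in every fourth part of the string (for each customer), which should vary a lot, whereas the actual output is nearly the same on every line. So something like this: 1 201922 192923 23890

etc.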

I think the problem is that you are overriding the map and reduce methods incorrectly. The correct signatures of these methods are:

public void map(Object key, Text value, Context context) throws IOException, InterruptedException
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException

Because of the incorrect override, your methods (map, reduce) are never even called.
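To illustrate how a signature mismatch can leave a method silently unused, here is a minimal sketch. It uses a hypothetical stand-in base class, MiniMapper, instead of Hadoop's real Mapper so that it runs without a cluster; all class and method names in it are assumptions made for illustration. A method whose parameter types differ from the base method is an overload rather than an override, so a caller that only knows the base signature keeps reaching the default implementation. Annotating the method with @Override would turn that mistake into a compile error.

```java
public class OverrideDemo {
    // Hypothetical stand-in for a framework base class such as Hadoop's Mapper.
    static class MiniMapper {
        // Default behaviour: pass the value through unchanged.
        public String map(Object key, String value) {
            return "identity:" + value;
        }
    }

    // The key parameter is String instead of Object, so this is an overload.
    // A caller using map(Object, String) never reaches it.
    static class OverloadedMapper extends MiniMapper {
        public String map(String key, String value) {
            return "custom:" + value.length();
        }
    }

    // Same signature as the base method; @Override makes the compiler verify it.
    static class OverridingMapper extends MiniMapper {
        @Override
        public String map(Object key, String value) {
            return "custom:" + value.length();
        }
    }

    // The "framework" only knows the base type and the base signature.
    static String run(MiniMapper mapper, Object key, String value) {
        return mapper.map(key, value);
    }

    public static void main(String[] args) {
        System.out.println(run(new OverloadedMapper(), "k", "tweet")); // identity:tweet
        System.out.println(run(new OverridingMapper(), "k", "tweet")); // custom:5
    }
}
```

In a real job, putting @Override on both map and reduce is a cheap way to have the compiler check the signatures against Mapper and Reducer.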

I also found a few other mistakes:

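In the map method you never set length before calling context.write, so the value in the mapper output is zero for every input.
If you want to write pairs of IntWritable to the output, the reducer should extend Reducer<IntWritable, IntWritable, IntWritable, IntWritable>.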
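A minimal sketch of the first fix, written as plain Java so that it runs without Hadoop. The helper mirrors the substring logic from the question, the tweet length is used as the key and 1 as the value, and a TreeMap stands in for the shuffle and reduce steps. The names LengthHistogram, nthIndexOf, tweetLength and histogram are hypothetical, and reading the intended result as "number of tweets per length" is an assumption based on the expected output in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LengthHistogram {
    // Index of the n-th occurrence of ch in s, or -1 if there are fewer than n.
    // Plain-Java stand-in for StringUtils.ordinalIndexOf with a one-character search.
    static int nthIndexOf(String s, char ch, int n) {
        int idx = -1;
        for (int i = 0; i < n; i++) {
            idx = s.indexOf(ch, idx + 1);
            if (idx < 0) {
                return -1;
            }
        }
        return idx;
    }

    // Length of the text between the third ';' and the last ';',
    // or -1 if the line contains fewer than four ';' characters.
    static int tweetLength(String line) {
        if (nthIndexOf(line, ';', 4) < 0) {
            return -1;
        }
        int start = nthIndexOf(line, ';', 3) + 1;
        return line.substring(start, line.lastIndexOf(';')).length();
    }

    // Map step emits (length, 1); the merge call plays the reducer and sums the ones.
    static Map<Integer, Integer> histogram(List<String> lines) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            int len = tweetLength(line);
            if (len >= 0) {
                counts.merge(len, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1;2016-01-01;#a;hello;",
            "2;2016-01-02;#b;world;",
            "3;2016-01-03;#c;hi;",
            "malformed line");
        System.out.println(histogram(lines)); // {2=1, 5=2}
    }
}
```

In the actual mapper the equivalent change would be to call length.set(tweet.length()) and then context.write(length, one), leaving one at its initial value of 1.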

What your program is doing right now:

It converts the input line "customerID;date;jobdescription;associations;" into the pair ("associations".length(), 0).
It sums all the zero values in the reducer and writes the pair ("associations".length(), 0) to the output.
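The two steps above can be traced with a small plain-Java simulation. It is only a sketch: the class ZeroSumDemo and its reduce method are hypothetical, and a TreeMap stands in for Hadoop's grouping by key. Each mapper output is modelled as an int pair whose value is always 0, so any sum over those values is 0 as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ZeroSumDemo {
    // Groups (key, value) pairs by key and sums the values,
    // in the same way the reducer in the question does.
    static Map<Integer, Integer> reduce(List<int[]> pairs) {
        Map<Integer, Integer> sums = new TreeMap<>();
        for (int[] pair : pairs) {
            sums.merge(pair[0], pair[1], Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        int keyLength = "associations".length(); // 12
        List<int[]> mapperOutput = new ArrayList<>();
        // Three identical input lines: the key is the field length, the value stays 0.
        for (int i = 0; i < 3; i++) {
            mapperOutput.add(new int[] {keyLength, 0});
        }
        System.out.println(reduce(mapperOutput)); // {12=0}
    }
}
```

However many lines share a length, the summed value stays 0, which matches the zeros in the second column of the output shown in the question.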
