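How to effectively reduce the length of the input to the mapper

My data has 20 fields in its schema. Only the first three fields matter for my MapReduce program. How can I reduce the size of the mapper's input so that it receives only the first three fields?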

1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper to process it as offset and value.

Note that I can't use Pig, because some of the other logic is already implemented in MapReduce.

You need a custom RecordReader to do this:

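import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;

public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {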
   private LineRecordReader lineReader;
   private LongWritable lineKey;
   private Text lineValue;
   public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
      lineReader = new LineRecordReader(job, split);
      lineKey = lineReader.createKey();
      lineValue = lineReader.createValue();
   }
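   public boolean next(LongWritable key, Text value) throws IOException {
      // Read the next full line into the internal buffers.
      if (!lineReader.next(lineKey, lineValue)) {
          return false;
      }
      String[] fields = lineValue.toString().split(",");
      if (fields.length < 3) {
          throw new IOException("Invalid record received");
      }
      // Copy the byte offset into the caller's key so the mapper receives it.
      key.set(lineKey.get());
      // Keep only the first three fields in the value.
      value.set(fields[0] + "," + fields[1] + "," + fields[2]);
      return true;
   }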
   public LongWritable createKey() {
      return lineReader.createKey();
   }
   public Text createValue() {
      return lineReader.createValue();
   }
   public long getPos() throws IOException {
      return lineReader.getPos();
   }
   public void close() throws IOException {
      lineReader.close();
   }
   public float getProgress() throws IOException {
      return lineReader.getProgress();
   }
}

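It should be fairly self-explanatory: it is just a wrapper around LineRecordReader. Unfortunately, to invoke it you also need to extend InputFormat. The following is sufficient: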

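import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {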
   public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
     JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
   }
}

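Just don't forget to set it in your driver, e.g. by calling conf.setInputFormat(TrimmedTextInputFormat.class) on the JobConf.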
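
If you want to check the trimming step of next() outside Hadoop, here is a minimal plain-Java sketch. The names FieldTrimmer and firstFields are made up for illustration and are not part of Hadoop or of the code above. It also takes a slightly different approach, offered only as an optional variant: rather than splitting all 20 fields, it scans for the third comma and cuts the line there. It returns null where the reader above would throw.

```java
class FieldTrimmer {
    // Hypothetical helper: returns the first `count` comma-separated fields
    // of `line` (joined by commas), or null if the line has fewer fields.
    static String firstFields(String line, int count) {
        if (count < 1) {
            throw new IllegalArgumentException("count must be at least 1");
        }
        int end = -1;
        for (int i = 0; i < count; i++) {
            end = line.indexOf(',', end + 1);
            if (end < 0) {
                // No further comma. If this was the last field we needed,
                // the whole line is exactly `count` fields; otherwise it is short.
                return (i == count - 1) ? line : null;
            }
        }
        // `end` is the index of the comma that follows field number `count`.
        return line.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstFields("1,2,3,4,5,6,7,8", 3));
    }
}
```

Inside the record reader you could call something like this on lineValue.toString() and throw the IOException when it returns null. Keep in mind that either way the whole line is still read from the split; only what gets handed to the mapper shrinks.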

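You can implement a custom input format in MapReduce that reads only the required fields.

For reference, the following blog post explains how to read text as paragraphs: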

http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading
