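My data has 20 fields in its schema. Only the first three fields matter for my MapReduce program. How can I reduce the mapper's input size so that it receives only the first three fields?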
1,2,3,4,5,6,7,8...20 columns in schema.
I want only 1,2,3 in the mapper to process it as offset and value.
Note that I can't use Pig, because some of the other logic is already implemented in MapReduce.
You need a custom RecordReader to do this:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TrimmedRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lineReader;
    // Internal buffers that receive the full, untrimmed line.
    private LongWritable lineKey;
    private Text lineValue;

    public TrimmedRecordReader(JobConf job, FileSplit split) throws IOException {
        lineReader = new LineRecordReader(job, split);
        lineKey = lineReader.createKey();
        lineValue = lineReader.createValue();
    }

    public boolean next(LongWritable key, Text value) throws IOException {
        if (!lineReader.next(lineKey, lineValue)) {
            return false; // end of the split
        }
        String[] fields = lineValue.toString().split(",");
        if (fields.length < 3) {
            throw new IOException("Invalid record received");
        }
        // Pass the line's byte offset through as the key.
        key.set(lineKey.get());
        // Hand the mapper only the first three fields.
        value.set(fields[0] + "," + fields[1] + "," + fields[2]);
        return true;
    }

    public LongWritable createKey() {
        return lineReader.createKey();
    }

    public Text createValue() {
        return lineReader.createValue();
    }

    public long getPos() throws IOException {
        return lineReader.getPos();
    }

    public void close() throws IOException {
        lineReader.close();
    }

    public float getProgress() throws IOException {
        return lineReader.getProgress();
    }
}
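One caveat: the reader still reads every full line from disk, so the trimming only shrinks what reaches the mapper. If splitting all 20 columns per record turns out to be costly, the trimming step could scan for the third delimiter instead. Below is a minimal sketch of that idea in plain Java. The class FieldTrimmer and the method firstFields are hypothetical names, not part of Hadoop, and unlike split(",") this version counts a trailing empty field as a field.

```java
public class FieldTrimmer {
    /**
     * Returns the first n delimiter-separated fields of line, still joined
     * by the delimiter, or null if the line has fewer than n fields.
     * (Hypothetical helper for illustration; not a Hadoop API.)
     */
    public static String firstFields(String line, char delimiter, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int end = -1;
        for (int i = 0; i < n; i++) {
            // Find the delimiter that terminates field i.
            end = line.indexOf(delimiter, end + 1);
            if (end < 0) {
                // No more delimiters: fine only if this was the n-th field.
                return (i == n - 1) ? line : null;
            }
        }
        // Cut just before the delimiter that follows the n-th field.
        return line.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(firstFields("1,2,3,4,5", ',', 3)); // prints 1,2,3
    }
}
```

Inside next(), the call would be something like firstFields(lineValue.toString(), ',', 3), with a null result treated as an invalid record.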
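TrimmedRecordReader should be fairly self-explanatory; it is just a wrapper around LineRecordReader. Unfortunately, to get it invoked you also need to extend InputFormat. The following is enough: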
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class TrimmedTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit input,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(input.toString());
        return new TrimmedRecordReader(job, (FileSplit) input);
    }
}
Just don't forget to set it as the input format in your driver.
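For the driver side, a rough sketch using the old org.apache.hadoop.mapred API could look like the following. The class name TrimmedJobDriver, the job name, the use of IdentityMapper as a placeholder mapper, and the two path arguments are all assumptions for illustration; the only line that matters here is the setInputFormat call.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Hypothetical driver; needs the Hadoop libraries and the
// TrimmedTextInputFormat class above on the classpath.
public class TrimmedJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TrimmedJobDriver.class);
        conf.setJobName("trimmed-fields"); // assumed job name

        // Plug in the custom input format so each mapper only sees
        // the first three fields of every line.
        conf.setInputFormat(TrimmedTextInputFormat.class);

        // Placeholder: replace with your own mapper (and reducer).
        conf.setMapperClass(IdentityMapper.class);
        conf.setNumReduceTasks(0);

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Assumed usage: args[0] = input path, args[1] = output path.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
```

With the identity mapper and no reducers, the output would simply be each line's offset followed by its first three fields, which is a quick way to check that the input format is wired in.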
You can implement a custom input format in MapReduce so that only the required fields are read.
For reference, the following blog post explains how to read text as paragraphs:
http://blog.minjar.com/post/54759039969/mapreduce-custom-input-formats-reading