How can I use just one mapper for multiple input files? Hadoop creates one mapper per file, but I want a single mapper to process all of the files.

I tried using CombineFileInputFormat. It does run a single mapper, but the map input value contains the data of only one file. I need the map input value to contain the data from all the files (in text format), like this:

Map input value:

Data from file1.txt
Data from file2.txt
Data from file3.txt
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileInputFormat extends CombineFileInputFormat<NullWritable, Text> {

    public WholeFileInputFormat() {
        super();
        setMaxSplitSize(67108864); // 64 MB: files are packed into splits up to this size
    }

    // Never split an individual file; each file is read whole.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        if (!(split instanceof CombineFileSplit)) {
            throw new IllegalArgumentException("split must be a CombineFileSplit");
        }
        // CombineFileRecordReader delegates to one WholeFileRecordReader per file in the split.
        return new CombineFileRecordReader<NullWritable, Text>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {

    private final Path mFileToRead;   // the single file this reader is responsible for
    private final long mFileLength;
    private final Configuration mConf;
    private final Text mFileText;
    private boolean mProcessed;       // true once the file has been emitted as a record

    public WholeFileRecordReader(CombineFileSplit fileSplit, TaskAttemptContext context,
            Integer pathToProcess) throws IOException {
        mProcessed = false;
        mFileToRead = fileSplit.getPath(pathToProcess);
        mFileLength = fileSplit.getLength(pathToProcess);
        mConf = context.getConfiguration();

        assert 0 == fileSplit.getOffset(pathToProcess);
        FileSystem fs = FileSystem.get(mConf);
        assert fs.getFileStatus(mFileToRead).getLen() == mFileLength;

        mFileText = new Text();
    }

    @Override
    public void close() throws IOException {
        mFileText.clear();
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return mFileText;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return mProcessed ? 1.0f : 0.0f;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // no-op: all state is set up in the constructor.
    }

    // Emits exactly one record: the entire contents of the file as the value.
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!mProcessed) {
            if (mFileLength > (long) Integer.MAX_VALUE) {
                throw new IOException("File is longer than Integer.MAX_VALUE.");
            }
            byte[] contents = new byte[(int) mFileLength];

            FileSystem fs = mFileToRead.getFileSystem(mConf);
            FSDataInputStream in = null;
            try {
                // Read the whole file into the value.
                in = fs.open(mFileToRead);
                IOUtils.readFully(in, contents, 0, contents.length);
                mFileText.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            mProcessed = true;
            return true;
        }
        return false;
    }
}
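For completeness, here is a minimal sketch of how I submit the job (the pass-through mapper and the input/output paths are placeholders; this assumes the Hadoop 2 org.apache.hadoop.mapreduce API):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WholeFileDriver {

    // Placeholder mapper: passes each whole-file value straight through.
    public static class MyMapper
            extends Mapper<NullWritable, Text, NullWritable, Text> {
        @Override
        protected void map(NullWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "whole-file-job");
        job.setJarByClass(WholeFileDriver.class);

        // Wire in the combining input format so small files share a split.
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only for this sketch

        FileInputFormat.addInputPath(job, new Path(args[0]));   // placeholder input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // placeholder output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}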
Could you help me?
The number of mappers is driven not by the number of files but by the number of blocks that make up those files: Hadoop splits each file into blocks and creates one mapper per block (with the default 64 MB block size, for example, a 256 MB file yields four map tasks). Take a look at links like this one to learn more about how Hadoop chooses the number of mappers and reducers.
If you really want a single mapper, I have to say that setting mapred.map.tasks will not work, because it is only a hint to Hadoop, not a mandatory parameter. You could try increasing the block size to a very high number...
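If you want to experiment anyway, one sketch (the 10 GB figure is arbitrary; it only needs to exceed your total input size) is to raise the maximum combined split size in your WholeFileInputFormat constructor, rather than the HDFS block size, so that all the files get packed into a single split and therefore a single map task:

public WholeFileInputFormat() {
    super();
    // Arbitrary 10 GB ceiling: anything larger than the total input size
    // lets CombineFileInputFormat pack every file into a single split.
    setMaxSplitSize(10L * 1024 * 1024 * 1024);
}

Note that even with a single split, CombineFileRecordReader still invokes the mapper's map() once per file: each call's value is one whole file, not all files concatenated.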
In any case, using a single mapper with Hadoop makes little sense... you lose the distributed processing of your data, which is one of the main strengths of this kind of system.