Hadoop: how to send many input files to a single map

How can I use only one map for multiple input files? Hadoop creates one mapper per file, but I need a single mapper that processes all of the files.

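I tried using CombineFileInputFormat. It does run a single mapper, but each map input contains only one file. I need the map input value to contain the data from all of the files (in text format), like this: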

Map input value:

data from file1.txt
data from file2.txt
data from file3.txt

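import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileInputFormat extends CombineFileInputFormat<NullWritable, Text> {

    public WholeFileInputFormat() {
        super();
        // Upper bound for one combined split: 64 MB.
        setMaxSplitSize(67108864);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Each file must be read as a whole, never cut into pieces.
        return false;
    }

    @Override
    public RecordReader<NullWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        if (!(split instanceof CombineFileSplit)) {
            throw new IllegalArgumentException("split must be a CombineFileSplit");
        }
        // CombineFileRecordReader creates one WholeFileRecordReader per file in the split.
        return new CombineFileRecordReader<NullWritable, Text>(
                (CombineFileSplit) split, context, WholeFileRecordReader.class);
    }
}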

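import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, Text> {

    private final Path mFileToRead;
    private final long mFileLength;
    private final Configuration mConf;
    private final Text mFileText;
    private boolean mProcessed;

    // Constructor signature required by CombineFileRecordReader;
    // pathToProcess is the index of this reader's file within the split.
    public WholeFileRecordReader(CombineFileSplit fileSplit, TaskAttemptContext context,
                                 Integer pathToProcess) throws IOException {
        mProcessed = false;
        mFileToRead = fileSplit.getPath(pathToProcess);
        mFileLength = fileSplit.getLength(pathToProcess);
        mConf = context.getConfiguration();
        assert 0 == fileSplit.getOffset(pathToProcess);
        // Resolve the file system from the path so non-default file systems also work.
        FileSystem fs = mFileToRead.getFileSystem(mConf);
        assert fs.getFileStatus(mFileToRead).getLen() == mFileLength;
        mFileText = new Text();
    }

    @Override
    public void close() throws IOException {
        mFileText.clear();
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return mFileText;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return mProcessed ? 1.0f : 0.0f;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // no-op: everything is set up in the constructor.
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!mProcessed) {
            if (mFileLength > (long) Integer.MAX_VALUE) {
                throw new IOException("File is longer than Integer.MAX_VALUE.");
            }
            byte[] contents = new byte[(int) mFileLength];
            FileSystem fs = mFileToRead.getFileSystem(mConf);
            FSDataInputStream in = null;
            try {
                // Read the whole file and expose it as the single record value.
                in = fs.open(mFileToRead);
                IOUtils.readFully(in, contents, 0, contents.length);
                mFileText.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            mProcessed = true;
            return true;
        }
        return false;
    }
}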

Can you help me?

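The number of mappers is driven not by the number of files but by the number of blocks (more precisely, input splits, which by default follow block boundaries) that hold those files; Hadoop splits each file into blocks and creates one mapper for each. Take a look at this SO link to learn more about how Hadoop chooses the number of mappers and reducers.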
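
To make the block-driven count concrete, here is a minimal plain-Java sketch of the split arithmetic, assuming the rule FileInputFormat applies to splittable files: split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to run about 10% over. The class and method names (SplitEstimator, splitsForFile, totalSplits) and the file sizes are made up for illustration; this is an estimate, not Hadoop's code.

```java
import java.util.List;

/** Rough estimate of how many input splits (and so map tasks) a set of files yields. */
final class SplitEstimator {
    // Assumed slack: the final split of a file may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    /** Split size rule: max(minSize, min(maxSize, blockSize)). */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one splittable file (length assumed positive). */
    static int splitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        return splits + 1; // whatever is left becomes the last split
    }

    /** Splits are computed file by file, so the total is a plain sum. */
    static int totalSplits(List<Long> fileLengths, long splitSize) {
        int total = 0;
        for (long length : fileLengths) {
            total += splitsForFile(length, splitSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> files = List.of(10 * mb, 200 * mb, 130 * mb); // hypothetical file sizes

        long normal = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println("128 MB blocks: " + totalSplits(files, normal) + " splits");

        long huge = computeSplitSize(1024L * 1024L * mb, 1L, Long.MAX_VALUE);
        System.out.println("1 TB blocks:   " + totalSplits(files, huge) + " splits");
    }
}
```

In this model the estimate never drops below the number of files, however large the block size, because each file is split on its own.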

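If you really want a single mapper, I have to say that setting the mapred.map.tasks parameter will not work, because it is only a hint to Hadoop, not a mandatory setting. You could try increasing the block size to a very high number...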
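
A larger block size only reduces the number of splits per file. Putting several files into one split is the job of a combining input format such as the CombineFileInputFormat used in the question. The sketch below is a deliberately simplified model of that idea, not the algorithm Hadoop implements: it ignores node and rack locality, packs whole files greedily, and assumes a limit of 0 means "no limit". The names CombinePacking and pack are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model: pack whole files into combined splits of bounded size. */
final class CombinePacking {

    /**
     * Greedily groups file lengths into splits of at most maxSplitSize bytes.
     * A maxSplitSize of 0 is treated as "no limit" (an assumption of this model).
     */
    static List<List<Long>> pack(List<Long> fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : fileLengths) {
            boolean overflow = maxSplitSize > 0
                    && !current.isEmpty()
                    && currentSize + length > maxSplitSize;
            if (overflow) {
                // Close the current split and start a new one for this file.
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            currentSize += length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        List<Long> files = List.of(20 * mb, 30 * mb, 40 * mb); // hypothetical file sizes

        // Same 64 MB cap as setMaxSplitSize(67108864) in the question.
        System.out.println("64 MB cap: " + pack(files, 64 * mb).size() + " splits");
        System.out.println("no cap:    " + pack(files, 0).size() + " splits");
    }
}
```

Under this model, one mapper receives every file only when the files fit within the configured maximum split size or when no maximum is set. Even then, each file arrives as its own record, so a mapper that needs all the data in one value has to accumulate it across map() calls.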

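In any case, using a single mapper with Hadoop makes little sense: you lose the distributed processing of the data, which is one of the main advantages of this kind of system.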
