Hadoop TextInputFormat reads only one line per file

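I wrote a simple map task for Hadoop 0.20.2. The input data set consists of 44 files, each about 3-5 MB. Every line of every file has the format int,int. The input format is the default TextInputFormat, and the mapper's job is to parse the input Text into integers.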

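After the job ran, the Hadoop framework statistics showed that the map task had only 44 input records. I tried debugging and found that the only records passed to the map method were the first line of each file.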

Does anyone know what the problem might be, or where I can find a solution?

Thanks in advance.

Edit 1

The input data was generated by another map-reduce job whose output format is TextOutputFormat<NullWritable, IntXInt>. The toString() method of IntXInt should produce a string of the form int,int.

Edit 2

My mapper looks like this:

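import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

static class MyMapper extends MapReduceBase
  implements Mapper<LongWritable, Text, IntWritable, IntWritable> {
  // key: byte offset of the line; value: the line itself ("int,int")
  public void map(LongWritable key,
                  Text value,
                  OutputCollector<IntWritable, IntWritable> output,
                  Reporter reporter) throws IOException {
    String[] s = value.toString().split(",");
    IntXInt x = new IntXInt(s[0], s[1]);
    output.collect(x.firstInt(), x.secondInt());
  }
}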
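To rule out the parsing step itself, the split-and-parse logic can be tried outside Hadoop. The following is only a minimal plain-Java sketch: IntPairLine and its parse method are hypothetical names, not part of the job above, and it assumes each line holds exactly two comma-separated integers.

```java
// Hypothetical helper for illustration; not part of the original job.
final class IntPairLine {
    private IntPairLine() {}

    /** Returns {first, second}; throws IllegalArgumentException on malformed input. */
    static int[] parse(String line) {
        // Trim first so stray whitespace or a trailing '\r' does not break parsing.
        String[] parts = line.trim().split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected int,int but got: " + line);
        }
        try {
            return new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            };
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("expected int,int but got: " + line, e);
        }
    }

    public static void main(String[] args) {
        int[] pair = parse("12,34");
        System.out.println(pair[0] + " " + pair[1]); // prints "12 34"
    }
}
```

If a helper like this behaves as expected on sample lines from the input files, the parsing is probably not what limits the record count.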

Edit 3

I just checked: the mapper really does read only one line per file. It is not receiving the whole file as a single Text value.

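The InputFormat defines how data is read from a file into the Mapper instances. The default TextInputFormat reads text files line by line. The key it emits for each record is the byte offset of the line that was read (as a LongWritable), and the value is the content of the line up to the terminating '\n' character (as a Text object). If you have multi-line records, each separated by a $ character, you should write your own InputFormat that parses the file into records split on that character.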
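To make the key/value description concrete, here is a small plain-Java sketch of the idea. It is not Hadoop's record reader. LineRecordSketch, Record and read are made-up names, and the sketch ignores details such as '\r\n' line endings and input splits. It only shows which (offset, text) pairs one would expect for a given delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; this is NOT Hadoop's record reader.
final class LineRecordSketch {
    /** One record: byte offset where the record starts (key) and its text (value). */
    static final class Record {
        final long offset;
        final String text;
        Record(long offset, String text) { this.offset = offset; this.text = text; }
    }

    /** Splits data into records, each ending at the given delimiter byte. */
    static List<Record> read(byte[] data, byte delimiter) {
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == delimiter) {
                records.add(new Record(start,
                        new String(data, start, i - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        if (start < data.length) { // last record without a trailing delimiter
            records.add(new Record(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "1,2\n30,40\n5,6\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : read(data, (byte) '\n')) {
            System.out.println(r.offset + "\t" + r.text);
        }
    }
}
```

With '\n' as the delimiter, the three-line sample yields three records at offsets 0, 4 and 10, so a file with many lines should produce many map input records. Passing (byte) '$' instead mimics the custom-delimiter case mentioned above.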

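I suspect your mapper is taking the whole text as input and printing the output. Could you show your Mapper class declaration and your map function declaration? For example: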

static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // do your mapping here
    }
}

I would like to know whether anything is different in this line.
