Using MapReduce to read files in a directory that match a specific pattern and output each file's name

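I am trying to read the files in a directory whose path is passed as an argument to a MapReduce program. The goal is to perform some computation on each file (for example, counting the occurrences of a particular word). In addition, the file names must match a pattern (for example, .java files). The program's output is the file name and the computed value.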

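So far I have managed to implement a very basic map program that reads the contents of the directory, without any pattern matching, and outputs each file's name together with a constant. The mapper code looks like this: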

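 import java.io.IOException;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;

 public class CCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
     private static final IntWritable complexityCount = new IntWritable(1);
     private final Text result = new Text();

     // Called once per input line; emits (file name, 1) each time.
     @Override
     public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
         String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
         result.set(fileName);
         context.write(result, complexityCount);
     }
 }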

The input directory contains 3 files: file1, file2 and file3. However, the program's output looks like this:

file1.txt   1
file1.txt   1
file1.txt   1
file1.txt   1
file1.txt   1
file1.txt   1
file1.txt   1
file2.txt   1
file2.txt   1
file2.txt   1
file2.txt   1
file3.txt   1

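How can I make the program output a single entry per file? Also, is there a way to read one file at a time, perform the computation on that file, and output the file name together with the result? How can I change the InputSplit so that it matches the size of each individual file?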

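As I understand it, your code reads the contents of every file, and the mapper is invoked once per line. file1.txt must have 7 lines, so the pair "file1.txt 1" is emitted once for each line. Likewise, file2.txt must have 4 lines and file3.txt must have 1 line.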
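
The behaviour described above can be reproduced without a cluster. What follows is a minimal sketch in plain Java, not Hadoop code: the class PerLineEmitDemo, its method names and the in-memory file contents are hypothetical stand-ins, chosen only so that the line counts match the ones inferred above (7, 4 and 1). It mimics a map step that is called once per line and shows why each file name is emitted as many times as the file has lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone simulation; no Hadoop classes are used.
class PerLineEmitDemo {

    // Mimics the mapper in the question: one (fileName, 1) pair per input line.
    static List<String> mapAllLines(Map<String, List<String>> files) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                // The line content is ignored, just as in the original map().
                emitted.add(file.getKey() + "\t1");
            }
        }
        return emitted;
    }

    // Assumed sample input with 7, 4 and 1 lines, matching the output in the question.
    static Map<String, List<String>> sampleFiles() {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("file1.txt", Arrays.asList("a", "b", "c", "d", "e", "f", "g"));
        files.put("file2.txt", Arrays.asList("a", "b", "c", "d"));
        files.put("file3.txt", Arrays.asList("a"));
        return files;
    }

    public static void main(String[] args) {
        // Prints one line per emitted pair.
        for (String pair : mapAllLines(sampleFiles())) {
            System.out.println(pair);
        }
    }
}
```

Because the emitted key is the same for every line of a file, the pairs only need to be grouped by key and combined to get one entry per file.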

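To output each file only once, you have to write a reduce function that sums the values for each key. With the mapper above, that sum is the number of lines in the file.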

  // Nested inside the driver class, hence "static".
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
          // Add up every value emitted for this file name.
          int sum = 0;
          for (IntWritable value : values) {
              sum += value.get();
          }
          // One output record per key, i.e. per file.
          context.write(key, new IntWritable(sum));
      }
  }
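
The same summing idea also covers the original goal of counting a particular word per file: if the map step emits the number of matches in each line instead of the constant 1, the per-key sum becomes the per-file count. The following is only a local sketch in plain Java under that assumption. PerFileWordCountDemo and its methods are hypothetical names, nothing here calls Hadoop, and the glob check merely illustrates the ".java files only" requirement. In a real job, as far as I know, Hadoop expands glob patterns in the input path itself, so a path ending in *.java may be enough.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local simulation of the map and reduce steps; no Hadoop classes are used.
class PerFileWordCountDemo {

    // Map-side helper: how many times `word` occurs as a whole token in one line.
    static int countInLine(String line, String word) {
        int count = 0;
        for (String token : line.split("\\W+")) {
            if (token.equals(word)) {
                count++;
            }
        }
        return count;
    }

    // Keeps only file names that match a glob such as "*.java".
    static boolean matches(String fileName, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    // Reduce-side logic: sum the per-line counts for each matching file name.
    static Map<String, Integer> countPerFile(Map<String, List<String>> files, String glob, String word) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            if (!matches(file.getKey(), glob)) {
                continue; // skip files outside the pattern
            }
            int sum = 0;
            for (String line : file.getValue()) {
                sum += countInLine(line, word);
            }
            totals.put(file.getKey(), sum);
        }
        return totals;
    }
}
```

Calling countPerFile with a map from file names to their lines, the glob "*.java" and a target word returns one entry per matching file, which is the shape of output the question asks for.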
