Does a MapReduce program consume all files in a folder (the input dataset) by default?

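Hi Stack Overflow,

I ran a MapReduce job to find the unique words in a file. The input dataset (a file) is in a folder on HDFS, so when I ran the MapReduce program I passed the folder name as the input.

I did not realize that there were two other files in the same folder. The MapReduce program went ahead and read all 3 files and produced output. The output is fine.

Is this the default behavior of MapReduce? That is, if you point to a folder rather than a single file (as the input dataset), does MapReduce consume all the files in that folder? The reason I am surprised is that there is no code in the mapper to read multiple files. I know that the first argument in the driver, args[0], is the folder name I supplied.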

Here is the driver code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class DataSort {
     public static void main(String[] args) throws Exception {
/*
 * Validate that two arguments were passed from the command line.
 */
if (args.length != 2) {
  System.out.printf("Usage: StubDriver <input dir> <output dir>\n");
  System.exit(-1);
}
Job job=Job.getInstance();
/*
 * Specify the jar file that contains your driver, mapper, and reducer.
 * Hadoop will transfer this jar file to nodes in your cluster running 
 * mapper and reducer tasks.
 */
job.setJarByClass(DataSort.class);
/*
 * Specify an easily-decipherable name for the job.
 * This job name will appear in reports and logs.
 */
job.setJobName("Data Sort");
/*
 * TODO implement
 */
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ValueIdentityMapper.class);
job.setReducerClass(IdentityReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
/*
 * Start the MapReduce job and wait for it to finish.
 * If it finishes successfully, return 0. If not, return 1.
 */
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
  }
}

Mapper code:

import java.io.IOException;  
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class ValueIdentityMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
 @Override
  public void map(LongWritable key, Text value, Context context)
  throws IOException, InterruptedException {
    String line=value.toString();
    for (String word:line.split("\\W+"))
    {
        if (word.length()>0)
        {
            context.write(new Text(word),new IntWritable(1));
        }
    }
 }

}

Reducer code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class IdentityReducer extends Reducer<Text, IntWritable, Text, Text>    {
 @Override
 public void reduce(Text key, Iterable<IntWritable> values, Context context)
  throws IOException, InterruptedException {
    String word="";
    context.write(key, new Text(word));
  }
 }

Is this the default behavior of MapReduce?

Not of MapReduce itself, just of the InputFormat you are using.

FileInputFormat API reference

setInputPaths(JobConf conf, Path... inputPaths)

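Set the array of Paths as the list of inputs for the map-reduce job.

Path API reference

Names a file or directory in a FileSystem.

So, when you say

there is no code to read multiple files

Yes, there is; you just did not need to write it.

A Mapper<LongWritable, Text, ...> correctly handles all the file offsets for all the files in the specified InputFormat.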
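To make the "you did not need to write it" point concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of the kind of expansion a file-based input format performs before any mapper runs. It is not Hadoop's actual implementation: the class name InputPathExpansion and the helpers expand and isVisible are hypothetical, it works on the local filesystem via java.nio rather than on HDFS, and it assumes non-recursive listing. The one detail taken from Hadoop is that file names starting with "_" or "." are skipped.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InputPathExpansion {

    // Hypothetical helper: a simplified stand-in for the step where an
    // input format turns the configured input paths into files to read.
    public static List<Path> expand(List<Path> inputPaths) throws IOException {
        List<Path> files = new ArrayList<>();
        for (Path input : inputPaths) {
            if (Files.isDirectory(input)) {
                // A directory contributes every visible regular file directly
                // inside it (assumption: subdirectories are not descended into).
                try (DirectoryStream<Path> children = Files.newDirectoryStream(input)) {
                    for (Path child : children) {
                        if (Files.isRegularFile(child) && isVisible(child)) {
                            files.add(child);
                        }
                    }
                }
            } else {
                // A path that names a single file contributes only that file.
                files.add(input);
            }
        }
        Collections.sort(files); // stable order for display
        return files;
    }

    // Names starting with "_" or "." are treated as hidden and skipped.
    public static boolean isVisible(Path p) {
        String name = p.getFileName().toString();
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway folder with three data files and one hidden marker.
        Path dir = Files.createTempDirectory("mr-input");
        Files.write(dir.resolve("words1.txt"), "a b".getBytes());
        Files.write(dir.resolve("words2.txt"), "c d".getBytes());
        Files.write(dir.resolve("words3.txt"), "e f".getBytes());
        Files.write(dir.resolve("_SUCCESS"), new byte[0]);

        // Pointing at the folder picks up all three data files.
        System.out.println(expand(Collections.singletonList(dir)));
        // Pointing at one file picks up only that file.
        System.out.println(expand(Collections.singletonList(dir.resolve("words1.txt"))));
    }
}
```

In the real job the same choice is made through args[0]: passing the path of a single file instead of the folder to FileInputFormat.setInputPaths should limit the input to that file.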
