How to chain Hadoop jobs without using Oozie



I want to create a chain of three Hadoop jobs, where the output of the first job is the input of the second, and so on. I would like to do this without using Oozie.

I have written the following code to achieve it:

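import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TfIdf {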
    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException
    {
        TfIdf tfIdf = new TfIdf();
        tfIdf.runWordCount();
        tfIdf.runDocWordCount();
        tfIdf.TFIDFComputation();
    }
    public void runWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();

        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Count calculation");
        job.setMapperClass(WordFrequencyMapper.class);
        job.setReducerClass(WordFrequencyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("ouput"));
        job.waitForCompletion(true);
    }
    public void runDocWordCount() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("Word Doc count calculation");
        job.setMapperClass(WordCountDocMapper.class);
        job.setReducerClass(WordCountDocReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job2"));
        job.waitForCompletion(true);
    }
    public void TFIDFComputation() throws IOException, InterruptedException, ClassNotFoundException
    {
        Job job = new Job();
        job.setJarByClass(TfIdf.class);
        job.setJobName("TFIDF calculation");
        job.setMapperClass(TFIDFMapper.class);
        job.setReducerClass(TFIDFReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("output_job2"));
        FileOutputFormat.setOutputPath(job, new Path("ouput_job3"));
        job.waitForCompletion(true);
    }
}

But I get this error:

Input path does not exist: hdfs://localhost.localdomain:8020/user/cloudera/output

Can anyone help me solve this?

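This answer comes a little late, but the problem is just a typo in your directory names. Your first job writes its output to the directory "ouput", while your second job looks for its input in "output", which is exactly the path the error reports as missing. The same misspelling recurs further down: the second job writes to "ouput_job2" but the third job reads from "output_job2", so that step would fail the same way once the first one is fixed.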
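One way to avoid this class of mistake is to type each directory name only once and hand it from one stage to the next. The following is a minimal sketch of that idea in plain Java, not part of the Hadoop API: `JobChain`, `Stage`, `then` and `run` are hypothetical names made up for illustration, and the stub stage in `main` stands in for a real job.

```java
import java.util.ArrayList;
import java.util.List;

public class JobChain {
    /** One step of the chain; returns true on success. */
    public interface Stage {
        boolean run(String inputDir, String outputDir) throws Exception;
    }

    private final List<Stage> stages = new ArrayList<>();
    private final List<String> outputDirs = new ArrayList<>();

    /** Registers a stage together with the directory it writes to. */
    public JobChain then(Stage stage, String outputDir) {
        stages.add(stage);
        outputDirs.add(outputDir);
        return this;
    }

    /**
     * Runs the stages in order. Each stage reads the directory the previous
     * stage wrote, so no directory name is typed twice.
     * Returns the index of the first failed stage, or -1 if all succeeded.
     */
    public int run(String firstInputDir) throws Exception {
        String input = firstInputDir;
        for (int i = 0; i < stages.size(); i++) {
            String output = outputDirs.get(i);
            if (!stages.get(i).run(input, output)) {
                return i; // stop here: later stages would have no input
            }
            input = output;
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // Stub stage standing in for a real Hadoop job in this sketch.
        Stage stub = (in, out) -> {
            System.out.println(in + " -> " + out);
            return true;
        };
        int failed = new JobChain()
                .then(stub, "output")
                .then(stub, "output_job2")
                .then(stub, "output_job3")
                .run("input");
        System.out.println(failed == -1 ? "all stages succeeded"
                                        : "stage " + failed + " failed");
    }
}
```

In the driver from the question, each stage body would presumably configure its `Job` as before, pass `new Path(inputDir)` to `FileInputFormat.setInputPaths` and `new Path(outputDir)` to `FileOutputFormat.setOutputPath`, and return the boolean result of `job.waitForCompletion(true)`. Checking that result also stops the chain as soon as one job fails, instead of letting the next job start against a missing directory.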
