Java code or Oozie

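I am new to Hadoop, so I have some doubts about what to do in the following case. I have an algorithm that involves running several different jobs, and sometimes running a single job multiple times (in a loop).

How should I implement this: with Oozie, or with Java code? I was looking through the Mahout code and found the following in the ClusterIterator class: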

    public static void iterateMR(Configuration conf, Path inPath, Path priorPath, Path outPath, int numIterations)
        throws IOException, InterruptedException, ClassNotFoundException {
      ClusteringPolicy policy = ClusterClassifier.readPolicy(priorPath);
      Path clustersOut = null;
      int iteration = 1;
      while (iteration <= numIterations) {
        conf.set(PRIOR_PATH_KEY, priorPath.toString());
        String jobName = "Cluster Iterator running iteration " + iteration + " over priorPath: " + priorPath;
        Job job = new Job(conf, jobName);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(ClusterWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(ClusterWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(CIMapper.class);
        job.setReducerClass(CIReducer.class);
        FileInputFormat.addInputPath(job, inPath);
        clustersOut = new Path(outPath, Cluster.CLUSTERS_DIR + iteration);
        priorPath = clustersOut;
        FileOutputFormat.setOutputPath(job, clustersOut);
        job.setJarByClass(ClusterIterator.class);
        if (!job.waitForCompletion(true)) {
          throw new InterruptedException("Cluster Iteration " + iteration + " failed processing " + priorPath);
        }
        ClusterClassifier.writePolicy(policy, clustersOut);
        FileSystem fs = FileSystem.get(outPath.toUri(), conf);
        iteration++;
        if (isConverged(clustersOut, conf, fs)) {
          break;
        }
      }
      Path finalClustersIn = new Path(outPath, Cluster.CLUSTERS_DIR + (iteration - 1) + Cluster.FINAL_ITERATION_SUFFIX);
      FileSystem.get(clustersOut.toUri(), conf).rename(clustersOut, finalClustersIn);
    }

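So they have a loop that runs an MR job on each iteration. Is this a good approach? I know that Oozie is meant for DAGs and can also be used with other components such as Pig, but should I consider using it for something like this?

If I want to run a clustering algorithm multiple times (using a specific driver), should I do it in a loop, or use Oozie?

Thanks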

If you only want to run MapReduce jobs, you can consider the following approaches.

Chain the MR jobs using the MapReduce JobControl API:

http://hadoop.apache.org/docs/r2.5.0/api/org/apache/hadoop/mapreduce/lib/jobcontrol/JobControl.html
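As a rough illustration of the JobControl API linked above, the sketch below wires two jobs so that the second one is started only after the first has succeeded. The class name `ChainedJobsDriver`, the helper names `buildChain` and `runChain`, and the group name are made up for this example; the two `Job` objects are assumed to be fully configured elsewhere (mapper, reducer, input and output paths).

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Hypothetical driver class, for illustration only.
public class ChainedJobsDriver {

    /** Registers both jobs and declares that 'second' depends on 'first'. */
    static JobControl buildChain(Job first, Job second) throws IOException {
        ControlledJob step1 = new ControlledJob(first, null);
        ControlledJob step2 = new ControlledJob(second, null);
        step2.addDependingJob(step1); // step2 waits for step1

        JobControl control = new JobControl("two-step-chain");
        control.addJob(step1);
        control.addJob(step2);
        return control;
    }

    /** Runs the chain on a background thread; returns true if no job failed. */
    static boolean runChain(JobControl control) throws InterruptedException {
        Thread runner = new Thread(control, "job-control");
        runner.setDaemon(true);
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000); // poll until every job has finished or failed
        }
        List<ControlledJob> failed = control.getFailedJobList();
        control.stop();
        return failed.isEmpty();
    }
}
```

A driver would typically call `runChain(buildChain(job1, job2))` and turn the result into its exit code. This fits a fixed set of dependent jobs; a loop whose length depends on a convergence test is usually easier to express with plain `waitForCompletion` calls in the driver.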

Or submit multiple MR jobs from a single driver class:

```
Job job1 = Job.getInstance(getConf());
job1.waitForCompletion(true);

if (job1.isSuccessful()) {
    // Start another job with a different Mapper.
    // Change the configuration as needed.
    Job job2 = Job.getInstance(getConf());
}
```
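The same single-driver idea extends to the loop from the question. The sketch below shows only the control flow, using the plain JDK: `IterationStep` and `ConvergenceCheck` are hypothetical stand-ins for "configure a `Job` and call `waitForCompletion`" and for a convergence test such as Mahout's `isConverged`, and the `clusters-` directory naming is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an iterative driver; no Hadoop classes are used here.
public class IterativeDriverSketch {

    /** Stand-in for configuring and running one MapReduce job. */
    interface IterationStep {
        // Returns true if the job succeeded, like Job.waitForCompletion(true).
        boolean run(String inputDir, String outputDir) throws Exception;
    }

    /** Stand-in for a convergence test on one iteration's output. */
    interface ConvergenceCheck {
        boolean isConverged(String outputDir) throws Exception;
    }

    /** Feeds each iteration's output into the next; returns the last output directory. */
    static String iterate(String priorDir, String outRoot, int maxIterations,
                          IterationStep step, ConvergenceCheck check) throws Exception {
        String current = priorDir;
        for (int i = 1; i <= maxIterations; i++) {
            String next = outRoot + "/clusters-" + i;
            if (!step.run(current, next)) {
                throw new IllegalStateException("Iteration " + i + " failed reading " + current);
            }
            current = next;
            if (check.isConverged(current)) {
                break; // stop early once the output is stable
            }
        }
        return current;
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        String last = iterate("prior", "out", 10,
            (in, out) -> { log.add(in + " -> " + out); return true; },
            out -> out.endsWith("-3")); // pretend iteration 3 converges
        System.out.println(log);
        System.out.println(last); // prints "out/clusters-3"
    }
}
```

In a real driver the `IterationStep` body would build the `Job`, set the input and output paths, and return the result of `waitForCompletion(true)`.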

If you have a complex DAG, or a workflow that involves multiple ecosystem tools such as Hive and Pig, then Oozie is a good fit.
