Specifying my own input format for a streaming job

I defined my own input format as follows, to prevent files from being split:

import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;
public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

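I compiled this with Eclipse into a class file, NSTextInputFormat.class, and copied that file to the client from which the job is launched. I launch the job with the following command, passing the class above as the inputformat.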

hadoop jar $HADOOP_HOME/hadoop-streaming.jar -Dmapred.job.queue.name=unfunded -input 24222910/framefile -input 24225109/framefile -output Output -inputformat NSTextInputFormat -mapper ExtractHSV -file ExtractHSV -file NSTextInputFormat.class -numReduceTasks 0

This fails with: -inputformat : class not found : NSTextInputFormat Streaming Job Failed!

I set the PATH and CLASSPATH variables to the directory containing NSTextInputFormat.class, but it still does not work. Any pointers on this would be helpful.

There are a few gotchas here if you are not familiar with Java.

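-inputformat (and the other command-line options that expect a class name) expects a fully qualified class name; otherwise it expects to find the class in some org.apache.hadoop... namespace. So you have to include a package name in your .java file: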

package org.example.hadoop;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapred.TextInputFormat;
public class NSTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}

And then specify the full name on the command line:

-inputformat org.example.hadoop.NSTextInputFormat
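To illustrate why the bare name is not enough, here is a minimal sketch of how the JVM resolves class names. It is not Hadoop's own lookup code; QualifiedNameDemo and canLoad are hypothetical names, and java.util.ArrayList stands in for your input format class.

```java
public class QualifiedNameDemo {
    /** Returns true if the JVM can load a class by exactly this name. */
    public static boolean canLoad(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A simple (unqualified) name does not identify a class in a package.
        System.out.println("ArrayList: " + canLoad("ArrayList"));
        // The fully qualified name (package + class) does.
        System.out.println("java.util.ArrayList: " + canLoad("java.util.ArrayList"));
    }
}
```

The same rule applies to a class you write yourself: once it lives in package org.example.hadoop, the name to pass around is org.example.hadoop.NSTextInputFormat.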

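When you build the jar file, the .class file also has to sit in a directory structure that mirrors the package name. I'm sure this is Java Packaging 101, but if you're using Hadoop Streaming then you may not be all that familiar with Java in the first place. Passing the -d option to javac tells it to write the compiled .class files under the given directory, in subdirectories that match the package name.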

javac -classpath `hadoop classpath` -d ./output NSTextInputFormat.java

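The compiled .class file will be written to ./output/org/example/hadoop/NSTextInputFormat.class. You will need to create the output directory yourself, but the other subdirectories will be created for you. You can then create the jar file like this: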

jar cvf myjar.jar -C ./output/ .

You should see output similar to this:

added manifest
adding: org/(in = 0) (out= 0)(stored 0%)
adding: org/example/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/(in = 0) (out= 0)(stored 0%)
adding: org/example/hadoop/NSTextInputFormat.class(in = 372) (out= 252)(deflated 32%)
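If you would rather check the jar programmatically than read the listing, something like the following sketch could work. JarEntryCheck, entryPathFor and containsClass are hypothetical names introduced here for illustration; the default arguments assume the myjar.jar and package name used above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.jar.JarFile;

public class JarEntryCheck {
    /** Maps a fully qualified class name to the entry path expected inside a jar. */
    public static String entryPathFor(String qualifiedName) {
        return qualifiedName.replace('.', '/') + ".class";
    }

    /** Returns true if the jar at jarPath has an entry for the given class. */
    public static boolean containsClass(String jarPath, String qualifiedName) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryPathFor(qualifiedName)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: the jar and class name from this answer.
        String jarPath = args.length > 0 ? args[0] : "myjar.jar";
        String className = args.length > 1 ? args[1] : "org.example.hadoop.NSTextInputFormat";

        System.out.println("expected entry: " + entryPathFor(className));
        if (!Files.exists(Paths.get(jarPath))) {
            System.out.println("no such jar: " + jarPath);
            return;
        }
        System.out.println(containsClass(jarPath, className) ? "found" : "missing");
    }
}
```

If this reports "missing", the usual cause is that the .class file was added at the top level of the jar rather than under org/example/hadoop/.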

Bundle the input format and mapper classes into a jar (myjar.jar) and add the -libjars myjar.jar option to the command line:

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
   -libjars myjar.jar \
   -Dmapred.job.queue.name=unfunded \
   -input 24222910/framefile \
   -input 24225109/framefile \
   -output Output \
   -inputformat org.example.hadoop.NSTextInputFormat \
   -mapper ExtractHSV \
   -numReduceTasks 0
