我的MapReduce应用会统计Hive表中字段值的使用情况。在包括/usr/lib/hadood
, /usr/lib/hive
和/usr/lib/hcatalog
目录中的所有jar之后,我设法从Eclipse中构建并运行它。它的工作原理。
在经历了许多挫折之后,我也成功地将其编译并作为Maven项目运行:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.bigdata.hadoop</groupId>
<artifactId>FieldCounts</artifactId>
<packaging>jar</packaging>
<name>FieldCounts</name>
<version>0.0.1-SNAPSHOT</version>
<url>http://maven.apache.org</url>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.hcatalog</groupId>
<artifactId>hcatalog-core</artifactId>
<version>0.11.0</version>
</dependency>
</dependencies>
</project>
要从命令行运行作业,必须使用以下脚本:
#!/bin/sh
export LIBJARS=/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar,/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar,/usr/lib/hive/lib/libfb303-0.9.0.jar,/usr/lib/hive/lib/jdo-api-3.0.1.jar,/usr/lib/hive/lib/antlr-runtime-3.4.jar,/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar,/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:.:/usr/lib/hcatalog/share/hcatalog/hcatalog-core.jar:/usr/lib/hive/lib/hive-exec-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/hive-metastore-0.12.0.2.0.6.1-101.jar:/usr/lib/hive/lib/libfb303-0.9.0.jar:/usr/lib/hive/lib/jdo-api-3.0.1.jar:/usr/lib/hive/lib/antlr-runtime-3.4.jar:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.1.jar:/usr/lib/hive/lib/datanucleus-core-3.2.2.jar
hadoop jar FieldCounts-0.0.1-SNAPSHOT.jar com.bigdata.hadoop.FieldCounts -libjars ${LIBJARS} simple simpout
现在Hadoop创建并启动下一个失败的作业,因为Hadoop找不到Map类:
14/03/26 16:25:58 INFO mapreduce.Job: Running job: job_1395407010870_0007
14/03/26 16:26:07 INFO mapreduce.Job: Job job_1395407010870_0007 running in uber mode : false
14/03/26 16:26:07 INFO mapreduce.Job: map 0% reduce 0%
14/03/26 16:26:13 INFO mapreduce.Job: Task Id : attempt_1395407010870_0007_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1720)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:721)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.FieldCounts$Map not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1626)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1718)
... 8 more
为什么会这样?Job jar包含所有类,包括Map:
jar tvf FieldCounts-0.0.1-SNAPSHOT.jar
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
0 Wed Mar 26 14:29:58 MSK 2014 com/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
[hdfs@localhost target]$ jar tvf FieldCounts-0.0.1-SNAPSHOT.jar
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/
121 Wed Mar 26 15:51:04 MSK 2014 META-INF/MANIFEST.MF
0 Wed Mar 26 14:29:58 MSK 2014 com/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/
0 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/
3992 Fri Mar 21 17:29:22 MSK 2014 hive-site.xml
4093 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts.class
2961 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Reduce.class
1621 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/TableFieldValueKey.class
4186 Wed Mar 26 14:29:58 MSK 2014 com/bigdata/hadoop/FieldCounts$Map.class
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/
0 Wed Mar 26 15:51:06 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/
1030 Wed Mar 26 14:28:22 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.xml
123 Wed Mar 26 14:30:02 MSK 2014 META-INF/maven/com.bigdata.hadoop/FieldCounts/pom.properties
怎么了?我应该把Map和Reduce类放在单独的文件中吗?
MapReduce代码:
package com.bigdata.hadoop;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;
import org.apache.hcatalog.mapreduce.*;
import org.apache.hcatalog.data.*;
import org.apache.hcatalog.data.schema.*;
import org.apache.log4j.Logger;
public class FieldCounts extends Configured implements Tool {
public static class Map extends Mapper<WritableComparable, HCatRecord, TableFieldValueKey, IntWritable> {
static Logger logger = Logger.getLogger("com.foo.Bar");
static boolean firstMapRun = true;
static List<String> fieldNameList = new LinkedList<String>();
/**
* Return a list of field names not containing `id` field name
* @param schema
* @return
*/
static List<String> getFieldNames(HCatSchema schema) {
// Filter out `id` name just once
if (firstMapRun) {
firstMapRun = false;
List<String> fieldNames = schema.getFieldNames();
for (String fieldName : fieldNames) {
if (!fieldName.equals("id")) {
fieldNameList.add(fieldName);
}
}
} // if (firstMapRun)
return fieldNameList;
}
@Override
protected void map( WritableComparable key,
HCatRecord hcatRecord,
//org.apache.hadoop.mapreduce.Mapper
//<WritableComparable, HCatRecord, Text, IntWritable>.Context context)
Context context)
throws IOException, InterruptedException {
HCatSchema schema = HCatBaseInputFormat.getTableSchema(context.getConfiguration());
//String schemaTypeStr = schema.getSchemaAsTypeString();
//logger.info("******** schemaTypeStr ********** : "+schemaTypeStr);
//List<String> fieldNames = schema.getFieldNames();
List<String> fieldNames = getFieldNames(schema);
for (String fieldName : fieldNames) {
Object value = hcatRecord.get(fieldName, schema);
String fieldValue = null;
if (null == value) {
fieldValue = "<NULL>";
} else {
fieldValue = value.toString();
}
//String fieldNameValue = fieldName+"."+fieldValue;
//context.write(new Text(fieldNameValue), new IntWritable(1));
TableFieldValueKey fieldKey = new TableFieldValueKey();
fieldKey.fieldName = fieldName;
fieldKey.fieldValue = fieldValue;
context.write(fieldKey, new IntWritable(1));
}
}
}
public static class Reduce extends Reducer<TableFieldValueKey, IntWritable,
WritableComparable, HCatRecord> {
protected void reduce( TableFieldValueKey key,
java.lang.Iterable<IntWritable> values,
Context context)
//org.apache.hadoop.mapreduce.Reducer<Text, IntWritable,
//WritableComparable, HCatRecord>.Context context)
throws IOException, InterruptedException {
Iterator<IntWritable> iter = values.iterator();
int sum = 0;
// Sum up occurrences of the given key
while (iter.hasNext()) {
IntWritable iw = iter.next();
sum = sum + iw.get();
}
HCatRecord record = new DefaultHCatRecord(3);
record.set(0, key.fieldName);
record.set(1, key.fieldValue);
record.set(2, sum);
context.write(null, record);
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
args = new GenericOptionsParser(conf, args).getRemainingArgs();
// To fix Hadoop "META-INFO" (http://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file)
conf.set("fs.hdfs.impl",
org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl",
org.apache.hadoop.fs.LocalFileSystem.class.getName());
// Get the input and output table names as arguments
String inputTableName = args[0];
String outputTableName = args[1];
// Assume the default database
String dbName = null;
Job job = new Job(conf, "FieldCounts");
HCatInputFormat.setInput(job,
InputJobInfo.create(dbName, inputTableName, null));
job.setJarByClass(FieldCounts.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
// An HCatalog record as input
job.setInputFormatClass(HCatInputFormat.class);
// Mapper emits TableFieldValueKey as key and an integer as value
job.setMapOutputKeyClass(TableFieldValueKey.class);
job.setMapOutputValueClass(IntWritable.class);
// Ignore the key for the reducer output; emitting an HCatalog record as
// value
job.setOutputKeyClass(WritableComparable.class);
job.setOutputValueClass(DefaultHCatRecord.class);
job.setOutputFormatClass(HCatOutputFormat.class);
HCatOutputFormat.setOutput(job,
OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
System.err.println("INFO: output schema explicitly set for writing:"
+ s);
HCatOutputFormat.setSchema(job, s);
return (job.waitForCompletion(true) ? 0 : 1);
}
public static void main(String[] args) throws Exception {
String classpath = System.getProperty("java.class.path");
System.out.println("*** CLASSPATH: "+classpath);
int exitCode = ToolRunner.run(new FieldCounts(), args);
System.exit(exitCode);
}
}
我发现问题是在MapReduce jar所在的目录权限中。这个jar是在一个普通用户(而不是hdfs用户)的主目录下构建的。只要这个MRD作业将其工作结果直接输出到Hive表中,它就应该在hdfs
用户下运行。如果这样的作业是在普通用户下运行,它没有权限将数据写入Hive表!
另一方面,CentOS中普通用户的主目录具有700权限。因此,当您在用户下运行hadoop jar ...
命令,而不是拥有这个主目录的用户访问MRD jar时,在Hadoop加载类的过程中会被拒绝。这就是为什么在hdfs
用户下,这个作业的结果是java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.bigdata.hadoop.MyMap not found
。
递归地将MRD jar构建的主目录权限从700更改为755,解决了这个问题。
然而,一个更重要的问题仍然存在:如何在普通用户下运行作业,使其具有将数据写入Hive表的权限?
我发现了以下内容
为嵌入Java webapp的客户端设置hadoop系统用户
允许我以预期的hadoop用户连接,但jar尚未上传或执行…ClassNotFound仍然