I have been trying to start a MapReduce job on my cluster with the following command:
bin/hadoop jar myjar.jar MainClass /user/hduser/input /user/hduser/output
But I keep getting the following error over and over again, until eventually the connection is refused:
13/08/08 00:37:16 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54310. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
Then I checked with netstat to see whether the services are listening on the correct ports:
~> sudo netstat -plten | grep java
tcp 0 0 10.1.1.4:54310 0.0.0.0:* LISTEN 10022 38365 11366/java
tcp 0 0 10.1.1.4:54311 0.0.0.0:* LISTEN 10022 32164 11829/java
Now I notice that my services are listening on 10.1.1.4:54310, which is the IP of my master, but the 'hadoop jar' command seems to connect to 127.0.0.1 (localhost, which is the same machine) and therefore cannot find the service. Is there a way to force 'hadoop jar' to connect to 10.1.1.4 instead of 127.0.0.1?

My NameNode, DataNode, JobTracker, TaskTracker, ... are all running. I even checked the DataNode and TaskTracker on the slaves and everything seemed fine. I can open the web UI on the master and it shows my cluster as online.

I suspect the problem is DNS-related, because the 'hadoop jar' command seems to find the correct port but always uses the 127.0.0.1 address rather than 10.1.1.4.
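For reference, a minimal sketch of where a Hadoop 1.x client normally gets the NameNode address from (CheckFsConfig is just a hypothetical diagnostic class, not part of my job): new Configuration() loads core-site.xml from the classpath, so fs.default.name should come back as hdfs://master:54310 rather than anything pointing at localhost.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class CheckFsConfig {
    public static void main(String[] args) throws Exception {
        // new Configuration() picks up core-site.xml (and the other *-site.xml
        // files) from the classpath on the machine running the command.
        Configuration conf = new Configuration();
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));

        // FileSystem.get(conf) dials whatever host:port that URI names.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("connected to " + fs.getUri());
    }
}

If this prints the master URI but the job submission still dials 127.0.0.1, the localhost reference must be coming from somewhere other than the *-site.xml files.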
UPDATE
Configuration in core-site.xml:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://master:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Configuration in mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>
Configuration in hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Although this looked like a DNS problem, it was actually Hadoop resolving a reference to localhost inside the code. I was deploying someone else's jar and assumed it was correct. On closer inspection I found the reference to localhost, changed it to master, and that solved my problem.
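For anyone hitting the same thing, this is roughly the pattern that caused it. The sketch below is not the actual jar I was deploying; the identity mapper/reducer and job setup are placeholders, and only the hard-coded localhost override is the point:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MainClass {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MainClass.class);
        conf.setJobName("example");

        // The bug: a hard-coded override like this wins over whatever
        // core-site.xml / mapred-site.xml say on the cluster, so the
        // client keeps dialing 127.0.0.1:54310 no matter what is configured.
        // conf.set("fs.default.name", "hdfs://localhost:54310");
        // conf.set("mapred.job.tracker", "localhost:54311");

        // The fix: remove the override so the classpath configuration is
        // used, or point it at the master explicitly.
        conf.set("fs.default.name", "hdfs://master:54310");
        conf.set("mapred.job.tracker", "master:54311");

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Dropping the hard-coded override (or pointing it at master, as above) lets the values from core-site.xml and mapred-site.xml take effect.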